Skip to main content Skip to complementary content

Validating data using data quality rules and filtering the invalid data

The Job in this scenario uses:
  • The tDataQualityRules component to validate data using a data quality rule from Talend Cloud Data Stewardship.
  • The tFilterRow component to filter the non-executable data.
The Job uses the following data quality rule:
Configuration of the data quality rule.

You can download and extract the following file: scholarship_programs.zip.

Setting up the Job

Procedure

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataQualityRules, tExtractJSONFields, tFilterRow and three tLogRow components.
  2. Using Main links, connect:
    • tFixedFlowInput to tDataQualityRules
    • tDataQualityRules to one tLogRow
    • tExtractJSONFields to tFilterRow
  3. Using Filter links, connect tFilterRow to another tLogRow.
  4. Using Reject links, connect:
    • tDataQualityRules to tExtractJSONFields
    • tFilterRow to the last tLogRow
    A 7-component Job to validate data using the tDataQualityRules component.
    Information noteTip: If you want valid data only, connect tDataQualityRules to only one tLogRow component using a Main link. You do not need the tExtractJSONFields and tFilterRow components.

Configuring the tFixedFlowInput component

Procedure

  1. Double-click tFixedFlowInput to open its Basic settings view.
  2. Select Built-In as Schema and click […] next to Edit schema.
  3. To add columns, click Add icon and click OK.
    Schema of the tFixedFlowInput component.
  4. Select Use Inline Content (delimited file).
  5. Enter your data. You can use the data from the dataset in the following file: scholarship_programs.zip.
    Configuration of the tFixedFlowInput component.

Configuring the tDataQualityRules component

Before you begin

About this task

You need the Rules - View permission to retrieve the rules. For more information, see the predefined user roles for the app you are using.

Procedure

  1. Select Built-in as Output schema.
  2. To retrieve the schema from the input component, click Sync columns.
  3. From the Application drop-down list, select Data Stewardship.
  4. Enter the URL to Talend Cloud Data Stewardship: "https://tds.us.cloud.talend.com/rulerepository/api/v1".
    The URL depends on your Cloud region. See Talend Cloud regions and URLs.
  5. To enter your personal access token, click [...].
  6. To retrieve the data quality rules, click Refresh.
    Refresh button next to the DQ rule library timestamp field.
    If you update the data quality rules in Talend Cloud Data Stewardship, click Refresh to retrieve the latest version.
  7. Click [...] next to Configure DQ rules.
  8. Select the rule and associate the rule variables to the input columns.
    Configuration window to associate the variables.
    You can associate the data types from Talend Studio with some data types from Talend Cloud Data Stewardship. See Associating data types from the properties.
    If no rules or input columns are available, verify that:
    • Data quality rules have been retrieved in DQ rule library timestamp.
    • The input schema is correct.
  9. To validate the configuration, click OK.

Configuring the tExtractJSONFields component

About this task

The tExtractJSONFields component lets you get more readable data so you can easily filter them using the tFilterRow component.

Procedure

  1. Select Built-In as Property Type and Schema.
  2. Click [...] next to Edit schema.
    The INVALID_STATUS column has been added to the input schema. You can rename the column in the output schema.
    Example of input and output schemas.
  3. Select JsonPath and the API version 2.1.0.
  4. Enter the wildcard * in the Loop Jsonpath query field to select all elements.
  5. Select INVALID_STATUS from the JSON Field drop-down list.
  6. Define the JSON query in the Mapping table.
    The columns in the schema are listed in the Column field of the Mapping table.
  7. In the Json query column of the Mapping table, enter executionResults[0].status for Status.
    executionResults[0].status is the path to status where [0] represents the execution details of the first rule. If you need the execution details of the second rule, use [1], of the third rule, use [2] and so on. This makes easier to filter the data using the tFilterRow component.

    Example

    The INVALID_STATUS is: [{"executionResults":[{"ruleName":"ScholarshipProgram_Else","inputColumn":["EffectiveOn","Code","Program"],"status":"Invalid"}]}].

    tExtractJSONFields transforms it into Invalid.

    Configuration of the tExtractJSONFields component.

Configuring the tFilterRow component

About this task

The tFilterRow component in this scenario creates two categories:
  • NE: The rule cannot be executed on the data.
  • Invalid: The data fulfill the condition but not the validation expression.

Procedure

  1. In Conditions, click Add icon.
  2. In InputColumn, select Status.
  3. In Value, enter "NE".
    Information noteRemember: If you add more than one column, define the logical operator above the table: And/Or. By default, And is selected.

Configuring the tLogRow components

Procedure

  1. Double-click the tLogRow components to open their Basic settings view.
  2. Select Built-in as Schema and click Sync columns.
  3. In the Mode area, select Table (print values in cells of a table).
    Configuration of the tLogRow components.

Saving and executing the Job

Procedure

Save your Job and press F6 to execute it.

Results

The tLogRow from the Main link returns the students whose:
  • Status code is 2632 and the date is 2021-09-01.

    OR

  • Status code is not 2632 and the program ends with 10AB or 10AC.
.-------+----+-------+-----------+-----------.
|                    main                    |
|=------+----+-------+-----------+----------=|
|ID     |Code|Program|EffectiveOn|Nationality|
|=------+----+-------+-----------+----------=|
|1002145|2632|2310AB |2021-09-01 |Canada     |
|1002146|1132|3610AC |2021-10-01 |US         |
|1002149|2632|3711AB |2021-09-01 |US         |
|1002150|2632|2310AB |2021-09-01 |US         |
|1002151|2632|3711AB |2021-09-01 |US         |
|1002152|2632|2310AB |2021-09-01 |US         |
|1002153|2632|3711AB |2021-09-01 |US         |
|1002155|2675|3610AC |2021-10-01 |US         |
|1002156|2675|2310AB |2021-10-01 |Canada     |
|1002158|2632|2310AB |2021-09-01 |US         |
|1002159|1132|2310AB |2021-10-01 |US         |
|1002162|2675|2310AB |2021-10-01 |US         |
|1002163|2632|3610AC |2021-09-01 |US         |
'-------+----+-------+-----------+-----------'
The invalid data is filtered into two tables.
.-------+----+-------+-----------+-----------+------.
|                        NE                         |
|=------+----+-------+-----------+-----------+-----=|
|ID     |Code|Program|EffectiveOn|Nationality|Status|
|=------+----+-------+-----------+-----------+-----=|
|1002161|abcd|3611AC |2021-09-01 |US         |NE    |
'-------+----+-------+-----------+-----------+------'

The NE status means that the rule cannot be executed on the data. In this scenario, the status code is abcd but the rule condition is is equal to 2632.

The other invalid data mean that:
  • Status code is 2632 but the date is not 2021-09-01

    or

  • Status code is not 2632 and the program does not end with 10AB or 10AC.
    .-------+----+-------+-----------+-----------+-------+----------------------------------.
    |                                     OtherInvalid                                      |
    |=------+----+-------+-----------+-----------+-------+---------------------------------=|
    |ID     |Code|Program|EffectiveOn|Nationality|Status |errorMessage                      |
    |=------+----+-------+-----------+-----------+-------+---------------------------------=|
    |1002147|2632|3610AC |2022-01-01 |Canada     |Invalid|Status.compareTo("NE") == 0 failed|
    |1002148|2632|2310AB |2022-01-01 |Canada     |Invalid|Status.compareTo("NE") == 0 failed|
    |1002154|2632|3610AC |2022-01-01 |Canada     |Invalid|Status.compareTo("NE") == 0 failed|
    |1002157|2632|3610AC |2022-01-01 |Canada     |Invalid|Status.compareTo("NE") == 0 failed|
    |1002160|2632|2310AB |2022-01-01 |Canada     |Invalid|Status.compareTo("NE") == 0 failed|
    |1002164|2675|3611AC |2021-09-01 |US         |Invalid|Status.compareTo("NE") == 0 failed|
    '-------+----+-------+-----------+-----------+-------+----------------------------------'
When you save the Job, a JAR file containing the rule definition is created in:
studio-path\configuration\.m2\repository\org\talend\libraries\rules-number

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!