Skip to main content Skip to complementary content

Validating data using data quality rules and dynamic schema

The Job in this scenario uses the tDataQualityRules component to validate data using a data quality rule from Talend Cloud Data Stewardship.

The Job uses a data quality rule in advanced mode:
if (purchase_price >= 2999) {discount_percentage == 0.2}
elseif (purchase_price >= 1999 && purchase_price < 2999) {discount_percentage == 0.15}
elseif (purchase_price >= 999 && purchase_price < 1999) {discount_percentage == 0.10}
elseif (purchase_price >= 599 && purchase_price < 999) {discount_percentage == 0.07}
else {discount_percentage == 0}
Information noteImportant: You can use a dynamic schema in the Standard component. Apache Spark components do not support it.

Setting up the Job

Procedure

  1. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tDataQualityRules, and two tLogRow components.
  2. Using Main links, connect:
    • tFileInputDelimited to tDataQualityRules
    • tDataQualityRules to one tLogRow
  3. Using Reject links, connect tDataQualityRules to the last tLogRow.
    A Job using the tFileInputDelimited, tDataQualityRules, and two tLogRow components.
    Information noteTip: If you want valid data only, connect tDataQualityRules to only one tLogRow component using a Main link. The second tLogRow component is here to show you the details of the invalid data.

Configuring the tFileInputDelimited component

Procedure

  1. Double-click tFileInputDelimited to open its Basic settings view.
  2. In the File name/Stream field, enter the file path between double quotes.
  3. In this example, the input data is:
    id;price1;discount1;price2;discount2
    1;100;0.2;2000;0.1
    2;1000;0.05;4000;0.2
    3;1999;0.15;4200;0.2
    4;5000;0.2;3000;0.2
    5;300;0;30;0
    6;6000;0.2;3000;0.1
    7;price;;;
  4. Select Built-In as Schema and click […] next to Edit schema.
  5. To add columns, click Add icon and click OK.
    Schema of the tFileInputDelimited component.

    Dynamic schema retrieves the columns which are undefined in the schema. That is why you do not need to define the discount1, price2, discount2 columns in the schema.

    Information noteImportant: You can use a dynamic schema in the Standard component. Apache Spark components do not support it.

Configuring the tDataQualityRules component

Before you begin

About this task

You need the Rules - View permission to retrieve the rules. For more information, see the predefined user roles for the app you are using.

Procedure

  1. Select Built-in as Output schema.
  2. To retrieve the schema from the input component, click Sync columns.
  3. From the Application drop-down list, select Data Stewardship.
  4. Enter the URL to Talend Cloud Data Stewardship: "https://tds.us.cloud.talend.com/".
    The URL depends on your Cloud region. See Talend Cloud regions and URLs.
  5. To enter your personal access token, click [...].
  6. To retrieve the data quality rules, click Refresh.
    Refresh button next to the DQ rule library timestamp field.
    If you update the data quality rules in Talend Cloud Data Stewardship, click Refresh to retrieve the latest version.
  7. Click [...] next to Configure DQ rules.
  8. Select the rule in DQ rule and associate the rule variables to the input columns.
    Configuration window to associate the variables.
    You can associate the data types from Talend Studio with some data types from Talend Cloud Data Stewardship. See Associating data types from the properties.
    If no rules or input columns are available, verify that:
    • Data quality rules have been retrieved in DQ rule library timestamp.
    • The input schema is correct.
    Information noteRestriction: You can associate a rule variable to one column only. Enter one column only in Dynamic column selection.
  9. Associate more variables by clicking the Add icon button on the bottom right of the wizard.
  10. Select the same rule in DQ rule and associate the rule variables to the input columns.
  11. In Dynamic column selection, enter the column names between double quotes.
    Configuration window to associate the variables.
    You can associate the data types from Talend Studio with some data types from Talend Cloud Data Stewardship. See Associating data types from the properties.
    If no rules or input columns are available, verify that:
    • Data quality rules have been retrieved in DQ rule library timestamp.
    • The input schema is correct.
    Information noteRestriction: You can associate a rule variable to one column only. Enter one column only in Dynamic column selection.
  12. To validate the configuration, click OK.

Configuring the tLogRow components

Procedure

  1. Double-click the tLogRow components to open their Basic settings view.
  2. Select Built-in as Schema and click Sync columns.
  3. In the Mode area, select Table (print values in cells of a table).
    Configuration of the tLogRow components.

Saving and executing the Job

Procedure

Save your Job and press F6 to execute it.

Results

The tLogRow from the Main link returns the prices whose the discount correspond to the rule. In this example, it means that the discount is correct for price1 and price2.
.--+------+-----------------.
|         tLogRow_1         |
|=-+------+----------------=|
|id|price1|dynamic_column   |
|=-+------+----------------=|
|3 |1999  |0.15 - 4200 - 0.2|
|4 |5000  |0.2 - 3000 - 0.2 |
|5 |300   |0 - 30 - 0       |
'--+------+-----------------'
The invalid data is also filtered.
.--+------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------.
|                                                                                                                                       tLogRow_2                                                                                                                                       |
|=-+------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|id|price1|dynamic_column   |INVALID_STATUS                                                                                                                                                                                                                                             |
|=-+------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------=|
|1 |100   |0.2 - 2000 - 0.1 |[{"executionResults":[{"ruleName":"Discount percentage","inputColumn":["price1","dynamic_column.discount1"],"status":"Invalid"},{"ruleName":"Discount percentage","inputColumn":["dynamic_column.price2","dynamic_column.discount2"],"status":"Invalid"}]}]|
|2 |1000  |0.05 - 4000 - 0.2|[{"executionResults":[{"ruleName":"Discount percentage","inputColumn":["price1","dynamic_column.discount1"],"status":"Invalid"},{"ruleName":"Discount percentage","inputColumn":["dynamic_column.price2","dynamic_column.discount2"],"status":"Valid"}]}]  |
|6 |6000  |0.2 - 3000 - 0.1 |[{"executionResults":[{"ruleName":"Discount percentage","inputColumn":["price1","dynamic_column.discount1"],"status":"Valid"},{"ruleName":"Discount percentage","inputColumn":["dynamic_column.price2","dynamic_column.discount2"],"status":"Invalid"}]}]  |
|7 |price | -  -            |[{"executionResults":[{"ruleName":"Discount percentage","inputColumn":["price1","dynamic_column.discount1"],"status":"NE"},{"ruleName":"Discount percentage","inputColumn":["dynamic_column.price2","dynamic_column.discount2"],"status":"NE"}]}]          |
'--+------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------'

The NE status means that the rule cannot be executed on the data. In this scenario, the price value cannot be validated because the rule is set to validate numbers.

The other invalid data mean that the discount applied to price1 and/or price2 is incorrect.

For example, the row #2: The discount 0.05 is incorrect for the price 1000 but the discount 0.2 is correct for the price 4000.

When you save the Job, a JAR file containing the rule definition is created in:
studio-path\configuration\.m2\repository\org\talend\libraries\rules-number

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!