Matching customer data through multiple passes

The Job in this scenario groups similar customer records by running through two subsequent matching passes using the tMatchGroup components and outputs the calculated matches in groups.

Each pass provides its matches to the pass that follows in order for the latter to add more matches identified with new rules and blocking keys.

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Note: This scenario was made with the Standard component but you can use the Apache Spark Batch component.

About this task

In this Job:

The tMysqlInput component connects to the customer records to be processed.
Each of the tGenKey components defines a way to partition data records. The first key partitions data to many groups and the second key creates fewer groups that overlaps the previous blocks depending on the blocking key definition.
The tMap component renames the key generated by the second tGenKey component.
The first tMatchGroup processes the partitions defined by the first tGenKey, and the second tMatchGroup processes those defined by the second tGenKey.
Information noteImportant: The two tMatchGroup components must have the same schema.
The tLogRow component presents the matching results after the two passes.

Setting up the Job

About this task

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the Repository, see Managing metadata in Talend Studio.

Procedure

In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schema.
Drop the database table onto the design workspace. The input table used in this scenario is called customer.
A dialog box is displayed with a list of components.
Select the relevant database component, tMysqlInput in this example, and then click OK.
Drop two tGenKey components, two tMatchGroup components, a tMap and a tLogRow components from Palette onto the design workspace.
Link the input component to the tGenKey and tMap components using Main links.
In the two tMatchGroup components, select the Output distance details check boxes in the Advanced settings view of both components before linking them together.
This will provide the MATCHING_DISTANCES column in the output schema of each tMatchGroup.

If the two tMatchGroup components are already linked to each other, you must select the Output distance details check box in the second component in the Job flow first otherwise you may have an issue.
Link the two tMatchGroup components and the tLogRow component using Main links.
If needed, give the components specific labels to reflect their usage in the Job.
For further information about how to label a component, see Component View tab.

Connecting to the input data

Procedure

Double-click tMysqlInput to open its Component view.

The property fields for tMysqlInput are automatically filled in. If you do not define your input schema locally in the repository, fill in the details manually after selecting Built-in in the Schema and Property Type lists.

The input table used in this scenario is called customer.
Modify the query in the Query box to select only the columns you want to match: account_name, lname, fname, mi, address1, city, state_province and postal_code.

Configuring the key generation for the first pass

Procedure

Double-click the first tGenKey to open the Component view.
Click and import blocking keys from match rules created and tested in the Profiling perspective and use them in your Job. Otherwise, define the blocking key parameters as described in the below steps.
Under the Algorithm table, click the [+] button to add two rows in the table.
In the column column, click the newly added row and select from the list the column to process using an algorithm.
In this example, select lname.
Do the same on the second row to select postal_code.
In the pre-algorithm column, click the newly added row and select from the list the pre-algorithm to apply to the corresponding column.
In this example, select remove diacritical marks and convert to upper case to remove any diacritical mark and converts the fields of the lname column to upper case.

This conversion does not change your raw data.
In the algorithm column, click the newly added row and select from the list the algorithm to apply to the corresponding column.
In this example, select N first characters of each word.
If you select the Show help check box, you can display instructions on how to set algorithms/options parameters.
Do the same for the second row on the algorithm column to select first N characters of the string.
Click in the Value column next to the algorithm column and enter the value for the selected algorithm, when needed.
In this scenario, enter 1 for both rows. The first letter of each field in the corresponding columns will be used to generate the key.

Make sure to set a value for the algorithm which need one, otherwise you may have a compilation error when you run the Job.

Configuring the key generation for the second pass

Procedure

Double-click the second tGenKey to open the Component view.
In the Algorithm table, define the column you want to use to partition data, account_num in this component. Select the first N characters of the string algorithm and set the value to 1 in the Value column.
Each of the two tGenKey components will generate a read_only T_GEN_KEY column in the output schema. You must rename one of theT_GEN_KEY columns to stop them from overwriting each other.
Double-click the tMap component to open its editor.
In the Schema editor, copy the columns from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.
In the top part of the editor, drop all columns from the input table to the output table.
Click Ok to save data transformation and close the editor.
In the tGenKey basic settings, click the […] button to verify that the two generated keys are named differently in the output schema.

Configuring the first pass

Procedure

In the basic settings of the tMatchGroup labeled pass1, select Simple VSR from the Matching Algorithm list.
In this scenario, the match rule is based on the VSR algorithm.
Click the Preview button to display the Configuration Wizard.
Click and import matching keys from the match rules created and tested in the Profiling perspective and use them in your Job. Otherwise, define the matching key parameters as described in the below steps.
It is important to import or define the same type of the rule selected in the basic settings of the component, otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms.
In the Key definition table, click the [+] button to add the column(s) on which you want to do the matching operation, lname in this scenario.

Information noteNote: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.
For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
Select the Jaro-Winkler algorithm in the Matching Function column.
From the Tokenized measure list, select Any order.
Set Weight to 1 and in the Handle Null column, select the null operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
Click the [+] button below the Blocking Selection table to add one row in the table.
Click in the line and select from the list the column to use as a blocking value, T_GEN_KEY in this example.
Using a blocking value reduces the number of pairs of records that needs to be examined. The input data is partitioned into exhaustive blocks based on the functional key. This will decrease the number of pairs to compare, as comparison is restricted to record pairs within each block.
If required, click Edit schema to open the schema editor and see the schema retrieved from the previous component in the Job.
Click the Advanced settings tab and select the Sort the output data by GID check box to arrange the output data by their group IDs.
Select the Deactivate matching computation when opening the wizard check box if you do not want to run the match rules the next time you open the wizard.

Configuring the second pass

Procedure

In the basic settings of the tMatchGroup labeled pass2, select Simple VSR from the Matching Algorithm list.
In this scenario, the match rule is based on the VSR algorithm.
Click the Preview button to display the Configuration Wizard.
If this component does not have the same schema of the preceding component, a warning icon appears. If so, click the Sync columns button to retrieve the schema from the preceding one and once done, the warning icon disappears.
In the Key Definition table, click the [+] button to add the column(s) on which you want to do the matching operation, lname in this scenario.

Information noteNote: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.
For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
Select the Jaro-Winkler algorithm in the Matching Function column.
Set Weight to 1 and in the Handle Null column, select the null operator you want to use to handle null attributes in the column, Null Match Null in this scenario.
Click the [+] button below the Blocking Selection table to add one row in the table.
Click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY1 in this example.
Click the Advanced settings tab and select the Multi-pass check box. This option enables tMatchGroup to receive data sets from the tMatchGroup that precedes it in the Job.
In the Advanced settings view, select the Sort the output data by GID check box to arrange the output data by their group IDs.
Select the Deactivate matching computation when opening the wizard check box if you do not want to run the match rules the next time you open the wizard.

Executing the Job and showing the results on the console

About this task

In order to show the match groups created after the first pass and compare them with the groups created after the second pass, you must modify the Job as the following:

Use a tReplicate component to replicate the input flow you want to process as shown in the above figure. Use a copy/paste operation to create the two parts of the Job.
Keep only the first pass, tMatchGroup component, in the upper part of the Job and show the match results in a tLogRow component.
Use two passes in the lower part of the Job and show the final match results in a tLogRow component.

Procedure

Double-click each of the tLogRow components to open the Component view and in the Mode area, select the Table (print values in cells of a table) option.
Save your Job and press F6 to execute it.
The results after the first pass read as follows:

The results after the second pass read as follows:

When you compare, for example, the customer name Wilson from the results of the two passes, you will find that more customers using the last name Wilson are grouped together after the second pass.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here