Importing match rules from the Talend Studio repository

From the tMatchGroup configuration wizard, you can import match keys from the match rules created and tested in the Profiling perspective. You can then use these imported matching keys in your match Jobs.

The tMatchGroup component enables you to import match rules from the Talend Studio repository that are based on either the VSR or the T-Swoosh algorithm.

The VSR algorithm takes a set of records as input and groups similar records (duplicates) together according to the defined match rules. It compares pairs of records and assigns them to groups. The first processed record of each group is the master record of that group. The VSR algorithm compares each record with the master of each group and uses the computed distances from the master records to decide which group the record should go to.
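The grouping logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code that tMatchGroup runs: the `similarity` function (based on the standard-library `difflib`), the `vsr_group` name, the 0.8 threshold, and the rule "join the group whose master scores highest" are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1 (hypothetical stand-in for a match rule)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def vsr_group(records, threshold=0.8):
    """Group records by comparing each one with the master of every group.

    The first record placed in a group is its master (index 0).
    """
    groups = []
    for record in records:
        best_group, best_score = None, threshold
        for group in groups:
            master = group[0]
            score = similarity(record, master)
            # Assumption: keep the group whose master is the closest match.
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            # No master is close enough: the record starts a new group
            # and becomes its master.
            groups.append([record])
        else:
            best_group.append(record)
    return groups


names = ["John Smith", "Jon Smith", "Mary Jones", "John Smyth"]
groups = vsr_group(names)
for group in groups:
    print("master:", group[0], "| members:", group[1:])
```

Note that in this sketch the master is always one of the input records, which is the property that distinguishes VSR from T-Swoosh.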

The T-Swoosh algorithm enables you to find duplicates and to define how two similar records are merged to create a master record, using a survivorship function. These new merged records are then used to find new duplicates. Unlike with the VSR algorithm, the master record is in general a new record that does not exist in the list of input records.
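The merge-and-rematch behavior can be illustrated with a simplified swoosh-style loop. This is a sketch under stated assumptions, not Talend's implementation: the record layout, the `is_match` rule (name similarity through `difflib`), and the survivorship function `merge` (keep the longest value of each field) are hypothetical choices made for the example.

```python
from collections import deque
from difflib import SequenceMatcher


def is_match(a: dict, b: dict, threshold: float) -> bool:
    """Hypothetical match rule: compare the 'name' field only."""
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return score >= threshold


def merge(a: dict, b: dict) -> dict:
    """Hypothetical survivorship function: keep the longest value per field."""
    return {field: max(a[field], b[field], key=len) for field in a}


def swoosh(records, threshold=0.8):
    """Merge matching records and feed each merged record back for rematching."""
    pending = deque(records)
    masters = []
    while pending:
        current = pending.popleft()
        match = next((m for m in masters if is_match(current, m, threshold)), None)
        if match is None:
            masters.append(current)
        else:
            # Replace the matched master with the merged record, which is
            # compared again so that it can attract further duplicates.
            masters.remove(match)
            pending.append(merge(current, match))
    return masters


records = [
    {"name": "Jon Smith", "email": "jsmith@example.com", "phone": ""},
    {"name": "John Smith", "email": "", "phone": "555-0100"},
    {"name": "Mary Jones", "email": "mjones@example.com", "phone": ""},
]
masters = swoosh(records)
for master in masters:
    print(master)
```

In this example the master for the two Smith records combines the name of one with the email and phone of both, so it is a new record that appears nowhere in the input.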

Procedure

From the configuration wizard, click the icon on the top right corner.
The Match Rule Selector wizard opens listing all match rules created in Talend Studio and saved in the repository.
Select the match rule you want to import into the tMatchGroup component and use on your data.
A warning message displays in the wizard if the match rule you want to import is defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration wizard.

Make sure that the matching algorithm selected in the basic settings of the component is the same as the algorithm of the match rule you import from the configuration wizard. Otherwise, the Job runs with default values for the parameters that are not compatible between the two algorithms.
Remember: If you are using the Apache Spark Batch component, do not import a match rule using the T-Swoosh algorithm. The component does not support this algorithm.
Select the Overwrite current Match Rule in the analysis check box if you want to replace the rule in the configuration wizard with the rule you import.
If you leave the box unselected, the match keys will be imported in a new match rule tab without overwriting the current match rule in the wizard.
Click OK.
The matching key is imported from the match rule and listed as a new rule in the configuration wizard.
Click in the Input Key Attribute and select from the input data the column on which you want to apply the matching key.
In the Match threshold field, enter the match probability threshold.
Two data records match when the computed match score is greater than or equal to this value.
In the Blocking Selection table, select the columns from the input flow which you want to use as a blocking key.
Defining a blocking key is not mandatory but advisable. A blocking key partitions data into blocks and so reduces the number of records that need to be examined, because comparisons are restricted to record pairs within each block. Blocking keys are especially useful when you are processing big datasets.

The Blocking Selection table in the component is different from the Generation of Blocking Key table in the match rule editor of the Profiling perspective.

The blocking column in tMatchGroup can come from a tGenKey component, in which case it is called T_GEN_KEY, or directly from the input schema, for instance a ZIP column. The Generation of Blocking Key table in the match rule editor, by contrast, defines the parameters necessary to generate a blocking key; this table is equivalent to the tGenKey component. It generates a blocking column, BLOCK_KEY, that is used for blocking.
Click the Chart button in the top right corner of the wizard to execute the Job using the imported match rule and show the matching results in the wizard.
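
The effect of the blocking key and the match threshold set in the steps above can be illustrated outside Talend Studio with a short Python sketch. The record layout, the `zip` blocking column, the function names, and the `difflib`-based score are assumptions made for the example; tMatchGroup computes its own match score from the configured matching functions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_records(records, blocking_key):
    """Partition records into blocks that share the same blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_key]].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs, restricted to pairs inside the same block."""
    for block in blocks.values():
        yield from combinations(block, 2)


def matching_pairs(records, blocking_key, match_key, threshold):
    """Return the pairs whose score is greater than or equal to the threshold."""
    matches = []
    for a, b in candidate_pairs(block_records(records, blocking_key)):
        score = SequenceMatcher(None, a[match_key].lower(), b[match_key].lower()).ratio()
        if score >= threshold:
            matches.append((a[match_key], b[match_key]))
    return matches


records = [
    {"name": "John Smith", "zip": "75001"},
    {"name": "Jon Smith", "zip": "75001"},
    {"name": "Mary Jones", "zip": "10001"},
    {"name": "Marie Jones", "zip": "10001"},
]

all_pairs = len(list(combinations(records, 2)))
blocked_pairs = len(list(candidate_pairs(block_records(records, "zip"))))
print("pairs without blocking:", all_pairs)
print("pairs with blocking on zip:", blocked_pairs)
print("matches at threshold 0.9:", matching_pairs(records, "zip", "name", 0.9))
```

With four records, blocking on `zip` reduces the comparisons from six pairs to two. Raising or lowering the threshold changes which of the compared pairs are reported as matches.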
