From the tMatchGroup configuration wizard, you can import match keys from the
match rules created and tested in the Profiling perspective. You can then use
these imported match keys in your matching Jobs.
The tMatchGroup component enables you to import match rules based on the VSR or
the T-Swoosh algorithm from the Talend Studio repository.
The VSR algorithm takes a set of records as input and groups similar records
(duplicates) together according to the defined match rules. It compares pairs of records and
assigns them to groups. The first record processed in each group becomes the master record of
that group. The algorithm compares each subsequent record with the master of each group and
uses the computed distances from the master records to decide which group the record belongs to.
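The master-based grouping described above can be sketched in a few lines of Python. This is only an illustration, not the component's implementation: the `similarity` function (based on `difflib`), the 0.8 threshold, and the rule that a record joins the group with the most similar master or otherwise starts a new group are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the component's matching functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_by_master(records, threshold=0.8):
    """Compare each record only with the master (first record) of each group."""
    groups = []  # groups[i][0] is the master record of group i
    for record in records:
        best_group, best_score = None, threshold
        for group in groups:
            score = similarity(record, group[0])
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            # Assumption: a record that matches no master starts a new group.
            groups.append([record])
        else:
            best_group.append(record)
    return groups


names = ["Jon Smith", "John Smith", "Mary Jones", "Jon Smyth", "Marie Jones"]
groups = group_by_master(names)
for group in groups:
    print("master:", group[0], "| members:", group[1:])
```

Because records are compared only with masters, the result depends on the order in which records are processed.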
The T-Swoosh algorithm enables you to find duplicates and to define, through a
survivorship function, how two similar records are merged to create a master record.
The merged records are then used to find new duplicates. Unlike with the VSR
algorithm, the master record is generally a new record that does not exist in
the list of input records.
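The merge-and-recompare idea can be illustrated with a simplified loop in the style of Swoosh entity-resolution algorithms. It is a sketch under stated assumptions, not the T-Swoosh implementation used by the component: the match rule `is_match`, the survivorship function `survive`, and the record layout are hypothetical.

```python
def is_match(a: dict, b: dict) -> bool:
    """Hypothetical match rule: records match if they share an email or a phone."""
    return bool(a["emails"] & b["emails"]) or bool(a["phones"] & b["phones"])


def survive(a: dict, b: dict) -> dict:
    """Hypothetical survivorship function: keep the longest name, union the rest."""
    return {
        "name": max(a["name"], b["name"], key=len),
        "emails": a["emails"] | b["emails"],
        "phones": a["phones"] | b["phones"],
    }


def merge_duplicates(records):
    pending = list(records)
    masters = []
    while pending:
        current = pending.pop()
        partner = next((m for m in masters if is_match(current, m)), None)
        if partner is None:
            masters.append(current)
        else:
            # The merged record goes back into the queue so that it can
            # match records that neither of its sources matched alone.
            masters.remove(partner)
            pending.append(survive(current, partner))
    return masters


records = [
    {"name": "J. Smith", "emails": {"js@example.com"}, "phones": set()},
    {"name": "John Smith", "emails": {"js@example.com"}, "phones": {"555-0100"}},
    {"name": "Johnny Smith", "emails": set(), "phones": {"555-0100"}},
    {"name": "Mary Jones", "emails": {"mj@example.com"}, "phones": set()},
]
masters = merge_duplicates(records)
```

In this example, "J. Smith" and "Johnny Smith" share no attribute directly, yet they end up in the same master record because each matches a record that was merged earlier. The resulting master combines values from several inputs and is not equal to any single input record.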
-
From the configuration wizard, click the icon in the top right corner.
The Match Rule Selector wizard opens, listing all match rules created in Talend Studio and saved in the repository.
-
Select the match rule you want to import into the tMatchGroup component and use on your data.
A warning message is displayed in the wizard if the match rule you want to import is
defined on columns that do not exist in the input schema of tMatchGroup. You can define
input columns later in the configuration wizard.
Make sure that the matching algorithm selected in the basic settings of the
component is the same as the algorithm of the match rule you import from the
configuration wizard. Otherwise, the Job runs with default values for the
parameters that are not compatible between the two algorithms.
Remember: If you are using the Apache Spark Batch component, do not
import a match rule that uses the T-Swoosh algorithm. The component does not
support this algorithm.
-
Select the Overwrite current Match Rule in the analysis check box if you want
to replace the rule in the configuration wizard with the rule you import.
If you leave the check box cleared, the match keys are imported into a new match
rule tab without overwriting the current match rule in the wizard.
-
Click OK.
The matching key is imported from the match rule and listed as a new rule in the
configuration wizard.
-
Click in Input Key Attribute and select, from the input data, the column to
which you want to apply the matching key.
-
In the Match threshold field, enter the match probability threshold.
Two data records match when the computed match score is greater than or equal
to this value.
-
In the Blocking Selection table, select the columns from the input flow that
you want to use as a blocking key.
Defining a blocking key is not mandatory but is advisable. A blocking key
partitions the data into blocks and so reduces the number of records
that need to be examined, because comparisons are restricted to record pairs within
each block. Blocking keys are especially useful when you are processing large
datasets.
The Blocking Selection table in the component is different from the Generation of
Blocking Key table in the match rule editor of the Profiling perspective.
The blocking column in tMatchGroup can come from a tGenKey component, in which case it is called T_GEN_KEY, or directly from the input schema, for instance a ZIP column.
The Generation of Blocking Key table in the match rule editor, by contrast, defines
the parameters necessary to generate a blocking key; this table is equivalent to
the tGenKey component. It generates a blocking column named BLOCK_KEY, which is
used for blocking.
-
Click the Chart button in the top right corner of
the wizard to execute the Job using the imported match rule and show the matching
results in the wizard.
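To see why a blocking key reduces the work done in the matching step, the sketch below counts the record pairs that would be compared with and without blocking. It is a generic illustration and does not reproduce what the component does internally; the sample records, the `zip` column name, and the `candidate_pairs` helper are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Yield only the record pairs that share the same blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_key]].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"id": 1, "name": "Jon Smith", "zip": "75001"},
    {"id": 2, "name": "John Smith", "zip": "75001"},
    {"id": 3, "name": "Mary Jones", "zip": "69002"},
    {"id": 4, "name": "Marie Jones", "zip": "69002"},
    {"id": 5, "name": "Paul Martin", "zip": "13001"},
]

all_pairs = list(combinations(records, 2))             # every record against every other
blocked_pairs = list(candidate_pairs(records, "zip"))  # only pairs within a block

print(len(all_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs with blocking on zip")
```

The trade-off is that records placed in different blocks are never compared, so a blocking key with unreliable values (for example, a mistyped ZIP code) can hide true duplicates.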