Choosing metrics and defining matching rules
After blocking data into similar-sized groups, you can create match rules and test them before using them in the tMatchGroup component.
For more information about creating a match analysis, see Creating a match analysis.
Matching functions in the tMatchGroup component
tMatchGroup helps you create groups of similar data records in any data source, including large volumes of data, by using one or several match rules. The matching functions include:
- Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
- The Levenshtein distance calculates the minimum number of edits required to transform one string into another.
- The Jaro distance matches processed entries according to spelling deviations.
- The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.
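As an illustration of the distances listed above, the following Python sketch implements the textbook definitions of the Levenshtein distance and of the Jaro and Jaro-Winkler similarities. It is not the code used by tMatchGroup. The function names, the prefix scale of 0.1 and the maximum prefix length of 4 are assumptions taken from the common formulation of these measures.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_matched = [c for c, hit in zip(a, a_hit) if hit]
    b_matched = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

For example, `levenshtein("kitten", "sitting")` returns 3, and `jaro_winkler("MARTHA", "MARHTA")` is slightly higher than `jaro("MARTHA", "MARHTA")` because both strings start with "MAR".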
For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see tMatchGroup.
The Simple VSR Matcher and the T-Swoosh algorithms
- Simple VSR Matcher
- T-Swoosh
For more information about match analyses, see "Create a match rule" on Talend Help Center.
When do records match?
- When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
- The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
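The two conditions above can be sketched as a single decision function. This is a minimal sketch only: the function name, the dictionary-based inputs and the normalization of the global score by the sum of the weights are assumptions made for illustration, not the component's actual implementation.

```python
def is_match(scores, thresholds, weights, match_threshold):
    """Decide whether two records match.

    scores, thresholds and weights are dicts keyed by the compared column
    (hypothetical layout); each score is a similarity between 0 and 1.
    """
    # Condition 1: every matching function must score above its own threshold.
    if any(scores[key] <= thresholds[key] for key in scores):
        return False
    # Condition 2: the weighted global score must exceed the match threshold.
    # Assumption: the global score is a weighted average of the scores.
    total_weight = sum(weights[key] for key in scores)
    global_score = sum(scores[key] * weights[key] for key in scores) / total_weight
    return global_score > match_threshold
```

With scores of 0.9 on a name column (weight 2) and 0.8 on a city column (weight 1), the global score is about 0.87, so the pair matches for a match threshold of 0.85 but not for 0.9.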
Multiple passes
In general, different partitioning schemes are necessary. This requires using tMatchGroup components sequentially to match data against different blocking keys.
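The idea behind multiple passes can be sketched in Python as follows. This is a simplified illustration, not how tMatchGroup components are chained in a Job: the function names are hypothetical, and each pass here simply adds the matching pairs it finds to the result of the previous passes.

```python
from collections import defaultdict
from itertools import combinations


def match_pass(records, blocking_key, is_match):
    """Compare records only within blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs = set()
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if is_match(records[i], records[j]):
                pairs.add((i, j))
    return pairs


def multi_pass(records, blocking_keys, is_match):
    """Run one pass per blocking key and collect all matching pairs."""
    pairs = set()
    for blocking_key in blocking_keys:
        pairs |= match_pass(records, blocking_key, is_match)
    return pairs


# Illustrative data: records 0 and 1 share a ZIP code, records 1 and 2 a phone number.
records = [
    {"name": "Jon Smith", "zip": "75001", "phone": "555-0100"},
    {"name": "John Smith", "zip": "75001", "phone": "555-0199"},
    {"name": "John Smith", "zip": "75010", "phone": "555-0199"},
]
same_surname = lambda r1, r2: r1["name"].split()[-1] == r2["name"].split()[-1]
pairs = multi_pass(
    records,
    [lambda r: r["zip"], lambda r: r["phone"]],
    same_surname,
)
```

A single pass on the ZIP code finds only the pair (0, 1). The second pass on the phone number adds the pair (1, 2), which the first blocking key placed in different blocks.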
For an example of how to match data through multiple passes, see Matching customer data through multiple passes.
Working with the tRecordMatching component
tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When you define your matching strategy, the user-defined matching scores are critical in determining the match level of the data of interest.
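The three-way split described above can be sketched as follows. This is an illustration under stated assumptions rather than the component's logic: the function name, the `possible_min` and `match_min` thresholds and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices.

```python
from difflib import SequenceMatcher


def similarity(value, reference):
    """Stand-in similarity between 0 and 1 (assumption: any matching function could be used)."""
    return SequenceMatcher(None, value, reference).ratio()


def classify(main_rows, lookup_rows, score, possible_min, match_min):
    """Split main-flow rows into matches, possible matches and rejects.

    Each row is compared with every lookup row and keeps its best score.
    """
    matches, possible, rejects = [], [], []
    for row in main_rows:
        best = max((score(row, ref) for ref in lookup_rows), default=0.0)
        if best >= match_min:
            matches.append(row)
        elif best >= possible_min:
            possible.append(row)
        else:
            rejects.append(row)
    return matches, possible, rejects


matches, possible, rejects = classify(
    ["paris", "pariss", "tokyo"],   # main flow
    ["paris", "london"],            # lookup flow
    similarity,
    possible_min=0.8,
    match_min=0.95,
)
```

Here "paris" lands in the matches, "pariss" in the possible matches and "tokyo" in the rejects. Raising or lowering the two thresholds moves rows between the three outputs, which is why the user-defined scores matter so much.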
For more information about the tRecordMatching component, see tRecordMatching.