tRecordMatching
Ensures the data quality of any source data against a reference data source.
tRecordMatching joins two tables by performing a fuzzy match on one or more columns, using a wide variety of comparison algorithms. It compares columns from the main flow with reference columns from the lookup flow and, according to the matching strategy you define, outputs the match data, the possible match data and the rejected data. When defining your matching strategy, the user-defined matching scores are critical to determining the match level of the data of interest.
This component is not shipped with your Talend Studio by default. You need to install it using the Feature Manager. For more information, see Installing features using the Feature Manager.
tRecordMatching Standard properties
These properties are used to configure tRecordMatching running in the Standard Job framework.
The Standard tRecordMatching component belongs to the Data Quality family.
This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend Data Services Platform and in Talend Data Fabric.
Basic settings
Properties | Description |
---|---|
Schema and Edit schema |
|
Replace output column with lookup column if matches or possible matches |
Select this check box to replace the output column with the lookup column in case of match or possible match values. Once this check box is selected, the Columns Mapping table appears. |
Columns Mapping |
|
Input Key Attribute |
Select the columns from the main flow that need to be checked against the reference (lookup) key columns.

Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

For example, if you want to compare only the year in the date, set the type of the date column to Date in the component schema and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison. |
Lookup Key Attribute |
Select the lookup key columns that you will use as a reference against which to compare the columns from the input flow. |
Matching Function |
Select the relevant matching algorithm from the list:

Exact Match: matches each processed entry to all possible reference entries with exactly the same value.

Levenshtein: based on the edit distance theory. It calculates the number of insertions, deletions, or substitutions required for an entry to match the reference entry.

Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and then checks all entries of the main flow against the entries of the reference flow.

Double Metaphone: a newer version of the Metaphone phonetic algorithm that produces more accurate results than the original. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.

Exact - ignore case: matches each processed entry to all possible reference entries with exactly the same value, ignoring case.

Soundex: matches processed entries according to a standard English phonetic algorithm.

Soundex FR: matches processed entries according to a standard French phonetic algorithm.

Jaro: matches processed entries according to spelling deviations.

q-grams: matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams. The matching result is given as the number of q-gram matches over possible q-grams.

Hamming: calculates the minimum number of substitutions required to transform one string into another string of the same length. For example, the Hamming distance between "masking" and "pairing" is 3.

custom...: enables you to load an external matching algorithm from a Java library. The Custom Matcher column alongside is activated when you select this option. For further information about how to load an external Java library, see tLibraryLoad. |
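To make two of the distance-based algorithms above concrete, here is a minimal sketch of the Levenshtein and Hamming metrics in Python. This is an illustration of the underlying theory, not Talend's internal implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def hamming(a: str, b: str) -> int:
    """Substitutions needed between two strings of the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

print(hamming("masking", "pairing"))     # 3, as in the example above
print(levenshtein("kitten", "sitting"))  # 3
```

Note that Hamming is only defined for equal-length strings, while Levenshtein handles strings of different lengths, which is usually why one is preferred over the other for a given column.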
Custom Matcher |
|
Weight |
Set a numerical weight for each attribute (column) of the key definition. Enter any value greater than or equal to 0. |
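To illustrate how weights let one attribute count more than another, here is a sketch that combines per-attribute similarity scores (each assumed to be in [0, 1]) into a single record-level score using a weighted average. The aggregation formula is an assumption for illustration; the exact formula tRecordMatching uses may differ:

```python
def weighted_score(similarities, weights):
    """Combine per-attribute similarities (each in [0, 1]) into one
    record-level score using user-defined weights (each >= 0).
    Assumed aggregation: weighted average."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be >= 0")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(s * w for s, w in zip(similarities, weights)) / total

# Here a name similarity counts three times as much as a city similarity.
score = weighted_score([1.0, 0.5], [3, 1])
print(score)  # 0.875
```

Giving a column a weight of 0 effectively removes it from the score, while a large weight makes a mismatch on that column dominate the result.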
Handle Null |
To handle null values, select from the list the null operator you want to use on the column:

Null Match Null: a Null attribute only matches another Null attribute.

Null Match None: a Null attribute never matches another attribute.

Null Match All: a Null attribute matches any other attribute value.

For example, suppose there are two columns, name and firstname, where name is never null but firstname can be null, and consider the two records:

"Doe", "John"

"Doe", ""

Depending on the operator you choose, these two records may or may not match: with Null Match Null, they do not match; with Null Match None, they do not match; with Null Match All, they match.

Now consider the records:

"Doe", ""

"Doe", ""

With Null Match Null, they match; with Null Match None, they do not match; with Null Match All, they match. |
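The three null operators can be sketched as a wrapper that intercepts null values before the real comparison runs. A minimal illustration, using Python's None to stand in for a database NULL (the function and operator names here are hypothetical, not Talend API):

```python
def null_compare(a, b, operator, match_func):
    """Apply one of the three Handle Null operators before delegating
    to the actual comparison function."""
    if a is None or b is None:
        if operator == "NULL_MATCH_ALL":
            return True                      # a Null matches anything
        if operator == "NULL_MATCH_NULL":
            return a is None and b is None   # a Null only matches a Null
        return False                         # NULL_MATCH_NONE

    return match_func(a, b)

exact = lambda x, y: x == y
print(null_compare("John", None, "NULL_MATCH_NULL", exact))  # False
print(null_compare(None, None, "NULL_MATCH_NULL", exact))    # True
print(null_compare("John", None, "NULL_MATCH_ALL", exact))   # True
```

This matches the worked example above: two "Doe" records whose firstname values are both empty match under Null Match Null and Null Match All, but never under Null Match None.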
Input Column |
If required, select the columns from the input flow according to which you want to partition the processed data into blocks; this is usually referred to as "blocking". Blocking reduces the number of record pairs that need to be examined: the input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare, and comparisons are restricted to record pairs within each block. Using blocking columns is very useful when you are processing very large data sets. |
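The effect of blocking on the number of comparisons can be shown with a small sketch. Here four sample records are compared either exhaustively or only within blocks sharing the same "zip" value (the data and column names are invented for illustration):

```python
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "name": "Doe",   "zip": "75001"},
    {"id": 2, "name": "Doe",   "zip": "75001"},
    {"id": 3, "name": "Smith", "zip": "69002"},
    {"id": 4, "name": "Smyth", "zip": "69002"},
]

# Without blocking: every pair of records is compared.
all_pairs = list(combinations(records, 2))
print(len(all_pairs))       # 6 pairs for 4 records

# With blocking on the "zip" column: compare only within each block.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["zip"]].append(rec)
blocked_pairs = [pair for block in blocks.values()
                 for pair in combinations(block, 2)]
print(len(blocked_pairs))   # 2 pairs
```

The full comparison grows quadratically with the number of records, so even this tiny example drops from 6 pairs to 2; on millions of rows the saving is what makes the job tractable. The trade-off is that true matches split across different blocks are never compared.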
Matching strategy |
Select the matching output that best fulfills your needs. This option may be: |
Advanced settings
Properties | Description |
---|---|
Matching Algorithm |
Select an algorithm from the list. Only one algorithm is available for the time being: Simple VSR. This algorithm is based on a Vector Space Retrieval method that specifies how two records may match. For further information about how to import rules based on the VSR algorithm, see Importing match rules from the repository. |
Possible match interval |
Enter minimum and maximum values: minimum: sets the minimum record distance allowed to match the reference (0 <= minimum). maximum: sets the maximum distance allowed to match the reference (maximum <= 1). For example, if you set 0.5 as the minimum value and 0.9 as the maximum value, scores equal to or higher than 0.9 indicate a match, scores strictly between 0.5 and 0.9 indicate a possible match, and all other scores indicate a non-match. |
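The example above can be sketched as a small classification function. The handling of a score exactly equal to the minimum follows the "0.5 excluded" wording above and is otherwise an assumption:

```python
def classify(score, minimum=0.5, maximum=0.9):
    """Map a similarity score in [0, 1] to one of the three
    tRecordMatching outputs, per the 0.5 / 0.9 example above."""
    if score >= maximum:
        return "MATCH"
    if score > minimum:          # minimum itself is excluded
        return "POSSIBLE_MATCH"
    return "NON_MATCH"

print(classify(0.95))  # MATCH
print(classify(0.7))   # POSSIBLE_MATCH
print(classify(0.3))   # NON_MATCH
```

Narrowing the interval (raising the minimum toward the maximum) sends fewer records to the possible-match output for manual review, at the cost of more hard rejections.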
Store on disk |
Select this check box if you want to store processed data blocks on the disk to maximize system performance. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
Global Variables
Variables | Description |
---|---|
Global Variables |
NB_MATCH_LINE: the number of rows matching the comparison algorithm. This is an After variable and it returns an integer.

NB_POSSIBLE_MATCH_LINE: the number of rows possibly matching the comparison algorithm. This is an After variable and it returns an integer.

NB_NONE_MATCH_LINE: the number of rows not matching the comparison algorithm. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it. For more information about variables, see Using contexts and variables. |
Usage
Usage guidance | Description |
---|---|
Usage rule |
This component is not startable and it requires two input components and one or more output components. |