tMatchGroup MapReduce properties (deprecated)
These properties are used to configure tMatchGroup running in the MapReduce Job framework.
The MapReduce tMatchGroup component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings
Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Job, avoid the reserved word line when naming the fields. Click Sync columns to retrieve the schema from the previous component connected in the Job.

The output schema of this component contains the following read-only fields (a hypothetical sample is sketched below):

- GID: provides a group identifier of the data type String.
  Note: In Jobs migrated from previous releases to your current Talend Studio, the group identifier may be of the Long data type. To have a group identifier of the String data type, replace the tMatchGroup components in the migrated Jobs with tMatchGroup components from the Palette.
- GRP_SIZE: counts the number of records in the group; it is computed only on the master record.
- MASTER: identifies, by true or false, whether the record used in the matching comparisons is a master record. There is only one master record per group. Each input record is compared to the master record; if they match, the input record joins that group.
- SCORE: measures the distance between the input record and the master record, according to the matching algorithm used. When the tMatchGroup component is used to have multiple output flows, the score in this column decides to which output group the record goes.
- GRP_QUALITY: provides the quality of similarities in the group by taking the minimal matching value. Only the master record has a quality score.

Built-In: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
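The following sample sketches what these fields can look like for one group of three records; every value, including the GID string, is invented for the illustration:

```
name        GID       GRP_SIZE  MASTER  SCORE  GRP_QUALITY
"John Doe"  "grp-01"  3         true    1.0    0.82
"Jon Doe"   "grp-01"  0         false   0.91   0.0
"J. Doe"    "grp-01"  0         false   0.82   0.0
```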
Matching Algorithm

Select from the list the algorithm you want to use in the component: Simple VSR is the only matching algorithm you can use with the Map/Reduce version of the component.

If you converted a standard Job using tMatchGroup with the T-Swoosh algorithm to a Map/Reduce Job, select Simple VSR from the list and save the converted Job before executing it. Otherwise, an error occurs.
Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to apply a matching algorithm.

Note: When you select a date column on which to apply a matching algorithm, you can decide what to compare in the date format. For example, if you want to compare only the year in the date, set the type of the date column to Date in the component schema and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison, as sketched below.
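A minimal sketch of that conversion in plain Java, assuming a "yyyy" Date Pattern; the class and variable names are illustrative, not the component's internal code:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DatePatternSketch {
    public static void main(String[] args) {
        SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");
        Date birthDate1 = new Date(1200000000000L); // 2008-01-10
        Date birthDate2 = new Date(1220000000000L); // 2008-08-29
        // Both dates render as "2008", so a string comparison on the
        // converted values treats them as an exact match on the year.
        System.out.println(yearOnly.format(birthDate1).equals(yearOnly.format(birthDate2))); // true
    }
}
```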
Matching Function

Select a matching algorithm from the list:

- Exact: matches each processed entry to all possible reference entries with exactly the same value. It returns 1 when the two strings match exactly; otherwise it returns 0.
- Exact - ignore case: matches each processed entry to all possible reference entries with exactly the same value, ignoring case.
- Soundex: matches processed entries according to a standard English phonetic algorithm. It indexes strings by sound, as pronounced in English, for example "Hello": "H400". It does not support Chinese characters.
- Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another. When using this algorithm in the tMatchGroup component, you do not need to specify a maximum distance. The component automatically calculates a matching percentage based on the distance; see the sketch after this list. This matching score is used for the global matching calculation, based on the weight you assign in the Confidence Weight field.
- Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. It does not support Chinese characters.
- Double Metaphone: a newer version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. It does not support Chinese characters.
- Soundex FR: matches processed entries according to a standard French phonetic algorithm. It does not support Chinese characters.
- Jaro: matches processed entries according to spelling deviations. It counts the number of matched characters between two strings. The higher the distance is, the more similar the strings are.
- Jaro-Winkler: a variant of Jaro that gives more importance to the beginning of the string.
- Fingerprint key: matches entries after doing the following sequential process: remove leading and trailing whitespace; change all characters to their lowercase representation; remove all punctuation and control characters; split the string into whitespace-separated tokens; sort the tokens and remove duplicates; join the tokens back together; normalize extended western characters to their ASCII representation.
- q-grams: matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams. The matching result is given as the number of q-gram matches over possible q-grams.
- Hamming: calculates the minimum number of substitutions required to transform one string into another string of the same length. For example, the Hamming distance between "masking" and "pairing" is 3.
- custom...: enables you to load an external matching algorithm from a Java library, using the Custom Matcher column. For further information about how to load an external Java library, see tLibraryLoad. For further information about how to create a custom matching algorithm, see Creating a custom matching algorithm. For a related scenario about how to use a custom matching algorithm, see Using a custom matching algorithm to match entries.
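A minimal sketch of turning a Levenshtein edit distance into a matching score in [0,1]. The normalization used here (1 minus the distance divided by the longer string's length) is a common convention and an assumption for illustration, not necessarily the component's exact internal formula:

```java
public class LevenshteinSketch {
    // Classic dynamic-programming edit distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String s1 = "Jonsen", s2 = "Johnson";
        int dist = distance(s1, s2); // 2 edits: insert 'h', substitute 'e' -> 'o'
        double score = 1.0 - (double) dist / Math.max(s1.length(), s2.length());
        System.out.println(dist + " -> " + score); // 2 -> 0.7142857142857143
    }
}
```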
Custom Matcher

When you select custom... as the matching type, enter the path pointing to the custom class (external matching algorithm) you need to use. You define this path yourself in the library file (.jar file). For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to enter is org.talend.mydistance.MyDistance.
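An illustrative custom matcher following the packaging example above (org.talend.mydistance.MyDistance in mydistance.jar). The method name and the exact interface to implement are assumptions here; see Creating a custom matching algorithm for the real contract:

```java
package org.talend.mydistance;

public class MyDistance {
    // Returns 1.0 for strings that are identical ignoring case and
    // surrounding whitespace, 0.0 otherwise. The method signature is
    // illustrative, not the documented Talend interface.
    public double getMatchingWeight(String record1, String record2) {
        if (record1 == null || record2 == null) {
            return 0.0;
        }
        return record1.trim().equalsIgnoreCase(record2.trim()) ? 1.0 : 0.0;
    }
}
```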
Weight

Set a numerical weight for each attribute (column) of the key definition. The values can be any number >= 0. The sketch below illustrates how such weights can combine per-attribute scores.
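A minimal sketch of how per-attribute weights can combine individual matching scores into one global score; the weighted average is an assumption for illustration, not necessarily the component's exact internal formula:

```java
public class WeightedScoreSketch {
    public static void main(String[] args) {
        double[] scores  = {1.0, 0.7}; // e.g. lname matched exactly, fname approximately
        double[] weights = {2.0, 1.0}; // lname counts twice as much as fname
        double weightedSum = 0.0, totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * weights[i];
            totalWeight += weights[i];
        }
        System.out.println(weightedSum / totalWeight); // 0.9
    }
}
```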
Handle Null

To handle null values, select from the list the null operator you want to use on the column:

- Null Match Null: a Null attribute only matches another Null attribute.
- Null Match None: a Null attribute never matches another attribute.
- Null Match All: a Null attribute matches any other value of an attribute.

For example, suppose two columns, name and firstname, where name is never null but firstname can be null. For the two records:
"Doe", "John"
"Doe", ""
Depending on the operator you choose, these two records may or may not match:
- Null Match Null: they do not match.
- Null Match None: they do not match.
- Null Match All: they match.
And for the records:
"Doe", ""
"Doe", ""
- Null Match Null: they match.
- Null Match None: they do not match.
- Null Match All: they match.
The sketch below illustrates these three operators.
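A minimal sketch of the three null operators applied to a single attribute pair; the enum and method names are illustrative, not the component's internal code:

```java
public class NullHandlingSketch {
    enum NullOption { NULL_MATCH_NULL, NULL_MATCH_NONE, NULL_MATCH_ALL }

    static boolean matches(String a, String b, NullOption option) {
        if (a != null && b != null) {
            return a.equals(b); // both present: ordinary comparison
        }
        switch (option) {
            case NULL_MATCH_NULL: return a == null && b == null;
            case NULL_MATCH_ALL:  return true;
            default:              return false; // NULL_MATCH_NONE
        }
    }

    public static void main(String[] args) {
        // The firstname attribute of "Doe","John" vs "Doe",null:
        System.out.println(matches("John", null, NullOption.NULL_MATCH_NULL)); // false
        System.out.println(matches("John", null, NullOption.NULL_MATCH_NONE)); // false
        System.out.println(matches("John", null, NullOption.NULL_MATCH_ALL));  // true
    }
}
```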
Match Threshold

Enter the match probability. Two data records match when the probability is above the set value. You can enter a different match threshold for each match rule.
Blocking Selection

Input Column

If required, select the column(s) from the input flow according to which you want to partition the processed data in blocks; this is usually referred to as "blocking". Blocking reduces the number of record pairs that need to be examined: the input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare, and comparisons are restricted to record pairs within each block, as sketched below. Using blocking column(s) is very useful when you are processing large data sets.
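A minimal sketch of blocking: records are partitioned by a blocking key, and pair comparisons happen only within each partition. The record representation and the key choice are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {
    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[]{"Doe", "John"},
                new String[]{"Doe", "Jon"},
                new String[]{"Smith", "Anna"});

        // Block on the first column (e.g. a lastname column).
        Map<String, List<String[]>> blocks = new HashMap<>();
        for (String[] record : records) {
            blocks.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record);
        }

        // Only pairs inside the same block are compared: 1 pair here
        // instead of the 3 pairs a full cross-comparison would need.
        for (List<String[]> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    System.out.println(block.get(i)[1] + " <-> " + block.get(j)[1]);
                }
            }
        }
    }
}
```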
Advanced settings
Store on disk

Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance.
Max buffer size: type in the size of physical memory you want to allocate to processed data.
Temporary data directory path: set the location where the temporary file should be stored.
Multiple output

Select the Separate output check box to have three different output flows:

- Uniques: when the group score (minimal distance computed in the record) is equal to 1, the record is listed in this flow.
- Matches: when the group score (minimal distance computed in the record) is higher than the threshold you define in the Confident match threshold field, the record is listed in this flow.
- Suspects: when the group score (minimal distance computed in the record) is below the threshold you define in the Confident match threshold field, the record is listed in this flow.

Confident match threshold: set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident in the quality of the group. This routing is sketched below.
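A minimal sketch of the three-way routing described above, based on the group score and the two thresholds; the names and values are illustrative:

```java
public class OutputRoutingSketch {
    // Routes a record by its group score, per the description above.
    static String route(double groupScore, double confidentMatchThreshold) {
        if (groupScore == 1.0) {
            return "Uniques";
        }
        return groupScore > confidentMatchThreshold ? "Matches" : "Suspects";
    }

    public static void main(String[] args) {
        double confidentMatchThreshold = 0.9; // between the Match threshold and 1
        System.out.println(route(1.0, confidentMatchThreshold));  // Uniques
        System.out.println(route(0.95, confidentMatchThreshold)); // Matches
        System.out.println(route(0.8, confidentMatchThreshold));  // Suspects
    }
}
```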
Multi-pass

Select this check box to enable a tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This refines the groups received by each of the tMatchGroup components by creating data partitions based on different blocking keys.

Note: When using two tMatchGroup components in a Job with this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select the check box in the second component in the Job flow and then in the first component; otherwise, you may end up with two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues.

For an example Job, see Matching customer data through multiple passes.
Sort the output data by GID

Select this check box to group the output data by the group identifier. The output is sorted in ascending alphanumeric order by group identifier.
Output distance details

Select this check box to add an output column, MATCHING_DISTANCES, to the schema of the component. This column provides the distance between the input and master records in each group.

Note: When using two tMatchGroup components in a Job with this option, you must select this check box in both tMatchGroup components before linking them together. If you linked the components before selecting this check box, select the check box in the second component in the Job flow and then in the first component; otherwise, you may end up with two columns in the output schema with the same name. Selecting this check box in only one tMatchGroup component may cause schema mismatch issues.
Display detailed labels

Select this check box to have in the output MATCHING_DISTANCES column not only the matching distance but also the names of the columns used as key attributes in the applied rule. For example, if you match on the first name and last name fields, fname and lname, the output is fname:1.0|lname:0.97 when the check box is selected and 1.0|0.97 when it is not.
tStatCatcher Statistics

Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.
Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has that check box.

A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. A retrieval example is sketched below.
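For example, in a component that accepts Java code (a tJava, for instance) placed after this component, the variable can be read from globalMap; "tMatchGroup_1" is the illustrative unique name of the component in the Job:

```java
String errorMessage = (String) globalMap.get("tMatchGroup_1_ERROR_MESSAGE");
System.out.println(errorMessage);
```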
Usage
Usage rule

In a Talend Map/Reduce Job, this component is used as an intermediate step, and the other components used along with it must be Map/Reduce components too. Together they generate native Map/Reduce code that can be executed directly in Hadoop. You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. For further information about Talend Map/Reduce Jobs, see the sections describing how to create, convert and configure a Talend Map/Reduce Job in the Talend Big Data Getting Started Guide. For a scenario demonstrating a Map/Reduce Job using this component, see Matching data through multiple passes using Map/Reduce components. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, not Map/Reduce Jobs.