Computing suspect pairs and suspect sample from source data
This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.
In this example, tMatchPairing uses a blocking key to compute the pairs of suspect duplicates in a list of early childhood education centers in Chicago.
The use case described here uses:
- a tFileInputDelimited component to read the source file, which contains a list of early childhood education centers in Chicago coming from ten different sources;
- a tMatchPairing component to pre-analyze the data, compute pairs of suspect duplicates and generate a pairing model which is used by the tMatchPredict component;
- three tFileOutputDelimited components to output the suspect duplicates, a sample of suspect pairs and the unique records; and
- a tLogRow component to output the exact duplicates.
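
To make the idea of a blocking key more concrete, the following Python sketch shows how records can be routed to the same kinds of outputs as in this use case (suspect pairs, unique records and exact duplicates). It is only an illustration of the general technique and not the algorithm that tMatchPairing actually runs. The field names (`name`, `zip`), the key definition, the helper functions and the sample records are all hypothetical and do not come from the Chicago source file.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the name plus the zip code.
    # A real Job would define its own key in the component settings.
    return record["name"][:3].upper() + "|" + record["zip"]


def pair_suspects(records):
    # Step 1: set aside exact duplicates (records identical in every field
    # to a record that was already seen).
    seen = set()
    distinct, exact_duplicates = [], []
    for record in records:
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            exact_duplicates.append(record)
        else:
            seen.add(fingerprint)
            distinct.append(record)

    # Step 2: group the remaining records by blocking key.
    blocks = defaultdict(list)
    for record in distinct:
        blocks[blocking_key(record)].append(record)

    # Step 3: records alone in their block are treated as unique;
    # records sharing a block are compared pairwise as suspects.
    suspect_pairs, unique = [], []
    for block in blocks.values():
        if len(block) == 1:
            unique.extend(block)
        else:
            suspect_pairs.extend(combinations(block, 2))
    return suspect_pairs, unique, exact_duplicates


# Made-up sample records, for illustration only.
records = [
    {"name": "Little Stars Center", "zip": "60601"},
    {"name": "Little Stars Ctr", "zip": "60601"},
    {"name": "Little Stars Center", "zip": "60601"},
    {"name": "Sunrise Academy", "zip": "60614"},
]

suspect_pairs, unique, exact_duplicates = pair_suspects(records)
print(len(suspect_pairs), len(unique), len(exact_duplicates))
```

Comparing records only within a block keeps the number of candidate pairs far smaller than comparing every record with every other record, which is the usual reason for using a blocking key on large data sets.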