How does tMatchPairing compute the sample of suspect duplicate pairs?
The list of suspect duplicate pairs can be very large. You label only a subset of this list to identify the potential groups of duplicates.
You can then use machine learning to predict labels for the whole list. The component can then output a sample of this list. You set the sample size manually, and the pairs in the sample are chosen at random.
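As a rough illustration of what drawing a fixed-size random sample from a list of pairs means, here is a minimal Python sketch. It is not the component's actual implementation: the function name sample_suspect_pairs, the seed parameter and the example data are all hypothetical, and it assumes a uniform sample without replacement.

```python
import random


def sample_suspect_pairs(pairs, sample_size, seed=None):
    """Return a uniform random sample of at most sample_size pairs.

    Hypothetical helper for illustration only; sampling is done
    without replacement, so no pair appears twice in the result.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    k = min(sample_size, len(pairs))  # cannot draw more pairs than exist
    return rng.sample(pairs, k)


# Made-up list of suspect pairs, identified by record indices.
suspect_pairs = [(i, j) for i in range(100) for j in range(i + 1, 100)]

# The sample size is fixed manually, here to 50 pairs.
sample = sample_suspect_pairs(suspect_pairs, sample_size=50, seed=42)
```

Capping the size at the length of the list means that a request larger than the list simply returns every pair, in random order.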
For an example of how to handle grouping tasks to decide on the relationships among pairs of records using Talend Data Stewardship, see .