How does tMatchPairing compute the sample of suspect duplicate pairs?

The list of suspect duplicate pairs can be very large, so you label only a subset of it to identify the potential groups of duplicates.

You can then use machine learning to predict labels for the whole list. After that, you can output a sample of this list. You set the sample size manually, and the pairs in the sample are chosen at random.
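
The random, fixed-size sampling described above can be illustrated with a minimal Python sketch. This is not the component's actual implementation: the function name `sample_suspect_pairs`, the `seed` parameter, the record identifiers, and the label values are all hypothetical and serve only to show the idea of drawing a uniform random sample of a manually chosen size.

```python
import random


def sample_suspect_pairs(suspect_pairs, sample_size, seed=None):
    """Return a uniform random sample of suspect pairs, without replacement.

    Hypothetical helper for illustration only; not part of any Talend API.
    """
    rng = random.Random(seed)
    # Cap the size so that asking for more pairs than exist returns them all.
    k = min(sample_size, len(suspect_pairs))
    return rng.sample(suspect_pairs, k)


# Hypothetical suspect pairs: (record_id_a, record_id_b, predicted_label)
suspect_pairs = [
    (1, 2, "match"),
    (3, 4, "no_match"),
    (5, 6, "match"),
    (7, 8, "no_match"),
    (9, 10, "match"),
]

# The sample size is fixed manually, here to 3 pairs.
sample = sample_suspect_pairs(suspect_pairs, sample_size=3, seed=42)
for pair in sample:
    print(pair)
```

Passing a seed makes the draw reproducible, which can help when the same sample has to be reviewed more than once. Omit it to get a different sample on each run.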

For an example of how to handle grouping tasks to decide on the relationship among pairs of records using Talend Data Stewardship, see .
