Doing continuous matching
If you want to match new records against a clean data set, you do not need to restart the matching process from scratch.
You can index the clean data set once and reuse it to do continuous matching.
To perform continuous matching tasks, Elasticsearch version 5.1.2 or later must be running.
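The version requirement above can be checked before a matching Job is launched. The following is a minimal sketch, not part of any Talend component: the helper name `meets_minimum` is hypothetical, and it assumes you have already obtained the version string (Elasticsearch reports it in the `version.number` field of the JSON returned by its root endpoint).

```python
import re

# Minimum Elasticsearch version stated in this section.
MINIMUM_VERSION = (5, 1, 2)


def meets_minimum(version_string, minimum=MINIMUM_VERSION):
    """Return True if a version string such as '5.1.2' is at least `minimum`.

    Hypothetical helper for illustration only. Anything after the
    major.minor.patch numbers (for example '-SNAPSHOT') is ignored.
    """
    found = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string.strip())
    if found is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    # Tuples of integers compare element by element, so (5, 10, 0) > (5, 1, 2).
    return tuple(int(part) for part in found.groups()) >= minimum
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "5.10.0" as older than "5.2.0".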
The continuous matching process is made up of the following steps:
- The first step consists of computing suffixes to separate clean and
deduplicated records from a data set and indexing them in Elasticsearch using
tMatchIndex.
For an example of how to index a data set in Elasticsearch using tMatchIndex, see Indexing a reference data set in Elasticsearch on Talend Help Center (https://help.talend.com).
- The second step consists of comparing the indexed records with new
records having the same schema and outputting matching and non-matching records
using tMatchIndexPredict. This component uses the
pairing and matching models generated by tMatchPairing and tMatchModel.
For an example of how to match new records against records from a reference data set, see Doing continuous matching using tMatchIndexPredict on Talend Help Center (https://help.talend.com).
You can then clean and deduplicate the non-matching records using tRuleSurvivorship and populate the clean data set indexed in Elasticsearch using tMatchIndex.
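The overall loop described above can be pictured with a small in-memory sketch. It is only a conceptual illustration under stated assumptions: it does not call Elasticsearch or any Talend component, a plain Python list stands in for the index, a string-similarity ratio from the standard library stands in for the pairing and matching models, and all function names, the sample records, and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy stand-in for a matching model: mean string similarity over shared fields."""
    keys = sorted(set(a) & set(b))
    if not keys:
        return 0.0
    ratios = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in keys
    ]
    return sum(ratios) / len(ratios)


def match_new_records(clean_index, new_records, threshold=0.85):
    """Split new records into matching and non-matching ones (second step)."""
    matching, non_matching = [], []
    for record in new_records:
        # Find the most similar record in the clean set, if there is one.
        best = max(clean_index, key=lambda c: similarity(record, c), default=None)
        if best is not None and similarity(record, best) >= threshold:
            matching.append((record, best))
        else:
            non_matching.append(record)
    return matching, non_matching


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of near-duplicates (crude survivorship)."""
    survivors = []
    for record in records:
        if not any(similarity(record, kept) >= threshold for kept in survivors):
            survivors.append(record)
    return survivors


# A list stands in for the clean data set indexed in the first step.
clean_index = [{"name": "John Smith", "city": "Paris"}]
new_records = [
    {"name": "Jon Smith", "city": "Paris"},
    {"name": "Alice Martin", "city": "Lyon"},
    {"name": "Alice Martin", "city": "Lyon"},
]

matching, non_matching = match_new_records(clean_index, new_records)
# Clean the non-matching records, then add them to the clean set for the next run.
clean_index.extend(deduplicate(non_matching))
```

Because the deduplicated non-matching records are appended to the clean set, the next batch of new records is compared against an enlarged reference set without rebuilding it from scratch, which is the point of continuous matching.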