Using survivorship functions to merge two records and create a master record

This scenario describes a basic Job that uses the T-Swoosh algorithm to find duplicates in customer names. It compares values in the first_name column using matching and survivorship methods, groups similar records together and creates single representations of entities in each group by merging two similar records according to a survivorship rule.

These master records are new records that do not exist in the input data.

You can use several matching passes with the T-Swoosh algorithm in a Job. Each pass provides the survived masters for each group and the original records to the second pass. However, intermediate master records are removed from the output flow. Only final master and original records are kept at the end.

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

About this task

As the T-Swoosh algorithm is not supported by the Apache Spark Batch component, this scenario uses the Standard component.

In this Job:

The tFileInputDelimited component provides the customer records to be processed.
The tMatchGroup component processes data using matching and survivorship methods.
The tLogRow component presents the matching and survivorship results.

Setting up the Job

About this task

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the Repository, see Managing metadata in Talend Studio.

Procedure

In the Repository tree view, expand Metadata - File delimited where you have stored the main input schema and drop the file connection onto the design workspace.
The file connection used in this scenario is called Customer_File.

A dialog box is displayed with a list of components.
Select tFileInputDelimited and click OK.
Drop a tMatchGroup and a tLogRow from the Palette onto the design workspace.
Link the components together using the Main links.

Configuring the input component

About this task

The main input file contains the columns: id, first_name and age. The data in this input file has problems such as duplication, names spelled differently or wrongly, different information for the same customer.

Procedure

Double-click tFileInputDelimited to open its Component view.

The property fields are automatically filled in. If you do not define your input schema locally in the repository, fill in the details manually after selecting Built-in in the Schema and Property Type lists.
Create the schema through the Edit Schema button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

Setting the schema and selecting the matching algorithm

Procedure

Double-click tMatchGroup to display the Basic settings view and define the component properties.
From the Matching Algorithm list, select T-Swoosh.
In this scenario, the match rule is based on the T-Swoosh algorithm.
Click Sync columns to retrieve the schema from the preceding component.
Click the Edit schema button to view the input and output schema and do any modifications in the output schema, if necessary.

In the output schema of this component there are few output standard columns that are read-only. For more information, see tMatchGroup Standard properties.
In the Advanced settings view, leave the parameters selected by default.

Defining the match rule

Procedure

In the tMatchGroup basic settings, click Preview to open the configuration wizard and define the matching key and the survivorship function.

You can use the configuration wizard to import match rules created and tested in Talend Studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the repository.

It is important to have the same type of the matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms.
Define the match rule as the following:
- In the Key definition table, click the [+] button to add a line in the table. Click in the Input Key Attribute column and select the column on which you want to do the matching operation, first_name in this scenario.
- Click in the Matching Function column and select Soundex from the list. This method matches processed entries according to a standard English phonetic algorithm which indexes strings by sound, as pronounced in English.
- From the Tokenized measure list, select not to use a tokenized distance for the selected algorithm.
- Set the Threshold to 0.8 and the Confidence Weight to 1.
- Select Null Match None in the Handle Null column in order to have matching results where null values have minimal effect.
- Select Most common in the Survivorship Function column. This method validates the most frequent name value in each group of duplicates.
Define the default survivorship rule as the following:
- In the Default Survivorship Rules table, click the [+] button to add a line in the table. Click in the Data Type column and select Number.
- Click in the Survivorship Function column and select Largest (for numbers) from the list. This method validates the largest numerical value in each group.
Set the Hide groups of less than parameter in order to decide what groups to show in the result chart and matching table.
This parameter enables you to hide groups of small group size.
Click the Chart button in the wizard to execute the Job in the defined configuration and have the results directly in the wizard.

The matching chart gives a global picture about the duplicates in the analyzed data. The matching table indicates the details of the items in each group, colors the groups in accordance with their color in the matching chart and indicates with true the records which are master records. The master record in each group is the result of merging two similar records according to the phonetic algorithm and survivorship rule. The master record is a new record that does not exist in the input data.
Click OK to close the wizard.

Finalizing the Job and executing it

Procedure

Double-click the tLogRow component to display its Basic settings view and define the component properties.
Save your Job and press F6 to execute it.

Similar records are grouped together. The number of records in each group is listed in the GRP_SIZE column and computed only on the master record. The MASTER column indicates with true or false if the corresponding record is a master record or not a master record. The master record is created by merging the record which best matches the phonetic algorithms with the record which has the largest age value. The SCORE column lists the calculated similarity between the input record and the master record according to the Soundex matching algorithm.

The identifier for each group, which is of String data type, is listed in the GID column next to the corresponding record. This identifier will be of the data type Long for Jobs that are migrated from older releases. To have the group identifier as String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the Talend Studio Palette.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here