Computing suspect pairs and writing a sample in a Grouping campaign
Procedure
-
Double-click tMatchPairing to display the Basic settings view and define the component properties.
- Click Sync columns to retrieve the schema defined in the input component.
-
In the Blocking Key table, click the [+] button to add a row. Select the column you want to use as a blocking key, Site_name in this example.
The blocking key is constructed from the center name and is used to generate the suffixes used to group pairs of records.
-
In the Suffix array blocking parameters section:
- In the Min suffix length field, set the minimum suffix length you want to reach or stop at in each group.
- In the Max block size field, set the maximum number of records you want to have in each block. This helps filter data in large blocks where the suffix is too common.
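To make the effect of these two parameters concrete, the following is a minimal Python sketch of suffix-based blocking. It is an illustration under stated assumptions, not the component's actual implementation: the function names, the key normalization, and the sample records are all hypothetical.

```python
from collections import defaultdict


def suffixes(key, min_suffix_length):
    """Return every suffix of key that has at least min_suffix_length characters."""
    return [key[i:] for i in range(len(key) - min_suffix_length + 1)]


def build_blocks(records, key_field, min_suffix_length, max_block_size):
    """Group record indices by shared suffix and drop blocks that are too large."""
    blocks = defaultdict(set)
    for index, record in enumerate(records):
        # Assumed normalization: lowercase and strip spaces. The real one may differ.
        key = record[key_field].lower().replace(" ", "")
        for suffix in suffixes(key, min_suffix_length):
            blocks[suffix].add(index)
    # A block needs at least two records to yield a pair; blocks above
    # max_block_size are discarded because their suffix is too common.
    return {s: ids for s, ids in blocks.items() if 2 <= len(ids) <= max_block_size}


def candidate_pairs(blocks):
    """Collect the distinct pairs of record indices that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs


# Hypothetical sample data using the Site_name column from this example.
records = [
    {"Site_name": "North Park"},
    {"Site_name": "Northpark"},
    {"Site_name": "Lakeview"},
]
blocks = build_blocks(records, "Site_name", min_suffix_length=4, max_block_size=2)
print(candidate_pairs(blocks))
```

In this sketch, a smaller minimum suffix length produces more and shorter suffixes, and therefore more candidate pairs, while a smaller maximum block size discards more of the common suffixes.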
-
In the Folder field, set the path to the local folder where you want to generate the pairing model file.
If you want to store the model in a specific file system, for example S3 or HDFS, you must use the corresponding component in the Job and select the Define a storage configuration component check box in the component basic settings.
-
Select the Integration with Data Stewardship check box and set the connection parameters to the Talend Data Stewardship server.
-
In the URL field, enter the address of the application suffixed with /data-stewardship/, for example http://localhost:19999/data-stewardship/.
If you are working with Talend Cloud Data Stewardship, use the URL for the corresponding data center suffixed with /data-stewardship/ to access the application, for example, https://tds.us.cloud.talend.com/data-stewardship for the AWS US data center.
For the URLs of available data centers, see Talend Cloud regions and URLs.
-
Enter your login information in the Username and Password fields.
To enter your password, click ... next to the field, enter your password between double quotes in the dialog box that opens, and click OK.
If you are working with Talend Cloud Data Stewardship and if:
- SSO is enabled, enter an access token in the field.
- SSO is not enabled, enter either an access token or your password in the field.
- Click Find a campaign to open a dialog box that lists the campaigns defined in Talend Data Stewardship that you own or have access rights to.
- Select the Sites deduplication campaign in which to write the grouping tasks and click OK.
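As a small illustration of the URL format expected in the URL field, the sketch below appends the /data-stewardship/ suffix to a base address when it is missing. The helper name stewardship_url is hypothetical and is not part of any Talend API.

```python
def stewardship_url(base_url):
    """Return base_url ending with exactly one /data-stewardship/ suffix."""
    suffix = "/data-stewardship"
    trimmed = base_url.rstrip("/")
    if not trimmed.endswith(suffix):
        trimmed += suffix
    return trimmed + "/"


print(stewardship_url("http://localhost:19999"))
```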
-
Click Advanced settings and set the following parameters:
-
In the Filtering threshold field, enter a value between 0.2 and 0.85 to filter the pairs of suspect records based on the calculated scores.
This value helps to exclude the pairs which are not very similar. The higher the value, the more similar the records in the retained pairs are.
- Leave the Set a random seed check box cleared, as you want to generate a different sample at each execution of the Job.
- In the Number of pairs field, enter the size of the suspect pairs sample you want to generate.
-
When configured with Talend Data Stewardship, enter the maximum number of tasks to load per commit in the Max tasks per commit field.
There are no limits for the batch size in Talend Data Stewardship (on premises). However, do not exceed 200 tasks per commit in Talend Cloud Data Stewardship, otherwise the Job fails.
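The advanced settings above can be pictured as three successive operations on the scored pairs: filter by threshold, draw a random sample, then load the tasks in batches. The following Python sketch illustrates that flow under stated assumptions. The function names and data are hypothetical, the comparison used for the threshold (greater than or equal) is an assumption, and none of this is the component's actual code.

```python
import random


def filter_pairs(scored_pairs, threshold):
    """Keep the (id_a, id_b, score) tuples whose score reaches the threshold.

    Assumption: a pair scoring exactly the threshold is kept.
    """
    return [pair for pair in scored_pairs if pair[2] >= threshold]


def sample_pairs(pairs, number_of_pairs, seed=None):
    """Draw a random sample of pairs.

    With seed=None (the equivalent of leaving the random seed unset),
    each run can return a different sample.
    """
    rng = random.Random(seed)
    return rng.sample(pairs, min(number_of_pairs, len(pairs)))


def batches(tasks, max_tasks_per_commit):
    """Yield successive slices of at most max_tasks_per_commit tasks."""
    for start in range(0, len(tasks), max_tasks_per_commit):
        yield tasks[start:start + max_tasks_per_commit]


# Hypothetical scored pairs: (first record id, second record id, score).
scored = [(0, 1, 0.9), (0, 2, 0.1), (1, 2, 0.5)]
kept = filter_pairs(scored, threshold=0.3)
sample = sample_pairs(kept, number_of_pairs=2)

# 200 is the per-commit limit stated above for Talend Cloud Data Stewardship.
for batch in batches(sample, max_tasks_per_commit=200):
    print(len(batch), "task(s) in this commit")
```

Passing a fixed seed to sample_pairs in this sketch reproduces the same sample on every run, which mirrors what selecting a random seed is meant to achieve.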