Skip to main content Skip to complementary content

Matching several records

Blocking

To avoid doing a two-by-two comparison of all the input records, you can define one or many blocking keys to split the input dataset into smaller datasets called blocks.

In each block, the blocking keys must have the same value. Then, each block is processed independently.

Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh algorithms to process data. For example, if 100,000 records are split into 100 blocks of 1,000 records each, the number of comparisons are reduced by a factor 100. This means the algorithm runs around 100 times faster.

It is recommended to use the tGenKey component to generate blocking keys and to visualize the statistics regarding the number of blocks. In a Job, right-click the tGenKey component and select View Key Profile in the contextual menu to visualize the distribution of the number of blocks according to their size.

In this example, the average block size is around 40.

For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks (13 × 382). If records are compared with four columns, this means there will be 75,088 string comparisons in these 13 blocks (18,772 × 4).

The Simple VSR Matcher algorithm

The Simple VSR Matcher algorithm compares each record within same block with the previous master records in the lookup table.

If a record does not match any of the previous master records, it is considered as a new master record and added to the lookup table. This means that the first record of the dataset is necessarily a master record. So, the order of the records is important and can have an impact on the creation process of the master records.

When a record matches a master record, the Simple VSR Matcher algorithm does not further attempt to match with other master records because all the master records in the lookup table are not similar. So, once a record matches a master record, the chance of matching another master record is low.

This means a record can only exist in one group of records and be linked to one master record.

For example, take the following set of records as input:

id fullName
1 John Doe
2 Donna Lewis
3 John B. Doe
4 Louis Armstrong

The algorithm processes the input records as follows:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the lookup table.
  2. The algorithm takes record 2 and compares it with record 1. Since it is not a match, record 2 is added to the lookup table.
  3. The algorithm takes record 3 and compares it with record 1 and record 2. Record 3 matches record 1. So, record 3 is added to the group of record 1.
  4. The algorithm takes record 4 and compares it with record 1 and record 2 but not with record 3, which is not a master record. Since it is not a match, record 4 is added to the lookup table.

The output will look like this:

id fullName Grp_ID Grp_Size Master Score GRP_QUALITY
1 John Doe 0 2 true 1.0 0.72
3 John B. Doe 0 0 false 0.72 0
2 Donna Lewis 1 1 true 1.0 1.0
4 Louis Armstrong 2 1 true 1.0 1.0

The T-Swoosh algorithm

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm, but it creates a master record instead of considering existing records to be master records.

The order of the input records does not impact the matching process.

Information noteRestriction: The T-Swoosh algorithm can only be used with the Standard component. When you need to use the Apache Spark Batch component, use the Simple VSR Matcher algorithm.

To create master records, you can design survivorship rules to decide which attribute will survive.

There are two types of survivorship rules:

  • The rules related to matching keys: each attribute used as a matching key can have a specific survivorship rule.
  • The default rules: they are applied to all the attributes of the same data type (Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column is applied.

If the column is not a matching key, the default survivorship rule for this data type is applied. If the default survivorship rule is not defined for the data type, the Most common survivorship function is used.

Each time two records are merged to create a new master record, this new master record is added to the queue of records to be examined. The two records that are merged are removed from the lookup table.

For example, take the following set of records as input:

id fullName
1 John Doe
2 Donna Lewis
3 John B. Doe
4 Johnnie B. Doe

The survivorship rule uses the Concatenate function with "," as a parameter to separate values.

At the beginning of the process, the queue contains all the input records and the lookup is empty. To process the input records, the algorithm iterates until the queue is empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the set of master records. The queue contains now record 2, record 3 and record 4. The lookup contains record 1.
  2. The algorithm takes record 2 and compares it with record 1. Since record 2 does not match any record, it is added to the set of master records. The queue contains now record 3 and record 4. The lookup contains record 1 and record 2.
  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches record 1. So, record 1 and record 3 are merged to create a new master record called record 1,3. The queue contains now record 4 and record 1,3. The lookup contains record 2.
  4. The algorithm takes record 4 and compares it with record 2. Since it is not a match, record 4 is added to the set of master records. The queue contains now record 1,3. The lookup table contains record 2 and record 4.
  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record 1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new master record called record 1,3,4. Record 4 is removed from the lookup table. Since record 1,3 was the result of a previous merge, it is removed from the table. The queue now contains record 1,3,4. The lookup contains record 2.
  6. The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a match, record 1,3,4 is added to the set of master records. The queue is now empty. The lookup contains records 1,3,4 and record 2.

The output will look like this:

id fullName GRP_ID GRP_SIZE MASTER SCORE GRP_QUALITY
1,3,4 John Doe, John B. Doe, Johnnie B. Doe 0 3 true 1.0 0.449
1 John Doe 0 0 false 0.72 0
3 John B. Doe 0 0 false 0.72 0
4 Johnnie B. Doe 0 0 false 0.78 0
2 Donna Lewis 1 1 true 1.0 1.0

As you can see in this example, the value in the GRP_QUALITY column can be less than the Match Threshold parameter. That is because a group is created from record pairs with a matching score greater than or equal to the Match Threshold but the records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the group.

The differences between the Simple VSR Matcher and the T-Swoosh algorithms

When processing the input data using the T-Swoosh algorithm, there may be more iterations than the number of input records because a merged record may be created on each iteration and added to the queue.

This is one of the main differences between the Simple VSR Matcher and the T-Swoosh algorithms.

When comparing a record with a master record, the T-Swoosh algorithm makes more comparisons per iteration than the Simple VSR matcher algorithm:
  • When using the Simple VSR matcher algorithm, the record from the queue is only compared with the value of the master record. There is no comparison between the record from the queue and the value of each of the records used to build this master record. Then, sort the input records so that the most trustworthy records appear first in the input data.
  • When using the T-Swoosh algorithm, the record from the queue is compared with the value of the master record and the value of each of the records used to build this master record, until records are considered a match.

    For an example of how to survive master records using the T-Swoosh algorithm, see The T-Swoosh algorithm.

    In this example, the record "John Doe, John B. Doe" is compared with the record "John B. Doe" on iteration 5. There is a match if at least one of the three strings "John Doe, John B. Doe", "John Doe" and "John B. Doe" matches the string "Johnnie B. Doe".

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!