Matching several records
Blocking
In each block, the blocking keys must have the same value. Then, each block is processed independently.
Using blocking keys reduces the time needed by the Simple VSR Matcher and the T-Swoosh algorithms to process data. For example, if 100,000 records are split into 100 blocks of 1,000 records each, the number of comparisons are reduced by a factor 100. This means the algorithm runs around 100 times faster.
It is recommended to use the tGenKey component to generate blocking keys and to visualize the statistics regarding the number of blocks. In a Job, right-click the tGenKey component and select View Key Profile in the contextual menu to visualize the distribution of the number of blocks according to their size.
In this example, the average block size is around 40.
For the 13 blocks with 38 rows, there are 18,772 comparisons in these 13 blocks (13 × 382). If records are compared with four columns, this means there will be 75,088 string comparisons in these 13 blocks (18,772 × 4).
The Simple VSR Matcher algorithm
If a record does not match any of the previous master records, it is considered as a new master record and added to the lookup table. This means that the first record of the dataset is necessarily a master record. So, the order of the records is important and can have an impact on the creation process of the master records.
When a record matches a master record, the Simple VSR Matcher algorithm does not further attempt to match with other master records because all the master records in the lookup table are not similar. So, once a record matches a master record, the chance of matching another master record is low.
This means a record can only exist in one group of records and be linked to one master record.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Louis Armstrong |
The algorithm processes the input records as follows:
- The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the lookup table.
- The algorithm takes record 2 and compares it with record 1. Since it is not a match, record 2 is added to the lookup table.
- The algorithm takes record 3 and compares it with record 1 and record 2. Record 3 matches record 1. So, record 3 is added to the group of record 1.
- The algorithm takes record 4 and compares it with record 1 and record 2 but not with record 3, which is not a master record. Since it is not a match, record 4 is added to the lookup table.
The output will look like this:
id | fullName | Grp_ID | Grp_Size | Master | Score | GRP_QUALITY |
---|---|---|---|---|---|---|
1 | John Doe | 0 | 2 | true | 1.0 | 0.72 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
4 | Louis Armstrong | 2 | 1 | true | 1.0 | 1.0 |
The T-Swoosh algorithm
The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm, but it creates a master record instead of considering existing records to be master records.
The order of the input records does not impact the matching process.
To create master records, you can design survivorship rules to decide which attribute will survive.
There are two types of survivorship rules:
- The rules related to matching keys: each attribute used as a matching key can have a specific survivorship rule.
- The default rules: they are applied to all the attributes of the same data type (Boolean, String, Date, Number).
If a column is a matching key, the rule related to matching keys specific to this column is applied.
If the column is not a matching key, the default survivorship rule for this data type is applied. If the default survivorship rule is not defined for the data type, the Most common survivorship function is used.
Each time two records are merged to create a new master record, this new master record is added to the queue of records to be examined. The two records that are merged are removed from the lookup table.
For example, take the following set of records as input:
id | fullName |
---|---|
1 | John Doe |
2 | Donna Lewis |
3 | John B. Doe |
4 | Johnnie B. Doe |
The survivorship rule uses the Concatenate function with "," as a parameter to separate values.
At the beginning of the process, the queue contains all the input records and the lookup is empty. To process the input records, the algorithm iterates until the queue is empty:
- The algorithm takes record 1 and compares it with an empty set of records. Since record 1 does not match any record, it is added to the set of master records. The queue contains now record 2, record 3 and record 4. The lookup contains record 1.
- The algorithm takes record 2 and compares it with record 1. Since record 2 does not match any record, it is added to the set of master records. The queue contains now record 3 and record 4. The lookup contains record 1 and record 2.
- The algorithm takes record 3 and compares it with record 1. Record 3 matches record 1. So, record 1 and record 3 are merged to create a new master record called record 1,3. The queue contains now record 4 and record 1,3. The lookup contains record 2.
- The algorithm takes record 4 and compares it with record 2. Since it is not a match, record 4 is added to the set of master records. The queue contains now record 1,3. The lookup table contains record 2 and record 4.
- The algorithm takes record 1,3 and compares it with record 2 and record 4. Record 1,3 matches record 4. So, record 1,3 and record 4 are merged to create a new master record called record 1,3,4. Record 4 is removed from the lookup table. Since record 1,3 was the result of a previous merge, it is removed from the table. The queue now contains record 1,3,4. The lookup contains record 2.
- The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a match, record 1,3,4 is added to the set of master records. The queue is now empty. The lookup contains records 1,3,4 and record 2.
The output will look like this:
id | fullName | GRP_ID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|
1,3,4 | John Doe, John B. Doe, Johnnie B. Doe | 0 | 3 | true | 1.0 | 0.449 |
1 | John Doe | 0 | 0 | false | 0.72 | 0 |
3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
4 | Johnnie B. Doe | 0 | 0 | false | 0.78 | 0 |
2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |
As you can see in this example, the value in the GRP_QUALITY column can be less than the Match Threshold parameter. That is because a group is created from record pairs with a matching score greater than or equal to the Match Threshold but the records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the group.
The differences between the Simple VSR Matcher and the T-Swoosh algorithms
This is one of the main differences between the Simple VSR Matcher and the T-Swoosh algorithms.
- When using the Simple VSR matcher algorithm, the record from the queue is only compared with the value of the master record. There is no comparison between the record from the queue and the value of each of the records used to build this master record. Then, sort the input records so that the most trustworthy records appear first in the input data.
- When using the T-Swoosh algorithm, the record from the queue is compared with the
value of the master record and the value of each of the records used to build this
master record, until records are considered a match.
For an example of how to survive master records using the T-Swoosh algorithm, see The T-Swoosh algorithm.
In this example, the record "John Doe, John B. Doe" is compared with the record "John B. Doe" on iteration 5. There is a match if at least one of the three strings "John Doe, John B. Doe", "John Doe" and "John B. Doe" matches the string "Johnnie B. Doe".