Multi-pass matching
You can design a Job with consecutive tMatchGroup components to create data partitions based on different blocking keys.
The idea behind multi-pass matching is to reuse the master records defined in the previous pass as the input of the current tMatchGroup component. Multi-pass matching is most effective when the blocking keys are nearly uncorrelated. For example, it makes little sense to define the column "country" as one blocking key and the column "city" as another, because every comparison made with the blocking key "city" is also made with the blocking key "country".
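The overlap between correlated blocking keys can be illustrated with a minimal Python sketch. It is not part of tMatchGroup: the sample records and the `candidate_pairs` helper are hypothetical, and the sketch assumes that each city belongs to exactly one country.

```python
from itertools import combinations

# Hypothetical sample records, used only to illustrate the overlap.
records = [
    {"id": 1, "country": "France", "city": "Nantes"},
    {"id": 2, "country": "France", "city": "Nantes"},
    {"id": 3, "country": "France", "city": "Paris"},
    {"id": 4, "country": "Germany", "city": "Berlin"},
]

def candidate_pairs(rows, key):
    """Return the set of id pairs that share the same value for the blocking key."""
    blocks = {}
    for row in rows:
        blocks.setdefault(row[key], []).append(row["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

city_pairs = candidate_pairs(records, "city")
country_pairs = candidate_pairs(records, "country")

# Every pair compared inside a city block is also compared inside a country block.
print(city_pairs <= country_pairs)          # True
print(sorted(country_pairs - city_pairs))   # [(1, 3), (2, 3)]
```

With these sample records, a pass on "city" adds no comparison that a pass on "country" does not already make, which is why uncorrelated blocking keys are the better choice.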
When using multi-pass matching with the Simple VSR matcher algorithm, only master records of size 1 (that is, records that did not match any other record) are compared with master records of any size. Two master records that are each derived from at least two children are never compared with each other.
In the following example, you want to find duplicates having either the same city or the same ZIP code in a customer database. You can use two consecutive tMatchGroup components to process the data partitions. The dataset contains four records. It is assumed that the first tMatchGroup component has a blocking key on the column ZipCode, and the second tMatchGroup component has a blocking key on the column city. The column name is used as a matching key.
id | name | city | ZipCode |
---|---|---|---|
1 | John Doe | Nantes | 44000 |
2 | John B. Doe | Nantes | _ |
3 | Jon Doe | Nantes | 44000 |
4 | John Doe | Nantes | _ |
The _ character in the ZipCode column represents an empty value. The ZipCode is not provided for records 2 and 4.
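To make the two partitions concrete, here is a minimal Python sketch that splits the four records above by each blocking key. The `partition` helper is hypothetical, and the sketch assumes that records with an empty ZipCode fall into the same block, which is consistent with records 2 and 4 being grouped in the first pass.

```python
from collections import defaultdict

# The four records from the table above; None stands for the empty ZipCode (_).
customers = [
    {"id": 1, "name": "John Doe", "city": "Nantes", "ZipCode": "44000"},
    {"id": 2, "name": "John B. Doe", "city": "Nantes", "ZipCode": None},
    {"id": 3, "name": "Jon Doe", "city": "Nantes", "ZipCode": "44000"},
    {"id": 4, "name": "John Doe", "city": "Nantes", "ZipCode": None},
]

def partition(rows, blocking_key):
    """Group record ids by the value of the blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_key]].append(row["id"])
    return dict(blocks)

zip_blocks = partition(customers, "ZipCode")   # two blocks: ids [1, 3] and ids [2, 4]
city_blocks = partition(customers, "city")     # one block containing all four ids

print(zip_blocks)
print(city_blocks)
```

Blocking on ZipCode separates records 1 and 3 from records 2 and 4, while blocking on city puts all four records in one partition.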
After the first pass, records 1 and 3 are grouped, and records 2 and 4 are grouped. In these groups, record 1 and record 2 are master records.
In the second tMatchGroup, only the master records from the first pass, record 1 and record 2, are candidates for comparison. However, since the group size of each of them is strictly greater than 1, they are not compared with each other. This is why the order in which the input records are sorted is very important.
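The comparison rule applied in the second pass can be sketched in a few lines of Python. This is only an illustration of the rule as described for the Simple VSR matcher, not the component's implementation: the `is_compared` and `second_pass_pairs` helpers are hypothetical, and both master records are assumed to be in the same block because they share the city Nantes.

```python
from itertools import combinations

# Master records after the first pass, as stated in the text: id -> group size.
first_pass_masters = {1: 2, 2: 2}

def is_compared(size_a, size_b):
    """Two masters are compared only if at least one of them has a group size of 1."""
    return size_a == 1 or size_b == 1

def second_pass_pairs(masters):
    """List the pairs of master ids that the second pass would compare."""
    return [
        (a, b)
        for a, b in combinations(sorted(masters), 2)
        if is_compared(masters[a], masters[b])
    ]

# Both masters already have two records each, so no comparison takes place.
print(second_pass_pairs(first_pass_masters))   # []

# Hypothetical variation: if record 2 had remained unmatched (size 1),
# it would be compared with master record 1.
print(second_pass_pairs({1: 2, 2: 1}))         # [(1, 2)]
```

The hypothetical variation shows why the outcome depends on how the groups were formed in the first pass.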
The following results are returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 2 | true | 1.0 | 0.875 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.85 | 0 |
2 | John B. Doe | Nantes | _ | 1 | 2 | true | 1.0 | 0.727 |
4 | John Doe | Nantes | _ | 1 | 0 | false | 0.72 | 0 |
The _ character in the ZipCode column represents an empty value. As the ZipCode was not provided in the input, the ZipCode column is empty for records 2 and 4.
When running the T-Swoosh algorithm with the same parameters and the Most common survivorship function, the following results are returned:
id | name | city | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
---|---|---|---|---|---|---|---|---|
1 | John Doe | Nantes | 44000 | 0 | 4 | true | 1.0 | 0.727 |
1 | John Doe | Nantes | 44000 | 0 | 0 | true | 0.875 | 0 |
3 | Jon Doe | Nantes | 44000 | 0 | 0 | false | 0.875 | 0 |
2 | John B. Doe | Nantes | _ | 0 | 0 | true | 0.72 | 0 |
4 | John Doe | Nantes | _ | 1 | 0 | false | 0.72 | 0 |