
Multi-pass matching

You can design a Job with consecutive tMatchGroup components to create data partitions based on different blocking keys.

For example, suppose you want to find duplicates that have either the same city or the same zip code in a customer database. In this case, you can use two consecutive tMatchGroup components to process the data partitions:

  • One tMatchGroup component in which the column "city" is defined as a blocking key.
  • One tMatchGroup component in which the column "ZipCode" is defined as a blocking key.
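
As a rough illustration of what the two blocking keys do, the sketch below partitions a few records first by "city" and then by "ZipCode". It is plain Python, not Talend code: the helper partition_by_key and the sample records are hypothetical and exist only for this example.

```python
from collections import defaultdict

def partition_by_key(records, key):
    """Group record ids into blocks that share the same value for `key`."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record["id"])
    return dict(blocks)

# Hypothetical customer records, for illustration only.
customers = [
    {"id": 1, "city": "Nantes", "ZipCode": "44000"},
    {"id": 2, "city": "Nantes", "ZipCode": "44300"},
    {"id": 3, "city": "Saint-Herblain", "ZipCode": "44800"},
    {"id": 4, "city": "St-Herblain", "ZipCode": "44800"},
]

city_blocks = partition_by_key(customers, "city")
zip_blocks = partition_by_key(customers, "ZipCode")

print(city_blocks)
print(zip_blocks)
```

In this sample, records 1 and 2 share a block only under "city", while records 3 and 4 share a block only under "ZipCode" because the city name is spelled differently. Each blocking key therefore exposes candidate pairs that the other one misses.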

What is multi-pass matching?

The idea behind multi-pass matching is to reuse the master records produced by the previous pass as the input of the current tMatchGroup component. Multi-pass matching is most effective when the blocking keys are weakly correlated. For example, it is pointless to define the column "country" as one blocking key and the column "city" as another, because every comparison made with the blocking key "city" is also made with the blocking key "country".
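
The redundancy of correlated blocking keys can be checked on a small sample. The sketch below is plain Python with hypothetical data and a hypothetical helper, candidate_pairs, and it assumes that each city belongs to exactly one country. It collects the record pairs that fall into the same block for each key and tests whether the "city" pairs are a subset of the "country" pairs.

```python
from itertools import combinations

def candidate_pairs(records, key):
    """Return the set of record-id pairs that share the same value for `key`."""
    blocks = {}
    for record in records:
        blocks.setdefault(record[key], []).append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records; every city lies in a single country.
records = [
    {"id": 1, "country": "France", "city": "Nantes"},
    {"id": 2, "country": "France", "city": "Nantes"},
    {"id": 3, "country": "France", "city": "Paris"},
    {"id": 4, "country": "Germany", "city": "Berlin"},
]

city_pairs = candidate_pairs(records, "city")
country_pairs = candidate_pairs(records, "country")

# True when the "city" pass adds no pair beyond the "country" pass.
redundant = city_pairs <= country_pairs
print(city_pairs, country_pairs, redundant)
```

Under that assumption, the "city" pass proposes no pair that the "country" pass has not already proposed, so running both only repeats work.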

When multi-pass matching is used with the Simple VSR matcher algorithm, only master records of size 1 (records that did not match any other record) are compared with master records of any size. Two master records that each derive from at least two children are never compared with each other.
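
The rule above can be written as a one-line predicate. This is only a sketch of the rule as this page describes it, not the actual Talend implementation, and the function name are_compared is hypothetical.

```python
def are_compared(group_size_a, group_size_b):
    """Return True if two master records are compared in a later pass.

    Sketch of the Simple VSR multi-pass rule described in the text:
    a comparison happens only when at least one master has group size 1.
    """
    return group_size_a == 1 or group_size_b == 1

print(are_compared(1, 1))  # two unmatched records
print(are_compared(1, 3))  # unmatched record against a group master
print(are_compared(2, 2))  # two group masters
```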

An example of multi-pass matching

In the following example, the dataset contains four records. It is assumed that the first tMatchGroup component has a blocking key on the column "ZipCode", and the second tMatchGroup component has a blocking key on the column "city". The attribute "name" is used as a matching key.

id | name        | city   | ZipCode
1  | John Doe    | Nantes | 44000
2  | John B. Doe | Nantes |
3  | Jon Doe     | Nantes | 44000
4  | John Doe    | Nantes |

After the first pass, records 1 and 3 form one group, and records 2 and 4 form another. In these groups, record 1 and record 2 are the master records.

In the second tMatchGroup, only the master records from the first pass, record 1 and record 2, are candidates for comparison. Since both have a group size strictly greater than 1, they are not compared with each other. The order in which the input records are sorted is therefore very important.

The following results are returned:

id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY
1  | John Doe    | Nantes | 44000   | 0   | 2        | true   | 1.0   | 0.875
3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.85  | 0
2  | John B. Doe | Nantes |         | 1   | 2        | true   | 1.0   | 0.727
4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0
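
The second pass of the Simple VSR example can be traced with a short sketch. It is plain Python, not Talend code: the master records and their group sizes are taken from the example, the function second_pass_comparisons is hypothetical, and the final variation is an invented case added only for contrast.

```python
from itertools import combinations

# Masters after the first pass, as described in the example.
# Each entry is (record id, city, group size).
first_pass_masters = [
    (1, "Nantes", 2),
    (2, "Nantes", 2),
]

def second_pass_comparisons(masters):
    """List the master pairs compared in the second pass (blocking key: city)."""
    compared = []
    for (id_a, city_a, size_a), (id_b, city_b, size_b) in combinations(masters, 2):
        same_block = city_a == city_b
        one_is_singleton = size_a == 1 or size_b == 1
        if same_block and one_is_singleton:
            compared.append((id_a, id_b))
    return compared

comparisons = second_pass_comparisons(first_pass_masters)

# Hypothetical variation: record 2 left unmatched after the first pass.
variation = second_pass_comparisons([(1, "Nantes", 2), (2, "Nantes", 1)])

print(comparisons)
print(variation)
```

With the group sizes from the example, no pair qualifies, which is why the two groups stay separate. In the invented variation, where record 2 has group size 1, the pair (1, 2) would be compared.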

When running the T-Swoosh algorithm with the same parameters and the Most common survivorship function, the following results are returned:

id | name        | city   | ZipCode | GID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY
1  | John Doe    | Nantes | 44000   | 0   | 4        | true   | 1.0   | 0.727
1  | John Doe    | Nantes | 44000   | 0   | 0        | true   | 0.875 | 0
3  | Jon Doe     | Nantes | 44000   | 0   | 0        | false  | 0.875 | 0
2  | John B. Doe | Nantes |         | 0   | 0        | true   | 0.72  | 0
4  | John Doe    | Nantes |         | 1   | 0        | false  | 0.72  | 0
