
Improving a matching model

You can improve a matching model by changing the settings of the tMatchModel component.

Because the results depend on your database, there are no ideal settings. The purpose of the following tests is to show that setting the parameters differently can improve the model quality.

Important: Changing the settings can also reduce the model quality.

In the following examples, we use a database of childcare centers that contains the following input data:
  • The site name,
  • The address and
  • The source of the previous data.

The reference settings for the tested parameters appear in the Reference setting column of the table below.

The tests used the following method: one parameter was changed at a time. If the model quality increased, the new setting was kept and the next parameter was changed. This method makes it easy to see how each parameter impacts the model.
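The one-at-a-time method described above can be sketched as a small greedy search. This is a minimal illustration in Python, not how tMatchModel works internally: `tune_one_at_a_time`, `toy_quality`, and the dictionary keys are hypothetical names, and the toy scoring function is a made-up stand-in for running a Job and reading its model quality.

```python
from typing import Any, Callable, Dict, List, Tuple

Settings = Dict[str, Any]


def tune_one_at_a_time(
    reference: Settings,
    candidates: Dict[str, List[Any]],
    evaluate: Callable[[Settings], float],
) -> Tuple[Settings, float]:
    """Change one parameter at a time and keep a change only if quality improves."""
    best = dict(reference)
    best_score = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            trial = {**best, name: value}  # differs from `best` in one parameter only
            score = evaluate(trial)
            if score > best_score:
                # The new setting is kept; later trials build on it.
                best, best_score = trial, score
    return best, best_score


def toy_quality(settings: Settings) -> float:
    """Hypothetical stand-in for a Job run. The scores are invented for the demo."""
    score = 0
    score += settings["impurity"] == "entropy"
    score += settings["max_bins"] == 79
    score += settings["subsampling_rate"] == 1.0
    return float(score)


# Assumed key names; in practice these would be the component's parameters.
reference = {"subsampling_rate": 1.0, "impurity": "gini", "max_bins": 32}
candidates = {
    "subsampling_rate": [0.5],
    "impurity": ["entropy"],
    "max_bins": [15, 79],
}

best_settings, best_score = tune_one_at_a_time(reference, candidates, toy_quality)
print(best_settings, best_score)
```

In a real test, `evaluate` would be replaced by running the Job with the trial settings and reading the resulting model quality. Because each change builds on the settings kept so far, the final result can depend on the order in which the parameters are tried.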

Only the settings changed; the matching keys did not. As shown in Analyzing the heat map, changing the matching key impacts the model quality, so Address and Site name were kept as the matching keys throughout.

For more information on the parameters, see their description in the tMatchModel properties for Apache Spark Batch.

After running multiple Jobs, the highest model quality obtained was 0.942.

The following table shows the settings that have been tested:
| Parameters | Reference setting | Tested settings | The model quality is better when set to |
|---|---|---|---|
| Number of trees range (1) | 5 to 15 | 5 to 20, 5 to 30, 5 to 50, 5 to 100 | 5 to 30, 5 to 50 or 5 to 100 |
| Subsampling Rate | 1.0 | 0.5 | 1.0 |
| Impurity | Gini | Entropy | Entropy |
| Max Bins | 32 | 15 and 79 | 79 |
| Subset strategy | auto | All (auto, all, sqrt and log2) | auto |
| Min Instances per Node | 1 | 3 and 10 | 1 |

(1) The larger the range of the hyper-parameters (number of trees and tree depth), the longer the Job takes to run.

Note that the Evaluation metric type parameter was not changed; it remained set to F1. Because each evaluation metric type is calculated differently, scores obtained with different metric types cannot be compared with one another, so changing this setting is not relevant in these examples.
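To illustrate why scores from different metric types are not comparable, here is a minimal sketch that computes F1 (the harmonic mean of precision and recall) and accuracy from the same prediction counts. It assumes a simple binary match/no-match case. The counts are invented for the example, and accuracy is used only as a familiar second metric; this page does not list the other metric types the component offers.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall (binary case)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all pairs that were labeled correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Invented counts: 8 true matches found, 2 false matches,
# 2 missed matches, 88 correctly rejected pairs.
tp, fp, fn, tn = 8, 2, 2, 88

print("F1:      ", round(f1_score(tp, fp, fn), 3))
print("Accuracy:", round(accuracy(tp, fp, fn, tn), 3))
```

The same predictions give two different numbers because accuracy counts the many correctly rejected pairs while F1 ignores them. A change in score caused by switching the metric therefore says nothing about whether the model itself improved.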

During the tests, no single setting raised the model quality from 0.917 to 0.942; the combination of the different settings did.

The preceding results apply to a specific database. Depending on your database, changing the settings as above may not have the same impact. The purpose is to show that, even if the model quality is already satisfactory, you can try other settings to improve the matching model.
