Improving a matching model
You can improve a matching model by changing the settings of the tMatchModel component.
Because the result depends on your database, there are no ideal settings. The purpose of the following tests is to show that setting the parameters differently can improve the model quality.
- The site name,
- The address and
- The source of the previous data.
To perform these tests, the parameters were changed one at a time. If the model quality increased, the new setting was kept and the next parameter was changed. This method makes it easy to see how each parameter impacts the model.
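The one-at-a-time method can be sketched in Python as follows. This is only an illustration of the procedure, not how tMatchModel works internally: the function names, the parameter keys, and the `toy_quality` scoring function are hypothetical, and the scores it returns are invented stand-ins for the model quality you would read after running a Job.

```python
from typing import Any, Callable, Dict, List, Tuple

Settings = Dict[str, Any]


def tune_one_at_a_time(
    reference: Settings,
    candidates: Dict[str, List[Any]],
    evaluate: Callable[[Settings], float],
) -> Tuple[Settings, float]:
    """Change one parameter at a time; keep a change only if quality improves."""
    best = dict(reference)
    best_score = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            # Try the new value on top of the best settings found so far.
            trial = {**best, name: value}
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_quality(settings: Settings) -> float:
    """Hypothetical stand-in for running a Job and reading the model quality.

    The numbers are invented for the example and are not test results.
    """
    score = 0.5
    if settings["impurity"] == "entropy":
        score += 0.1
    if settings["max_bins"] == 79:
        score += 0.1
    if settings["subsampling_rate"] < 1.0:
        score -= 0.1
    return score


# Hypothetical keys mirroring some of the component parameters.
reference = {"subsampling_rate": 1.0, "impurity": "gini", "max_bins": 32}
candidates = {
    "subsampling_rate": [0.5],
    "impurity": ["entropy"],
    "max_bins": [15, 79],
}

best_settings, best_quality = tune_one_at_a_time(reference, candidates, toy_quality)
print(best_settings, best_quality)
```

In practice, `evaluate` would correspond to running the Job with the trial settings and noting the resulting model quality by hand. Because each change is tested on top of the settings already kept, the order in which parameters are tried can influence the final result.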
Only the settings were changed between tests. As shown in Analyzing the heat map, changing the matching key also impacts the model quality, so the matching keys were kept the same throughout: Address and Site name.
For more information on the parameters, see their description in the tMatchModel properties for Apache Spark Batch.
After running multiple Jobs, the highest model quality obtained was 0.942.
Parameters | Reference setting | Tested settings | The model quality is better when set to |
---|---|---|---|
Number of trees range 1 | 5 to 15 | 5 to 20, 5 to 30, 5 to 50, 5 to 100 | 5 to 30, 5 to 50 or 5 to 100 |
Subsampling Rate | 1.0 | 0.5 | 1.0 |
Impurity | Gini | Entropy | Entropy |
Max Bins | 32 | 15 and 79 | 79 |
Subset strategy | auto | All (auto, all, sqrt and log2) | auto |
Min Instances per Node | 1 | 3 and 10 | 1 |
1 The larger the range of the hyperparameters (number of trees and tree depth), the longer the Job takes to run.
Note that the Evaluation metric type parameter was not changed. It remained set to F1. Because each evaluation metric type is calculated differently, quality scores obtained with different metric types cannot be compared, so changing this setting is not relevant in these examples.
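As a reminder of what an F1 value measures, here is a minimal sketch of the standard F1 definition: the harmonic mean of precision and recall. It assumes that definition only. How the component computes the metric internally is not described here, and the function name and its arguments are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct positive prediction: precision and recall are both 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Example with made-up counts: 8 pairs correctly labeled as matches,
# 2 wrongly labeled as matches, 2 real matches missed.
print(f1_score(8, 2, 2))
```

Another metric type would combine these counts differently, which is why scores obtained with different metric types should not be compared with each other.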
During the tests, no single setting raised the model quality from 0.917 to 0.942 on its own. The combination of the different settings did.
The preceding results apply to a specific database. Depending on your database, changing the settings as above may not have the same impact. The purpose is to show that, even when the model quality is already satisfactory, you can try other settings to improve the matching model.