Tuning hyper-parameters and using K-fold cross-validation to improve the matching model
Testing the model using the K-fold cross-validation technique
The K-fold cross-validation technique assesses how well the model will perform on an independent dataset.
To test the model, the dataset is split into k subsets and the Random forest algorithm is run k times:
- At each iteration, one of the k subsets is retained as the validation set and the remaining k-1 subsets form the training set.
- A score is computed for each of the k runs, and these scores are then averaged to produce a global score.
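For illustration, the following minimal sketch reproduces this procedure with scikit-learn's cross_val_score; the synthetic dataset and the Random forest settings are stand-ins, not the actual matching data or configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a matching dataset: each row is a record pair
# described by similarity features, labeled match (1) or no match (0).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Illustrative hyper-parameter values, not prescribed defaults.
model = RandomForestClassifier(n_estimators=20, max_depth=8, random_state=42)

# k=5: each subset serves once as the validation set while the other
# k-1 subsets form the training set.
scores = cross_val_score(model, X, y, cv=5)

# The global score is the average of the k per-fold scores.
print("Per-fold scores:", scores)
print(f"Global score: {scores.mean():.3f}")
```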
Tuning the Random forest algorithm hyper-parameters using grid search
You can specify values for the two Random forest algorithm hyper-parameters:
- The number of decision trees
- The maximum depth of a decision tree
To tune the hyper-parameters and improve the quality of the model, grid search builds one model for each combination of the two Random forest hyper-parameter values within the limits you specify.
For example:
- The number of trees ranges from 5 to 50 with a step of 5; and
- the tree depth ranges from 5 to 10 with a step of 1.
In this example, there are 60 different combinations (10 values for the number of trees × 6 values for the depth).
Only the combination of the two hyper-parameter values that trains the best model is retained; the quality of each candidate model is the score reported by the K-fold cross-validation.
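For illustration, the following sketch reproduces this example grid with scikit-learn's GridSearchCV; the synthetic dataset is a stand-in, and the fold count k=5 is an assumed value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the matching dataset (placeholder, not real data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# The grid from the example above:
# 10 values for the number of trees x 6 values for the depth = 60 combinations.
param_grid = {
    "n_estimators": list(range(5, 55, 5)),  # 5, 10, ..., 50
    "max_depth": list(range(5, 11)),        # 5, 6, ..., 10
}

# Each of the 60 candidate models is scored with K-fold cross-validation
# (here k=5); only the best-scoring combination is retained.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best combination:", search.best_params_)
print(f"Its K-fold cross-validation score: {search.best_score_:.3f}")
```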