tKMeansModel properties for Apache Spark Batch
These properties are used to configure tKMeansModel running in the Spark Batch Job framework.
The Spark Batch tKMeansModel component belongs to the Machine Learning family.
This component is available in Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Vector to process |
Select the input column used to provide feature vectors. Very often, this column is the output of the feature engineering computations performed by tModelEncoder. |
Save the model on file system |
Select this check box to store the model in a given file system. Otherwise, the model is stored in memory. The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn or the Spark Standalone mode, ensure that you have properly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration. |
Number of clusters (K) |
Enter the number of clusters into which you want tKMeansModel to group data. In general, a larger number of clusters can decrease prediction errors but increases the risk of overfitting. Therefore, it is recommended to choose a reasonable number based on how many potential clusters you think the data to be processed might contain, for example by observation. |
Set distance threshold of the convergence (Epsilon) |
Select this check box and in the Epsilon field that is displayed, enter the convergence distance you want to use. The model training is considered complete once every cluster center moves less than this distance between iterations. If you leave this check box clear, the default convergence distance 0.0001 is used. |
Set the maximum number of runs |
Select this check box and in the Maximum number of runs field that is displayed, enter the number of iterations you want the Job to perform to train the model. If you leave this check box clear, the default value 20 is used. |
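The two stopping conditions above, the convergence distance (Epsilon) and the maximum number of runs, can be illustrated with a plain-Python toy K-Means loop on 1-D data. This is only a sketch of the general algorithm, not Talend's or Spark's implementation; the function name `kmeans` and its parameters are chosen for illustration.

```python
import random

def kmeans(points, k, epsilon=1e-4, max_runs=20, seed=0):
    """Toy 1-D K-Means: stops when every center moves less than
    epsilon, or after max_runs iterations, whichever comes first."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_runs):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Recompute centers and measure how far they moved.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:   # convergence distance reached
            break
    return sorted(centers)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, k=2))
```

A tighter epsilon or a higher maximum number of runs makes the loop iterate longer before stopping, which is exactly the trade-off the two settings above control.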
Set the number of parallelized runs |
This setting is not available from Apache Spark 3.0 onwards. Select this check box and in the Number of parallelized runs field that is displayed, enter the number of runs you want the Job to execute in parallel. If you leave this check box clear, the default value 1 is used, meaning that the runs are executed in succession. Note that this parameter helps you optimize the use of your resources for the computations but does not impact the prediction performance of the model. |
Initialization function |
Select the mode to be used to select the points as initial cluster centers. |
Set the number of steps for the initialization |
Select this check box and in the Steps field that is displayed, enter the number of initialization rounds to be run for the optimal initialization result. If you leave this check box clear, the default value 5 is used. 5 rounds are almost always enough for the K-Means|| mode to obtain the optimal result. |
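The role of the initialization rounds can be sketched in plain Python. The following is a heavily simplified illustration of the K-Means|| idea (grow a candidate set over several rounds by favoring points far from the current candidates, then reduce the candidates to K centers); it is not Spark's actual implementation, and the function name and sampling details are illustrative assumptions.

```python
import random

def kmeans_parallel_init(points, k, steps=5, seed=0):
    """Simplified K-Means|| sketch: over `steps` rounds, oversample
    candidate centers with probability proportional to their squared
    distance from the current candidates, then reduce to k centers."""
    rng = random.Random(seed)
    candidates = [rng.choice(points)]
    for _ in range(steps):
        d2 = [min((p - c) ** 2 for c in candidates) for p in points]
        total = sum(d2)
        if total == 0:
            break
        # Oversampling factor 2*k, as in the K-Means|| paper.
        new = [p for p, d in zip(points, d2)
               if rng.random() < 2 * k * d / total]
        candidates.extend(new)
    # Reduce the oversampled candidates to k centers: greedily keep
    # the candidate farthest from those already kept.
    centers = [candidates[0]]
    while len(centers) < k:
        far = max(candidates,
                  key=lambda p: min((p - c) ** 2 for c in centers))
        centers.append(far)
    return centers
```

More rounds give a larger, better-spread candidate set, which is why a handful of rounds (the default 5) is usually enough for a good starting point.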
Define the random seed |
Select this check box and in the Seed field that is displayed, enter the seed to be used for the initialization of the cluster centers. |
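Fixing the seed makes the otherwise random choice of initial centers reproducible, so repeated trainings of the same Job start from the same centers. A minimal sketch (the function name is hypothetical, not a Talend API):

```python
import random

def pick_initial_centers(points, k, seed):
    """With a fixed seed, the 'random' selection of initial
    centers is the same on every run."""
    return random.Random(seed).sample(points, k)

data = [0.5, 1.5, 2.5, 7.5, 8.5, 9.5]
a = pick_initial_centers(data, k=2, seed=42)
b = pick_initial_centers(data, k=2, seed=42)
print(a == b)  # prints True: same seed, same initial centers
```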
Advanced settings
Display the centers after the processing |
Select this check box to output the vectors of the cluster centers into the console of the Run view. This feature is often useful when you need to understand how the cluster centers move in the process of training your K-Means model. |
Usage
Usage rule |
This component is used as an end component and requires an input link. You can accelerate the training process by adjusting the stopping conditions such as the maximum number of runs or the convergence distance, but note that training that stops too early could degrade the model's performance. |
Model evaluation |
The parameters you need to set are free parameters, so their values may come from previous experiments, empirical guesses, or the like; no values are optimal for all datasets. Therefore, you need to train the model you are generating with different sets of parameter values until you obtain the best evaluation result. Note that you need to write the evaluation code yourself to score and rank your model. |
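One common score for comparing K-Means models is the within-set sum of squared errors (WSSSE): the sum, over all points, of the squared distance to the nearest cluster center. The following plain-Python sketch shows the idea; the `wssse` function and the example center sets are illustrative, not part of any Talend API.

```python
def wssse(points, centers):
    """Within-Set Sum of Squared Errors: lower is better when
    comparing models with the same number of clusters."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
# Compare two candidate models (their centers) on the same data:
good = [1.0, 5.0]   # one center per apparent cluster
bad = [3.0, 6.0]    # centers placed between/outside the clusters
print(wssse(data, good) < wssse(data, bad))  # prints True
```

In practice you would train models with several parameter sets (for example, several values of K), compute such a score for each, and keep the best-scoring model.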