KMeans2D() evaluates the rows of the chart by applying k-means clustering and, for each chart row, displays the id of the cluster that the data point has been assigned to. The columns used by the clustering algorithm are determined by the parameters coordinate_1 and coordinate_2, both of which are aggregations. The number of clusters to create is determined by the num_clusters parameter. Data can optionally be normalized using the norm parameter.
KMeans2D returns one value per data point. The returned value is a dual whose integer value identifies the cluster that the data point has been assigned to.
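To make "one cluster id per data point" concrete, the following is a minimal Python sketch of two-dimensional k-means (Lloyd's algorithm). It is an illustration only: the function name kmeans_2d, the farthest-point initialization, the iteration limit, and the 0-based ids are assumptions made for this example, and the sample points are made up. It does not describe how Qlik Sense implements KMeans2D internally.

```python
import math


def kmeans_2d(points, num_clusters, max_iter=100):
    """Return one cluster id per (x, y) point using Lloyd's algorithm.

    Hypothetical helper for illustration; not the Qlik implementation.
    """
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")

    # Assumed deterministic start: begin at the first point, then repeatedly
    # pick the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < num_clusters:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point gets the id of its nearest centroid.
        new_labels = [
            min(range(num_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Made-up sample points forming two well-separated groups.
sample = [(1.0, 0.2), (1.2, 0.3), (1.1, 0.1), (5.0, 1.8), (5.2, 2.0), (4.9, 1.7)]
print(kmeans_2d(sample, 2))  # [0, 0, 0, 1, 1, 1]
```

The result has the same length as the input, which mirrors how KMeans2D yields one value per chart row.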
coordinate_1
The aggregation that calculates the first coordinate, usually the x-axis of a scatter chart built from the chart data. The additional parameter, coordinate_2, calculates the second coordinate.
norm
The optional normalization method applied to datasets before KMeans clustering.
Possible values:
0 or 'none' for no normalization
1 or 'zscore' for z-score normalization
2 or 'minmax' for min-max normalization
If no parameter is supplied or if the supplied parameter is incorrect, no normalization is applied.
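The fallback rule can be sketched in Python as follows. The helper name resolve_norm is hypothetical, and exact, case-sensitive string matching is an assumption made for this example; the text above only states that a missing or incorrect value results in no normalization.

```python
def resolve_norm(norm=None):
    """Map a norm argument to a method name, falling back to 'none'.

    Hypothetical helper; exact string matching is an assumption.
    """
    by_number = {0: "none", 1: "zscore", 2: "minmax"}
    if isinstance(norm, str) and norm in by_number.values():
        return norm
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(norm, int) and not isinstance(norm, bool) and norm in by_number:
        return by_number[norm]
    return "none"  # missing or unrecognized value: no normalization


print(resolve_norm(1))         # zscore
print(resolve_norm("minmax"))  # minmax
print(resolve_norm(7))         # none
```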
Z-score normalization rescales each feature using its mean and standard deviation. It does not ensure that every feature has the same scale, but it is a better approach than min-max normalization when the data contains outliers.
Min-max normalization ensures that the features have the same scale by taking the minimum and maximum values of each feature and rescaling each data point relative to that range.
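The two methods can be compared with a short Python sketch using the textbook formulas. The function names are hypothetical, the sample values are made up, and the use of the population standard deviation is an assumption; the text above does not state which variant Qlik Sense uses.

```python
import statistics


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def minmax(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - low) / (high - low) for v in values]


# Made-up feature with one outlier (100).
feature = [1, 2, 3, 4, 100]
print(minmax(feature))
print(zscore(feature))
```

With min-max, the outlier defines the upper end of the range, so the remaining values are compressed into a narrow band near 0. With z-score, the output is not confined to a fixed range, which is why features are not guaranteed to share a scale.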
In this example, we create a scatter plot chart using the Iris dataset, and then use KMeans to color the data by expression.
We also create a variable for the num_clusters argument, and then use a variable input box to change the number of clusters.
The Iris data set is publicly available in a variety of formats. We have provided the data as an inline table to load using the data load editor in Qlik Sense. Note that we added an Id column to the data table for this example.
After loading the data in Qlik Sense, we do the following:
Drag a Scatter plot chart onto a new sheet. Name the chart Petal (color by expression).
Create a variable to specify the number of clusters. For the variable Name, enter KmeansPetalClusters. For the variable Definition, enter =2.
Configure Data for the chart:
Under Dimensions, choose id for the field for Bubble. Enter Cluster Id for the Label.
Under Measures, choose Sum([petal.length]) for the expression for X-axis.
Under Measures, choose Sum([petal.width]) for the expression for Y-axis.
Data settings for Petal (color by expression) chart
The data points are plotted on the chart.
Data points on Petal (color by expression) chart
Configure Appearance for the chart:
Under Colors and legend, choose Custom for Colors.
Choose to color the chart By expression.
Enter the following for Expression: kmeans2d($(KmeansPetalClusters), Sum([petal.length]), Sum([petal.width]))
Note that KmeansPetalClusters is the variable that we set to 2.
Alternatively, enter the following: kmeans2d(2, Sum([petal.length]), Sum([petal.width]))
Deselect the check box for The expression is a color code.
Enter the following for Label: Cluster Id
Appearance settings for Petal (color by expression) chart
The two clusters on the chart are colored by the KMeans expression.
Clusters colored by expression on Petal (color by expression) chart
Add a Variable input box for the number of clusters.
Under Custom objects in the Assets panel, choose Qlik Dashboard bundle. If we did not have access to the dashboard bundle, we could still change the number of clusters using the variable that we created, or directly as an integer in the expression.
Drag a Variable input box onto the sheet.
Under Appearance, click General.
Enter the following for Title: Clusters
Click Variable.
Choose the following variable for Name: KmeansPetalClusters.
Choose Slider for Show as.
Choose Values, and configure the settings as required.
Appearance for Clusters variable input box
When we are done editing, we can change the number of clusters using the slider in the Clusters variable input box.
Clusters colored by expression on Petal (color by expression) chart
Auto-clustering
KMeans functions support auto-clustering using a method called depth difference (DeD). When the number of clusters is set to 0, an optimal number of clusters is determined for the dataset. The resulting number of clusters (k) is calculated within the KMeans algorithm but is not explicitly returned. For example, if 0 is specified in the function, or if KmeansPetalClusters is set to 0 through the variable input box, cluster assignments are calculated automatically for the dataset based on an optimal number of clusters.
KMeans depth difference method determines optimal number of clusters when (k) is set to 0
Iris data set: Inline load for data load editor in Qlik Sense