Machine Learning is a broad field in which we can distinguish different types of algorithms. One of them is unsupervised learning, where the training data is unlabelled. Today I would like to introduce you to K-Means Clustering, an unsupervised algorithm used to solve clustering problems. I will explain a little theory and show how to implement this algorithm in C# using ML.NET.
What is Clustering?
Clustering is the task of dividing data into groups of similar objects. For example, let’s assume that you have a lot of data at your disposal about visitors to your online store and would like to identify groups of similar users. Clustering lets you discover those groups and check which one a particular visitor falls into. For instance, the algorithm might notice that 30% of visitors are middle-aged women and 20% are sports fans.
K-means belongs to the group of clustering methods. Furthermore, it is a non-hierarchical algorithm; in simple terms, this means that the number of clusters must be specified before running the algorithm. The algorithm moves objects from cluster to cluster until intra-cluster and inter-cluster variability is optimised. The main rule is that the similarity within a cluster should be as high as possible, while separate clusters should differ from each other as much as possible. The way the algorithm works is shown in the diagram below:
The algorithm can be presented in the following steps:
1. Select the number K to determine the number of clusters.
2. Choose K random points as the initial centroids.
3. Assign each data point to the nearest centroid, which forms the initial K clusters.
4. Compute the variance and place a new centroid for each cluster.
5. Reassign each data point to the new nearest centroid.
6. If any assignment changed, go back to step 4; otherwise, finish.
7. The model is complete.
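The steps above can be sketched in plain C#. This is a simplified, self-contained illustration of the idea, not the ML.NET implementation; the sample points and K = 2 are arbitrary:

```csharp
using System;
using System.Linq;

class KMeansSketch
{
    static void Main()
    {
        double[][] points =
        {
            new[] { 1.0, 1.0 }, new[] { 1.5, 2.0 }, new[] { 3.0, 4.0 },
            new[] { 5.0, 7.0 }, new[] { 3.5, 5.0 }, new[] { 4.5, 5.0 },
        };
        int k = 2;

        // Step 2: choose K random points as the initial centroids.
        var rnd = new Random(42);
        double[][] centroids = points.OrderBy(_ => rnd.Next()).Take(k)
                                     .Select(p => (double[])p.Clone()).ToArray();
        int[] assignment = new int[points.Length];
        bool changed = true;

        while (changed)   // Steps 3-6: repeat until assignments are stable.
        {
            changed = false;
            // Steps 3/5: assign each point to the nearest centroid.
            for (int i = 0; i < points.Length; i++)
            {
                int nearest = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(points[i], centroids[c])).First();
                if (nearest != assignment[i]) { assignment[i] = nearest; changed = true; }
            }
            // Step 4: recompute each centroid as the mean of its cluster members.
            for (int c = 0; c < k; c++)
            {
                var members = points.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue;   // leave an empty cluster in place
                for (int d = 0; d < 2; d++)
                    centroids[c][d] = members.Average(p => p[d]);
            }
        }

        for (int i = 0; i < points.Length; i++)
            Console.WriteLine($"Point ({points[i][0]}, {points[i][1]}) -> cluster {assignment[i]}");
    }

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```

Note that this sketch uses random data points as the initial centroids; production implementations typically use smarter seeding such as k-means++, which is also what ML.NET's trainer does by default.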
Example in C#
I used the wine dataset from Kaggle for the experiment. The analyzed data set has 13 features and is adapted from the wine dataset in the UCI Machine Learning Repository by removing the information about the types of wine, which makes it suitable for unsupervised learning. After creating a new console application project, you should create a model class that corresponds to the attributes in the dataset and a class that represents the output of the clustering model. The created classes are shown in the listing:
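As a sketch, the two classes might look as follows. The property names and column order are assumptions based on the standard 13 attributes of the UCI wine dataset:

```csharp
using Microsoft.ML.Data;

// Input model: one property per column of the wine dataset.
// Column names and order are assumed from the UCI wine dataset.
public class WineData
{
    [LoadColumn(0)] public float Alcohol { get; set; }
    [LoadColumn(1)] public float MalicAcid { get; set; }
    [LoadColumn(2)] public float Ash { get; set; }
    [LoadColumn(3)] public float AlcalinityOfAsh { get; set; }
    [LoadColumn(4)] public float Magnesium { get; set; }
    [LoadColumn(5)] public float TotalPhenols { get; set; }
    [LoadColumn(6)] public float Flavanoids { get; set; }
    [LoadColumn(7)] public float NonflavanoidPhenols { get; set; }
    [LoadColumn(8)] public float Proanthocyanins { get; set; }
    [LoadColumn(9)] public float ColorIntensity { get; set; }
    [LoadColumn(10)] public float Hue { get; set; }
    [LoadColumn(11)] public float Od280Od315 { get; set; }
    [LoadColumn(12)] public float Proline { get; set; }
}

// Output model: the predicted cluster id and the squared
// distances from the example to each cluster centroid.
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")] public uint PredictedClusterId { get; set; }
    [ColumnName("Score")] public float[] Distances { get; set; }
}
```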
Then you should write the code which loads the dataset and adapts the model structure to the conventions of the ML.NET library. This means that the feature attributes must be concatenated into a single column named Features.
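A minimal sketch of the loading step, assuming the dataset has been saved as wine-clustering.csv (a hypothetical file name) and a WineData model class decorated with LoadColumn attributes:

```csharp
using Microsoft.ML;

// Entry point for all ML.NET operations; a fixed seed makes runs repeatable.
var mlContext = new MLContext(seed: 0);

// Load the CSV file into an IDataView using the WineData schema.
IDataView data = mlContext.Data.LoadFromTextFile<WineData>(
    path: "wine-clustering.csv",
    hasHeader: true,
    separatorChar: ',');
```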
Now we can move on to the code that creates a pipeline, trains the previously created model and finally evaluates the clustering. ML.NET provides three metrics to assess the quality of clustering: Average Distance, Davies-Bouldin Index and Normalized Mutual Information. The first is a measure of the proximity of examples to their cluster centroids. The Davies-Bouldin Index measures the degree of cluster dispersion and cluster separation, and Normalized Mutual Information measures the interdependence of the variables; this last metric is only calculated when the Label column is specified.
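A sketch of the pipeline, training and evaluation, assuming the mlContext instance and the loaded IDataView data from the previous step, plus the WineData class with the assumed 13 property names; the choice of 3 clusters is arbitrary here:

```csharp
using System;
using Microsoft.ML;

// Concatenate all feature columns into "Features",
// then append the K-means trainer.
var pipeline = mlContext.Transforms.Concatenate("Features",
        nameof(WineData.Alcohol), nameof(WineData.MalicAcid), nameof(WineData.Ash),
        nameof(WineData.AlcalinityOfAsh), nameof(WineData.Magnesium),
        nameof(WineData.TotalPhenols), nameof(WineData.Flavanoids),
        nameof(WineData.NonflavanoidPhenols), nameof(WineData.Proanthocyanins),
        nameof(WineData.ColorIntensity), nameof(WineData.Hue),
        nameof(WineData.Od280Od315), nameof(WineData.Proline))
    .Append(mlContext.Clustering.Trainers.KMeans(
        featureColumnName: "Features", numberOfClusters: 3));

var model = pipeline.Fit(data);

// Evaluate the clustering on the training data. Without a Label column,
// Normalized Mutual Information is not computed.
var predictions = model.Transform(data);
var metrics = mlContext.Clustering.Evaluate(predictions,
    scoreColumnName: "Score", featureColumnName: "Features");

Console.WriteLine($"Average distance: {metrics.AverageDistance}");
Console.WriteLine($"Davies-Bouldin index: {metrics.DaviesBouldinIndex}");
```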
In the end, you can create a prediction engine and check how the K-means algorithm deals with another example:
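A sketch of that step, assuming the trained model from the previous listing and the two model classes; the feature values of the sample are made up for illustration:

```csharp
using System;
using Microsoft.ML;

// A prediction engine scores one example at a time.
var predictor = mlContext.Model
    .CreatePredictionEngine<WineData, ClusterPrediction>(model);

// A hand-made example; these values are invented for the demo.
var sample = new WineData
{
    Alcohol = 13.2f, MalicAcid = 1.78f, Ash = 2.14f, AlcalinityOfAsh = 11.2f,
    Magnesium = 100f, TotalPhenols = 2.65f, Flavanoids = 2.76f,
    NonflavanoidPhenols = 0.26f, Proanthocyanins = 1.28f, ColorIntensity = 4.38f,
    Hue = 1.05f, Od280Od315 = 3.4f, Proline = 1050f
};

ClusterPrediction prediction = predictor.Predict(sample);
Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");
```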
The class responsible for prediction provides information about which cluster the example has been assigned to and the squared distances from that instance to each cluster centroid.
In this blog post, I introduced you to the K-means algorithm and showed how to use it with ML.NET. Its use in ML.NET takes only a few lines of code, but I think it’s worth having some knowledge of what this algorithm is about.