The 21st century is often referred to as the age of “Big Data” due to the unprecedented increase in the volumes of data being generated. As most of this data comes without labels, making sense of it is a non-trivial task. To gain insight from unlabelled data, unsupervised machine learning algorithms have been developed and continue to be refined. These algorithms determine underlying relationships within the data by grouping data points into cluster families. The resulting clusters not only highlight associations within the data, but they are also critical for creating predictive models for new data.
K-Means is one of the most commonly used clustering methods, particularly because it implements a very simple algorithm. To use K-Means clustering, the user needs to assign a value for K, which corresponds to the number of clusters in the dataset. Considering a two-dimensional example with K = 2, the algorithm:
- Creates 2 centroids at random (Stars in Fig B).
- Generates 2 clusters by assigning each data point to its nearest centroid (Red and Blue Squares in B).
- Calculates new centroids for each of the 2 clusters (C).
- Reassigns data points using the new set of centroids (C).
- If there are any new assignments, goes back to step 3
- If no new assignments occur, the algorithm terminates (D)
Determining K
The major limitation of K-Means is that it requires the user to know the number of clusters associated with the data beforehand. However, determining the underlying number of clusters is challenging. One of the approaches that attempt to solve this problem relies on the Within Cluster Sum of Squares (WCSS) also known as Inertia. In the WCSS equation (1), x denotes each data point while C denotes the centroids.
To use this approach, one would determine WCSS for a range of cluster numbers and plot WCSS vs cluster number. K is then chosen to be the cluster number corresponding to the elbow of the plot or the cluster number after which the drop in WCSS starts to dampen.
Limitations
K-Means clustering works well in cases where the data is well separated and spherical/circular in shape. The algorithm struggles to cluster datasets with spiral shapes or varying densities.
Further Information
For comprehensive information and python implementation of K-Means clustering, check out the following links: