K-Means

Understanding the basic k-means algorithm and applying it

Halil İbrahim Hatun
6 min read · Jun 22, 2023

Clustering

Clustering is a technique used in unsupervised machine learning to group similar data points together based on their inherent characteristics or similarities. It aims to find patterns or structures in the data without prior knowledge of the desired output or labels.

The goal of clustering is to divide a dataset into clusters or groups, where data points within the same cluster are more similar to each other than to those in other clusters. The similarity or dissimilarity between data points is typically measured using a distance or similarity metric, such as Euclidean distance or cosine similarity.
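
For example, here is a quick NumPy sketch of those two metrics (the points a and b are made up for illustration):

    import numpy as np

    a = np.array([1.0, 2.0])
    b = np.array([4.0, 6.0])

    # Euclidean distance: straight-line distance between the two points
    euclidean = np.linalg.norm(a - b)  # 5.0

    # Cosine similarity: cosine of the angle between the two vectors
    cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))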

K-Means

K-means is a popular clustering algorithm that aims to partition a given dataset into k clusters. It is an iterative algorithm that alternates between two steps: assigning data points to the nearest cluster centroid and updating each cluster centroid based on its assigned data points.

Here’s a high-level overview of the k-means algorithm:

  1. Initialization:
  • Choose the value of k, the number of clusters.
  • Randomly initialize k centroids in the feature space.

2. Assignment Step:

  • For each data point, calculate the distance to each centroid, using, for example, the Euclidean distance metric:

d(x, c) = √((x₁ − c₁)² + (x₂ − c₂)² + … + (xₙ − cₙ)²)
  • Assign the data point to the cluster with the nearest centroid.

3. Move Cluster Centroids Step:

  • Recalculate the centroids of each cluster by taking the mean of all data points assigned to that cluster.
cⱼ = (1 / |Cⱼ|) Σ x, summed over all points x in cluster Cⱼ

4. Iteration:

  • Repeat the assignment and move cluster centroids steps until convergence. Convergence occurs when either the centroids do not change significantly or a maximum number of iterations is reached.

5. Output:

  • The final output of the k-means algorithm is a set of k clusters, where each data point is assigned to one of the clusters.

Here’s a simplified pseudocode sketch of the k-means algorithm, following the steps above (X is the dataset, k the number of clusters):

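    # k-means pseudocode sketch: X is the dataset, k the number of clusters
    initialize k centroids (e.g., pick k random data points)
    repeat until convergence or max_iters is reached:
        # Assignment step
        for each point x in X:
            assign x to the cluster with the nearest centroid
        # Move-cluster-centroids step
        for each cluster j:
            centroid[j] = mean of the points currently assigned to cluster j
    return the centroids and the cluster assignments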

Here’s a minimal Python (NumPy) implementation sketch of the k-means algorithm; the function and parameter names are illustrative:

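    import numpy as np

    def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
        """Minimal k-means sketch. X is an (n_samples, n_features) array."""
        rng = np.random.default_rng(seed)
        # Initialization: pick k distinct data points as the starting centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Assignment step: find the nearest centroid for every point
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Move-cluster-centroids step: mean of the points in each cluster
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Convergence: stop once the centroids barely move
            if np.linalg.norm(new_centroids - centroids) < tol:
                centroids = new_centroids
                break
            centroids = new_centroids
        return centroids, labels

For example, calling kmeans(X, k=6) returns six centroids and a cluster label for each row of X.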

How do we choose the value of k (the number of clusters)?

Several factors affect the quality of the final K-means output, and one of them is the choice of the number of clusters (k). Choosing too few clusters results in underfitting, while choosing too many can result in overfitting. Unfortunately, there is no definitive way to find the optimal number, but the following methods can guide the choice.

  1. Elbow Method

It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and identifying the “elbow” point, which indicates the number of clusters where the rate of improvement in clustering quality starts to diminish significantly.

[Plot: WCSS vs. number of clusters, with the elbow point marked]
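
A short scikit-learn sketch of the elbow method, assuming the data is in an array or DataFrame X; in scikit-learn, the WCSS of a fitted model is exposed as its inertia_ attribute:

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    wcss = []
    ks = range(1, 11)
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wcss.append(km.inertia_)  # within-cluster sum of squares for this k

    plt.plot(ks, wcss, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("WCSS")
    plt.show()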

2. Silhouette method

It provides a measure of how well each data point fits into its assigned cluster. The method calculates a silhouette coefficient for each data point, which is a value between -1 and 1.

The silhouette coefficient measures the cohesion and separation of a data point within its cluster. A coefficient close to +1 indicates that the data point is well-matched to its own cluster and poorly matched to neighboring clusters. A coefficient close to 0 suggests that the data point is on or very close to the decision boundary between neighboring clusters. A coefficient close to -1 indicates that the data point may have been assigned to the wrong cluster.

For a data point i, let a(i) be the mean distance from i to the other points in its own cluster (cohesion), and let b(i) be the mean distance from i to the points in the nearest neighboring cluster (separation). The silhouette coefficient is then:

s(i) = (b(i) − a(i)) / max(a(i), b(i))
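
A scikit-learn sketch that scores each candidate k, again assuming the data is in X; silhouette_score computes the coefficient averaged over all points:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    for k in range(2, 11):  # the silhouette is defined for at least two clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))

The k with the highest average score is the preferred choice.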

3. Gap Statistic Method

It compares the observed within-cluster dispersion to the expected dispersion under a reference null distribution. The larger the gap between the observed and expected dispersions, the more distinct the clusters are considered to be.

Gap statistic = log(expected dispersion) − log(observed dispersion)

The gap statistic method provides a quantitative measure for selecting the number of clusters by comparing the clustering results to a reference null distribution. It helps avoid overfitting or underfitting the data by providing a statistically guided approach to determining the appropriate number of clusters for a given dataset.

[Plot: gap statistic vs. number of clusters]

Here’s a minimal implementation sketch of the gap statistic method in Python. It draws uniform reference samples from the data’s bounding box (the simplest choice of null distribution); k_max and n_refs are illustrative parameters:

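    import numpy as np
    from sklearn.cluster import KMeans

    def gap_statistic(X, k_max=10, n_refs=10, seed=42):
        """Gap(k) = mean(log(WCSS of reference data)) - log(WCSS of the data)."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        mins, maxs = X.min(axis=0), X.max(axis=0)
        gaps = []
        for k in range(1, k_max + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
            log_wk = np.log(km.inertia_)
            # Expected dispersion under the null: uniform data in the same box
            ref_log_wks = []
            for _ in range(n_refs):
                ref = rng.uniform(mins, maxs, size=X.shape)
                ref_km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref)
                ref_log_wks.append(np.log(ref_km.inertia_))
            gaps.append(np.mean(ref_log_wks) - log_wk)
        return gaps  # a larger gap means more distinct clusters at that k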

Customer Segmentation Example

Now, let’s perform a customer segmentation operation on the Mall customer dataset in Python.

Let’s take a look at our data.
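
A minimal loading sketch; the file name Mall_Customers.csv is an assumption about where the Kaggle CSV lives locally:

    import pandas as pd

    # Assumed local copy of the Kaggle "Mall Customer Segmentation Data" CSV
    df = pd.read_csv("Mall_Customers.csv")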

    df.head()
    df.dtypes

As you can see, our data has three integer features (Age, Annual Income, and Spending Score) on which to perform clustering.

Let’s look at the distribution of these integer features.

[Plot: distributions of Age, Annual Income, and Spending Score before clustering]

Now, we choose the value of k (the number of clusters).

  1. Elbow Method

If we use the elbow method, the plot of SSE values against the number of clusters looks like this:

[Plot: SSE vs. number of clusters for the mall customer data (elbow method)]

Looking at the graph, we might place the elbow at 6, though only with difficulty; the curve bends gradually, so the elbow method alone does not give us a confident answer here.

2. Silhouette Method

Let’s try the silhouette method.

[Plot: average silhouette score vs. number of clusters for the mall customer data]

As you can see, the silhouette method also points to six clusters.

3. Gap Statistic Method

Finally, we try the gap statistic method.

[Plot: gap statistic vs. number of clusters for the mall customer data]

The gap statistic method also points to 6, with 10 as another candidate.

Therefore, let’s apply our k-means algorithm to six clusters and visualize the results.
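
A minimal sketch of the final fit; df is the DataFrame loaded earlier, and the exact column names in the Kaggle CSV are assumptions:

    from sklearn.cluster import KMeans

    # Assumed column names from the Kaggle CSV
    features = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
    km = KMeans(n_clusters=6, n_init=10, random_state=42)
    df["Cluster"] = km.fit_predict(df[features])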

[Plot: feature distributions colored by cluster, k = 6]

Conclusion

In conclusion, k-means clustering is a powerful unsupervised machine learning algorithm that enables the grouping of data points into distinct clusters based on their similarity. By iteratively optimizing cluster centroids and assigning data points to the nearest centroid, k-means effectively partitions the data space. Its simplicity and efficiency make it a popular choice for various applications, such as customer segmentation, image compression, and anomaly detection. However, k-means has some limitations, such as sensitivity to initial centroid placement and dependence on the number of clusters (k) specified. Despite these challenges, understanding the principles and techniques behind k-means can greatly enhance our ability to extract meaningful insights from complex datasets and pave the way for more advanced clustering algorithms in the field of machine learning.

That’s all I’m going to say. Thank you for reading. If you want to look at the Customer Segmentation notebook that the plots in this blog come from, you can find it here.

I am open to any comments, positive or negative, so please don’t hold them back.

Stay well.
