Clustering with K-Means in Machine Learning in Python

K-Means clustering header

K-means is one of the most popular and widely used clustering algorithms in the field of machine learning and data analytics. It is an unsupervised learning algorithm that aims to partition a dataset into K distinct clusters. The term “K” in K-means indicates the desired number of clusters.

K-Means algorithm

The K-means algorithm was first proposed by Stuart Lloyd in 1957 as a vector quantization method for transmitting low-bandwidth signals. However, the name “K-means” was coined by James MacQueen in 1967, and the algorithm was further developed and popularized in subsequent years. Since the introduction of MacQueen, the K-means algorithm has undergone numerous developments and improvements. Several variants have been proposed to address challenges related to sensitivity to centroid initialization, convergence, and scalability. Over the years, the K-means algorithm has become widely used in a variety of fields, including machine learning, data analytics, computer vision, bioinformatics, and more. Its conceptual simplicity and effectiveness have made it one of the most popular and used clustering algorithms.

In general, K-means is useful when you want to explore the inherent structure of an unlabeled data set or when you want to group data so you can make more informed decisions or extract useful insights from the data itself. However, it is important to note that K-means has some limitations, such as sensitivity to the choice of the number of clusters and the initial distribution of centroids, which must be taken into account when analyzing the results.

It the various phases of the K-Means algorithm

Initialization: Initially, K initial centroids are randomly chosen, one for each cluster. These centroids can be randomly selected from the data points themselves or from a random distribution.

Assignment: For each data point ( x_i ), its distance to each centroid is calculated using a distance metric, usually Euclidean distance. The formula to calculate the Euclidean distance between two points ( x_i ) and ( c_j ) is:

 \text{distanza}(x_i, c_j) = \sqrt{\sum_{d=1}^{D} (x_{i,d} - c_{j,d})^2}

where ( D ) is the number of dimensions of the data and ( x_{i,d} ) and ( c_{j,d} ) are the coordinates of the point ( x_i ) and centroid ( c_j ) in dimension ( d ). Once the distances are calculated, the point ( x_i ) is assigned to the cluster whose centroid is closest.

Recalculating Centroids: After all points have been assigned to a cluster, the centroids are recalculated as the average of the points assigned to each cluster. The formula to calculate the new centroid (c_j) for cluster (j) is:

 c_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i

where (N_j) is the number of points in cluster (j) and (x_i) are the points assigned to cluster (j).

Iteration: The centroid assignment and recalculation steps are repeated iteratively until the centroids significantly change position or until a maximum number of iterations is reached.

Convergence: The algorithm converges when the centroids converge to fixed positions and the clusters become stable.

This is the main process of K-means. Choosing the optimal number of clusters ( K ) is important and can be determined using methods such as the elbow method or cross-validation.

K-Means with Scikit-Learn

The K-means algorithm is implemented in the scikit-learn library, which is one of the most popular libraries for machine learning in Python. Scikit-learn offers an efficient and flexible implementation of the K-means algorithm through the KMeans class.

Let’s now try to create a simple example to see how to implement and use the K-means algorithm with the scikit-learn library in Python.

First, let’s think about the starting data. In this regard, we can use make_blobs. This is a feature provided by the scikit-learn library that allows you to generate synthetic datasets often used for testing and demonstration purposes in machine learning algorithms. This function generates clusters of random points in two-dimensional or three-dimensional space, with the specified cluster centroids and standard deviation of the clusters.

In practice, make_blobs generates a specified number of random samples distributed around specified centroids, with a specified standard deviation controlling the dispersion of points within each cluster.

This feature is useful for creating test datasets with different levels of complexity and cluster structures, allowing developers and researchers to test and compare clustering, classification, and other machine learning algorithms on synthetic data before applying them to real data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate random data using make_blobs function
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the generated data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Generated Data')
plt.show()

By running the code we will obtain the graphical representation of the points generated by make_blobs.

K-Means in Machine Learning - make blobs scatter plot

As you can see, the make_blobs function generated random data already grouped into 4 differentiated groups. Now we can move on to applying the K-means algorithm for clustering on them.

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

Here we are instantiating a KMeans object from the scikit-learn library with n_clusters=4, which means we want to find 4 clusters in our data. Next, we call the fit(X) method on this object to fit the model to the previously generated X data.

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

After fitting the model to the data, we obtain two important results:

  • labels: This is an array that contains the cluster labels assigned to each data point. Each label indicates which cluster the respective point belongs to.
  • centroids: This is a matrix that contains the coordinates of the centroids of the clusters found by the algorithm. Centroids are the “midpoints” of the clusters, and represent the central locations of the data groups.

Now let’s report all the necessary code:

# Apply K-means algorithm for clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize clusters and centroids found
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='red', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Clustering with K-means')
plt.legend()
plt.show()

Executing you get the following result:

K-Means in Machine Learning - clustering

As we can see we have obtained a graph in which the 4 clusters identified with 4 different colors are highlighted. Furthermore, the centroids of each cluster are reported, with a star marker.

Evaluation of the results obtained

Clustering is an unsupervised task, meaning we have no known class labels to directly evaluate model performance. However, there are several metrics and techniques that can be used to evaluate the quality of the obtained clusters.

Let’s now use some of these evaluation metrics to evaluate the clustering obtained with the K-means algorithm in the previous example.

  • Silhouette Score: This metric provides a measure of cluster cohesion and separation. A higher value indicates that the clusters are well separated.
  • Davies-Bouldin Score: This metric measures the “dispersion” between clusters. A lower value indicates more compact and well separated clusters.
from sklearn.metrics import silhouette_score, davies_bouldin_score

silhouette_avg = silhouette_score(X, labels)
print("Silhouette Score:", silhouette_avg)

davies_bouldin = davies_bouldin_score(X, labels)
print("Davies-Bouldin Score:", davies_bouldin)

It is important to note that these metrics are calculated using only the data and cluster labels, without requiring actual class labels. This makes them very useful for evaluating unsupervised clustering. Running you get results similar to the following:

Silhouette Score: 0.6819938690643478
Davies-Bouldin Score: 0.43756400782378385

How should we evaluate them?

Silhouette Score

This value ranges from -1 to 1. A score closer to 1 indicates that the samples are well assigned to their clusters and that the clusters are well separated from each other. A score closer to -1 indicates that many samples were assigned to the wrong clusters. A score close to 0 indicates that the clusters may overlap. In our case, a Silhouette Score of 0.68 suggests that the clusters are well separated and that the samples are well assigned to their clusters. This is a good result and indicates a good clustering structure.

Davies-Bouldin Score

This value is a ratio of the average intra-cluster “dispersion” to the inter-cluster “dispersion”. A lower value indicates more compact and well separated clusters. In our case, a Davies-Bouldin Score of 0.44 suggests that the clusters are quite compact and well separated. This is also a good result and confirms the cohesion and separation of the clusters.

In general, evaluating these two scores together, we can say that the clustering has produced positive results and that the clusters are well defined and separated from each other. However, it is always advisable to compare these results with other evaluation methods and domain knowledge for a more comprehensive evaluation.

Let’s move on to the visual analysis:

Let’s start by looking at the cluster visualization. After running the K-means algorithm, we plotted the colored data points based on the clusters to which they were assigned. Each color represents a different cluster. Furthermore, we marked the centroids of the clusters with red stars. Looking at the cluster view, it appears that the data has been grouped together consistently. The points within each cluster appear to be quite close to each other, while the centroids appear to be at the center of their respective clusters.

In summary, looking at both the cluster visualization and the evaluation metric scores, we can conclude that the clustering obtained with the K-means algorithm appears to be significant and of good quality. The clusters are well defined and separated, and the centroids are positioned significantly relative to the data.

A more real case: the problem of the number of clusters

Nell’esempio precedente abbiamo scelto arbitrariamente di utilizzare 4 cluster, perchè sapevamo che la funzione make_blobs ci avrebbe fornito 4 cluster. Nei casi reali, però il numero dei cluster presenti ci è ovviamente sconosciuta. Tuttavia, la scelta del numero ottimale di cluster è una decisione cruciale nel clustering e può influenzare significativamente i risultati. È importante considerare tecniche come il metodo del gomito o il metodo della silhouette per determinare il numero ottimale di cluster in base alla struttura dei dati.

The Elbow Method

This time, we start from the assumption that we do not know the optimal number of clusters obtainable from our starting dataset. How to find out the number of clusters with which to apply K-Means? Well, there is the elbow method and let’s see how to apply it to our example.

To do this, the inertias are calculated. Inertias (or “inertias” in English) are a measure of the cohesion of clusters in a clustering algorithm. In the context of the K-means algorithm, inertia represents the sum of the squared distances of the data points with respect to the centroids of their respective clusters. In other words, it is a measure of how close the points within each cluster are to the cluster’s centroid.

More specifically, inertia is calculated as the sum of the squares of the distances between each data point and the centroid of the cluster to which it belongs. A lower value of inertia indicates that the points within the clusters are closer to the centroid, so the clusters are more compact and coherent.

In the context of the elbow method for choosing the optimal number of clusters, inertia is useful because it gives us a measure of the overall “quality” of clustering for a given number of clusters. In plotting inertia against the number of clusters, we can look for the point where inertia stops decreasing dramatically, suggesting the optimal number of clusters to use. This point often forms an “elbow” in the graph, which is where the method name comes from.

Let’s write the code that calculates the inertias on our dataset. In our case, we consider a number of possible clusters between 1 and 10.

# List to store inertias
inertias = []

# Try a number of clusters from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    # Calculate inertia and store in the list
    inertias.append(kmeans.inertia_)

# Plot inertia versus number of clusters
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing Number of Clusters')
plt.show()

Executing you get the following result:

K-Means in Machine Learning - elbow method

We obtained the characteristic “elbow” where the number 4 represents the point corresponding to the elbow. This number is the number of optimal clusters identifiable in our dataset.

Calculating inertia for different cluster configurations can be computationally expensive, especially when working with large amounts of data or a high number of dimensions. Because inertia involves calculating distances between each data point and the corresponding cluster centroid, the number of operations required can increase rapidly with the number of data points and the number of clusters.

However, for many applications, elbow method analysis is still feasible. Often, it is sufficient to perform this calculation on a representative subset of the data rather than the entire data set. Furthermore, the exponential growth in computation time can be mitigated using optimization and parallelization techniques, especially when using efficient libraries such as scikit-learn in Python.

In general, it is important to balance computational complexity with the accuracy of decisions made via the elbow method. You may need to perform a more detailed analysis of the optimal number of clusters only when greater precision is required and when you have the necessary computational resources to do so.

The Silhouette Method

The silhouette method is another technique used to evaluate the cohesion and separation of clusters and can be used to determine the optimal number of clusters in a dataset. This method provides a measure of the overall clustering quality for a given number of clusters, and the optimal number of clusters can be identified by maximizing the average silhouette value across all samples.

For each data point in the set, we calculate the silhouette score. A point’s silhouette score measures how similar its cluster is to other clusters. It is calculated as:

 s_i = \frac{b_i - a_i}{\max{(a_i, b_i)}}

where:

(s_i) is the silhouette score of point (i).

( a_i ) is the average distance of point ( i ) to other points in the same cluster.

( b_i ) is the average distance of point ( i ) to points in the nearest cluster other than the one to which point ( i ) belongs.

After calculating the silhouette score for each point, we calculate the average silhouette score across all points. This gives us an overall measure of clustering quality for a given number of clusters.

We repeat the process of clustering and calculating the silhouette score for different values of K (number of clusters). The optimal number of clusters is the one that maximizes the average value of the silhouette score.

In practice, we can plot the average value of the silhouette score as a function of the number of clusters and identify the point where the value is maximum. This is often referred to as the optimal number of clusters to use.

Let’s see this method applied to our example:

from sklearn.metrics import silhouette_score

# List to store silhouette scores
silhouette_scores = []

# Try a number of clusters from 2 to 10
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    # Calculate silhouette scores and store in the list
    silhouette_avg = silhouette_score(X, kmeans.labels_)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores versus number of clusters
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Method for Choosing Number of Clusters')
plt.show()

In this code, we are calculating the silhouette score for a number of clusters from 2 to 10. For each value of i, we train a K-means model with the current number of clusters and calculate the average silhouette score for the data points using the function silhouette_score from the scikit-learn library. The average silhouette scores are then stored in the silhouette_scores list. Finally, we plot the average silhouette scores against the number of clusters using plt.plot(), and identify the optimal number of clusters where the average silhouette score is maximum.

Executing the following graph is obtained:

K-Means in Machine Learning - silhouette method

The optimal number of clusters is the one that maximizes the average value of the silhouette score. We can identify the number of clusters at the peak of the silhouette score graph, which also in this case corresponds to the value 4.

Leave a Reply