Hierarchical clustering in machine learning with Scikit-learn

Hierarchical clustering is a clustering technique used in machine learning to group similar data sets into larger clusters, organizing them into a hierarchical structure. This clustering method was primarily developed to address unsupervised classification problems, where the data is not pre-labeled and the model must find structure in the data on its own.

Hierarchical Clustering

Hierarchical clustering is an approach in machine learning that aims to group similar data sets into clusters based on their similarity. This type of clustering builds a hierarchy of clusters, organizing them into a tree structure. There are two main approaches in hierarchical clustering:

• Agglomerative: Starts by considering each point as a single cluster and then iteratively merges the most similar clusters until all points are grouped into a single cluster.
• Divisive: Starts with a single cluster containing all points and then iteratively divides the clusters into smaller clusters until each point is in a separate cluster.

The history of hierarchical clustering dates back to the 1960s and 1970s, when it was introduced as one of the first clustering techniques. One of the first hierarchical clustering algorithms was proposed by Ward in 1963, called “Ward’s method”, which seeks to minimize the sum of squares of the differences between points within each cluster. This algorithm is still commonly used today.

In subsequent years, other algorithms for hierarchical clustering were developed, such as the single linkage method, the complete linkage method, and the average linkage method. These algorithms differ in how they determine the distance between clusters and how they combine adjacent clusters to form a hierarchy.

Hierarchical clustering can be represented by a dendrogram, a tree-like diagram that shows the hierarchical structure of clusters. This allows you to intuitively see how the data has been grouped into larger and smaller clusters.

This approach is used in many areas, such as market segmentation, computational biology, social media analytics, and more. It is a popular option when you don’t know the exact number of clusters you want and want to explore the data structure in more detail.

To better understand hierarchical clustering, let’s start by defining some fundamental notions:

Distance between clusters: You need to define a distance metric to measure the similarity between clusters. One of the most common metrics is Euclidean distance. However, other metrics such as Manhattan distance, Minkowski distance, Pearson correlation, can be used depending on the context. The Euclidean distance between two points (x) and (y) in (n)-dimensional space is given by:

Linking method: This method determines how to calculate the distance between two clusters based on the distance between their member points. Some of the common methods include:

Single Linkage (minimum): The distance between two clusters is the minimum distance between a point in the first cluster and a point in the second cluster.

Complete Linkage (maximum): The distance between two clusters is the maximum distance between a point in the first cluster and a point in the second cluster.

Average Linkage (average): The distance between two clusters is the average of the distances between the points of two clusters.

Dendrogram: It is a graphical representation of the cluster hierarchy. Shows how clusters are merged or split at each step and the distance between them.

With these notions in mind, the hierarchical clustering process can be described in mathematical terms:

1. Initialization: Each data point is considered a separate cluster.
2. Calculation of distance matrix: The distance matrix between all data points is calculated.
3. Cluster Merging: The two most similar clusters based on the chosen linking method are iteratively merged.
4. Updating the distance matrix: The distance matrix is updated taking into account the new cluster configuration.
5. Repetition: Steps 3 and 4 are repeated until all points have been grouped into a single cluster or until the desired number of clusters has been reached.
6. Dendrogram: The dendrogram is constructed to visualize the cluster hierarchy and help determine the optimal number of clusters.

Hierarchical clustering can be computationally expensive, especially for large datasets, but it provides an intuitive view of the structure of the data and can be useful when you don’t know the exact number of clusters you want.

Let’s implement an example of Hierachic Clustering with Scikit-learn

First, let’s think about the starting data. In this regard, we can use make_blobs. This is a feature provided by the scikit-learn library that allows you to generate synthetic datasets often used for testing and demonstration purposes in machine learning algorithms. This function generates clusters of random points in two-dimensional or three-dimensional space, with the specified cluster centroids and standard deviation of the clusters.

In practice, make_blobs generates a specified number of random samples distributed around specified centroids, with a specified standard deviation that controls the dispersion of points within each cluster.

This feature is useful for creating test datasets with different levels of complexity and cluster structures, allowing developers and researchers to test and compare clustering, classification, and other machine learning algorithms on synthetic data before applying them to real data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate random data using make_blobs function
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the generated data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Generated Data')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

By running the code we will obtain the graphical representation of the points generated by make_blobs.

We now apply the hierarchical clustering algorithm to these data, reporting the number of clusters to be identified as 4.

model = AgglomerativeClustering(n_clusters=4)
model.fit(X)

plt.scatter(X[:, 0], X[:, 1], c=model.labels_, s=50, cmap='viridis')
plt.title("Hierarchical Clustering Results")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Executing you get the following result:

As we can see from the image, the clustering went perfectly. The points were divided into 4 clusters and identified with different colors.

Let’s also visualize the underlying dendrogram:

# Build the dendrogram
linked = linkage(X, 'ward')  # 'ward' is a common linkage method
plt.figure(figsize=(12, 8))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True, color_threshold=7)  # Modify this value according to your requirements
plt.title('Dendrogram with branch coloring')
plt.xlabel('Sample indices')
plt.ylabel('Distance')
plt.show()

cluster_colors = {i: plt.cm.nipy_spectral(float(idx) / model.n_clusters) for i, idx in enumerate(np.unique(cluster_labels))}

By executing this, the dendrogram is obtained with the colors corresponding to the 4 clusters identified during the hierarchical subdivision of the points.

Problem of the number of clusters to identify

In the previous example we had established the number of clusters to identify in the dataset as 4. We were already aware of this fact, but in reality this is not always the case. But then how to evaluate the number of clusters into which to divide the data. Well, there are special methods, such as the elbow method, which we will apply to the previous case, pretending not to know the number of clusters to use.

To this end, we implement the following code:

# Calculate the variation of the sum of squared intra-cluster distances
sum_of_squared_distances = []
for k in range(1, 11):
model = AgglomerativeClustering(n_clusters=k)
model.fit(X)
labels = model.labels_
centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
distances = np.array([np.sum((X[labels == i] - centroids[i])**2) for i in range(k)])
sum_of_squared_distances.append(np.sum(distances))

# Plot the elbow method
plt.plot(range(1, 11), sum_of_squared_distances, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Intra-Cluster Distances')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

Running the code you get the following result:

As we can see from the curve, the “elbow” corresponds to the value of 4 clusters, which we already knew was correct.

The sum of intra-cluster squared distances is a measure of cluster cohesion. As more clusters are added, this sum tends to decrease as the clusters become smaller and more homogeneous. However, at some point, adding additional clusters may not lead to a significant reduction in the sum of intra-cluster squared distances. This point is often called the “elbow” in the curve of the sum of squared intra-cluster distances.

The idea of the elbow method is to locate the point at which the decrease in the sum of intra-cluster squared distances begins to decrease dramatically, indicating that adding additional clusters does not lead to a significant reduction in cluster cohesion. This point can be considered as the optimal number of clusters to select.

So, when we look at the curve of the sum of squared intra-cluster distances and locate the point where an “elbow” forms, we can consider that number of clusters as a good candidate for the optimal number of clusters to use in our clustering algorithm .

Clustering Gerarchico vs K-Means

Hierarchical Clustering has a hierarchical structure. It creates a tree structure of clusters that can be explored to get a more detailed view of the relationship between clusters. Therefore it is not necessary to specify the number of clusters desired a priori, since hierarchical clustering builds a hierarchy of clusters. It is robust to noise, and can handle datasets containing outliers or noise thanks to its hierarchical structure. Unfortunately, it is computationally expensive and requires more time and computational resources than K-means, especially on large datasets.

K-means is faster and scalable than hierarchical clustering, making it more suitable for large datasets. It requires specifying the number of clusters a priori, which could be problematic if the optimal number of clusters is not known. The outcome of the clustering may depend on the random choice of initial centroids, so it may be necessary to run the algorithm several times with different initializations. It is effective when the clusters are clearly distinct and separable. It is best suited for real-time applications where speed of execution is prioritized over understanding the structure of the data.

So Hierarchical Clustering is preferable when you want a hierarchical view of the data structure, you don’t know the exact number of clusters you want, and you have enough computational resources to handle the algorithm.

While K-means is preferable when you know the number of clusters you want, you have large datasets and require greater computational efficiency.

Additionally, it is important to consider the specific context of the problem and the characteristics of the data when choosing between the two clustering algorithms. In some cases, it may be useful to run both algorithms and compare the results to gain a more complete understanding of the structure of the data.