Spectral Clustering in Machine Learning with Scikit-learn


Spectral clustering is a clustering technique used in machine learning to group similar data points together. It is based on the analysis of the spectrum (eigenvalues and eigenvectors) of the similarity or dissimilarity matrices built from the data. This technique is particularly effective when the data has a nonlinear structure or when the separation between clusters is not clearly defined in Euclidean space. The spectral clustering process usually involves three steps: the construction of a similarity or dissimilarity matrix, dimensionality reduction based on its eigenvectors, and the application of a clustering algorithm to the transformed data. This technique is useful in several areas, including pattern recognition, image analysis, and document classification.

Spectral Clustering

Spectral clustering in machine learning has a fascinating history rooted in graph theory and linear algebra. Initially, fundamental concepts such as the adjacency matrix and the Laplacian of a graph provided the mathematical foundation for the spectral approach to clustering. During the 1980s and 1990s, as interest in the application of graph theory and linear algebra in machine learning increased, scholars began to explore how to use the eigenvalues and eigenvectors of matrices associated with data for clustering and dimensionality reduction purposes.

Spectral clustering has undergone significant development over time, with major contributions from researchers such as Andrew Ng and Michael Jordan. This has led to a better understanding of the principles underlying spectral clustering and its practical application in a variety of domains. Its diffusion in both the academic and industrial fields has been notable, thanks to its ability to manage complex and high-dimensional data.

The use of spectral clustering has expanded to a wide range of fields, including pattern recognition, image analysis, community detection in social networks, and much more. Continuous research and development has led to new techniques, algorithms and applications, with the aim of creating models that are more efficient, scalable and adaptable to a wide range of clustering problems.

Spectral clustering is a technique that leverages graph theory and eigenvalue analysis to group similar data points together. The procedure generally consists of the following steps:

  1. Construction of the similarity or dissimilarity matrix (W): In this phase, a matrix is calculated that represents the similarity or dissimilarity between the points in the dataset. This matrix can be represented as an adjacency matrix of a weighted graph. For example, if we have a set of points x_1, x_2, …, x_n, we can calculate a similarity matrix (W) in which w_{ij} represents the similarity between the points x_i and x_j. The formulas for calculating w_{ij} can vary depending on the context and nature of the data.
  2. Construction of the degree matrix (D): Once the similarity matrix (W) is obtained, a diagonal matrix (D) is constructed containing the sums of the rows of (W). This degree matrix is used to build the graph Laplacian in the next step. The matrix (D) can be defined as:
     D_{ii} = \sum_{j=1}^{n} w_{ij}
  3. Calculation of the graph Laplacian (L): The Laplacian (L) is calculated as the difference between the degree matrix (D) and the similarity matrix (W). The Laplacian is important because it encodes information about the structure of the data that is useful for clustering. It is defined as:
     L = D - W
     (This is the unnormalized Laplacian; in practice, normalized variants such as L_{sym} = D^{-1/2} L D^{-1/2} are often used because they are less sensitive to differences in node degree.)
  4. Calculation of eigenvalues and eigenvectors (λ, v): Next, the eigenvalues and eigenvectors of the Laplacian (L) are calculated. The eigenvalues (\lambda) and eigenvectors (v) satisfy the equation:
     Lv = \lambda v
    The eigenvectors corresponding to the smallest eigenvalues represent the main dimensions of the subspace into which the data can be projected.
  5. Clustering using eigenvectors: Finally, the data is clustered using the eigenvectors corresponding to the smallest eigenvalues as features for clustering algorithms such as K-Means.

This is a general approach to spectral clustering. However, there may be variations and extensions depending on the specific context and objectives of the clustering problem.
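
To make these steps concrete, here is a minimal from-scratch sketch in Python using NumPy and scikit-learn's KMeans. The RBF similarity, the gamma value, the symmetric normalization of the Laplacian and the number of clusters are illustrative assumptions, not the only possible choices, and library implementations include additional refinements.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import KMeans

# Example data: two half-moon-shaped groups of points
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# 1. Similarity matrix W (RBF kernel; gamma chosen arbitrarily for illustration)
W = rbf_kernel(X, gamma=20.0)

# 2. Degree matrix D: diagonal matrix of the row sums of W
d = W.sum(axis=1)

# 3. Laplacian L = D - W, here with symmetric normalization D^(-1/2) L D^(-1/2)
L = np.diag(d) - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt

# 4. Eigenvectors of the k smallest eigenvalues (eigh returns them in ascending order)
eigvals, eigvecs = np.linalg.eigh(L_sym)
k = 2
U = eigvecs[:, :k]
U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize the spectral embedding

# 5. K-Means on the spectral embedding gives the final cluster labels
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(labels[:10])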

Spectral Clustering with Scikit-learn

Spectral clustering can be implemented with scikit-learn, a popular machine learning library in Python. Scikit-learn provides a class called SpectralClustering that allows you to perform spectral clustering on a dataset. This class offers a number of options for customizing the algorithm, including different methods for calculating the affinity between points and the desired number of clusters.

from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt

# Generate an example dataset with two half-moon-shaped clusters
X, _ = make_moons(n_samples=1000, noise=0.1, random_state=42)

# Visualize the generated data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], marker='o', edgecolor='k')
plt.title('Generated Moon-Shaped Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Running this code produces a scatter plot of the points in the newly generated dataset:

Spectral Clustering - half-moon dataset

Now that we have the dataset, let’s apply a spectral clustering algorithm:

# Define the spectral clustering model (RBF affinity) and assign each point to a cluster
spectral_clustering = SpectralClustering(n_clusters=2, affinity='rbf', gamma=20.0, random_state=42)
clusters = spectral_clustering.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title('Spectral Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Running this code gives a graphical representation of the clustering results, coloring each point according to the cluster label assigned to it.

Spectral Clustering - results 01

However, with this clustering method things are not quite so simple. If we look again at the line

spectral_clustering = SpectralClustering(n_clusters=2, affinity='rbf', gamma=20.0, random_state=42)

we see that we used some very specific parameter values.

In the context of spectral clustering, the affinity parameter refers to the method used to calculate the similarity or dissimilarity between points in the dataset. This parameter affects the construction of the similarity or dissimilarity matrix, which is fundamental for spectral clustering.

When we talk about “affinity”, we are essentially referring to how similar two points are to each other. In practical terms, affinity determines how close or similar points are in a multidimensional space. The more similar two points are, the greater the value of the affinity between them, and vice versa.

In the context of spectral clustering, there are several methods to calculate affinity. Some of the main methods are:

  • Nearest Neighbors (affinity='nearest_neighbors'): The affinity matrix is built as a graph connecting each point to its k closest neighbors, based on Euclidean distance or another distance metric.
  • RBF (Radial Basis Function, affinity='rbf'): Affinity is calculated using a Gaussian radial basis function of the distance between points in feature space.
  • Precomputed (affinity='precomputed'): Allows you to provide a precomputed similarity matrix between points.
  • Precomputed nearest neighbors (affinity='precomputed_nearest_neighbors'): Allows you to provide a precomputed sparse graph of nearest neighbors.

Each of these affinity methods has its own advantages and may be better suited to certain data types or clustering structures. Choosing the appropriate affinity method is important to obtain accurate and meaningful results in spectral clustering.
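
To make these options more concrete, here is a small sketch of how the corresponding affinity matrices could be built by hand with scikit-learn utilities and then passed to SpectralClustering via affinity='precomputed'; the gamma and n_neighbors values are arbitrary illustrative choices.

from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# RBF affinity: dense matrix of Gaussian similarities (gamma chosen arbitrarily)
W_rbf = rbf_kernel(X, gamma=20.0)

# Nearest-neighbors affinity: sparse connectivity graph of the 10 closest points,
# symmetrized so that it can be interpreted as a similarity matrix
W_knn = kneighbors_graph(X, n_neighbors=10, include_self=False)
W_knn = 0.5 * (W_knn + W_knn.T)

# A matrix built in advance can be passed directly with affinity='precomputed'
model = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=42)
labels = model.fit_predict(W_rbf)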

In the case of moon-shaped clusters we used RBF as affinity.

The gamma parameter is specific to radial basis kernels (RBFs) and other similar kernels used in machine learning techniques such as Support Vector Machine (SVM) and spectral clustering. In this context, gamma controls the width of the kernel function.

Regarding spectral clustering, when using RBF-based affinity, as in the case of affinity=’rbf’, the gamma parameter influences the shape of the Gaussian radial function used to calculate the affinity between points in the dataset.
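
Concretely, with affinity='rbf' each entry of the similarity matrix is a Gaussian function of the squared Euclidean distance between the two points:

     w_{ij} = \exp(-\gamma \, \lVert x_i - x_j \rVert^2)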

Simply put, a higher value of gamma makes the Gaussian radial function narrower (a smaller effective radius), so only points that are very close to each other receive a significant affinity. Conversely, a lower value of gamma widens the function (a larger effective radius), so a larger number of points contribute appreciably to the affinity calculation.

Thus, adjusting the gamma parameter is important to achieve optimal results in spectral clustering and other algorithms that use RBF kernels. By experimenting with different gamma values, you can find the right compromise to adapt the algorithm to your data and obtain better clustering results.

# Define the gamma values to test
gamma_values = [0.1, 1.0, 10.0, 100.0]

# Create subplots for each gamma value
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, gamma in enumerate(gamma_values):
    row = i // 2
    col = i % 2
    
    # Define the Spectral Clustering model with the current gamma
    spectral_clustering = SpectralClustering(n_clusters=2, affinity='rbf', gamma=gamma, random_state=42)
    
    # Perform clustering
    clusters = spectral_clustering.fit_predict(X)
    
    # Visualize clustering results
    axes[row, col].scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
    axes[row, col].set_title(f'Spectral Clustering with gamma={gamma}')
    axes[row, col].set_xlabel('Feature 1')
    axes[row, col].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

By changing the gamma values for our example we obtain the following results:

Spectral Clustering - results with gamma

As we can see, we start to have acceptable results for gamma values > 10.

A further example

Let’s now look at another example, using a completely different dataset.

from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt

# Generate example data shaped as two concentric circles
X, _ = make_circles(n_samples=400, factor=0.5, noise=0.05)

# Visualize the generated data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], marker='o', edgecolor='k')
plt.title('Generated Concentric Circles Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

This time the points are arranged in circles, and the two clusters correspond to rings of different radii.

Spectral Clustering - results circles

If we apply the same previous method to this dataset with RBF affinity over a range of different gammas:

Spectral Clustering - results circles gamma

We see that optimal results are obtained with gamma equal to 100.

But a similar result can also be achieved by changing the affinity to 'nearest_neighbors'.

spectral_clustering = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=42)
clusters = spectral_clustering.fit_predict(X)
Spectral Clustering - Nearest Neighbors

In this example, both approaches gave satisfactory results.

Is it better to use RBF with very high gamma or Nearest Neighbors?

In the previous example we saw that the best results are obtained either with an RBF affinity with very high gamma values or with the Nearest Neighbors affinity. Neither choice is better in all circumstances: which one to use depends on the type of data you are analyzing and on the goals of the clustering.

RBF affinity with very high gamma:

This approach tends to produce more compact and dense clusters. With a large gamma value, the RBF function will have a very small radius, which means that the influential points in the affinity will only be those that are very close. It is suitable for datasets where the clusters are compact and well separated.

Nearest Neighbors Affinity:

This approach uses the similarity between the k-nearest neighbors as affinity, making points close to each other influential in the affinity. It is suitable for datasets with more complex or non-linear structures, where clusters may have irregular shapes or intersections. It may be more robust than RBF affinity with very high gamma in cases where clusters have varied sizes or densities.

In general, if your data has well-defined and separated clusters, using RBF affinity with high gamma might be a reasonable choice. However, if your data is more complex and you need more flexibility in defining clusters, you may prefer to use Nearest Neighbors affinity.

I recommend you try both approaches and evaluate the results based on your specific needs and the structure of the data you are dealing with.
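
One practical, if rough, way to carry out that comparison is to compute an internal validity metric such as the silhouette score for both configurations. This is only a heuristic sketch: the silhouette score favors compact, convex clusters, so it should be interpreted with care on ring- or moon-shaped data, and the parameter values below are purely illustrative.

from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

X, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)

# Two candidate configurations to compare (illustrative parameter values)
models = {
    'rbf, gamma=100': SpectralClustering(n_clusters=2, affinity='rbf', gamma=100.0, random_state=42),
    'nearest_neighbors': SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    # Silhouette score lies in [-1, 1]; higher means more compact, better separated clusters
    score = silhouette_score(X, labels)
    print(f'{name}: silhouette = {score:.3f}')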

In general, when to use Spectral Clustering?

Spectral clustering is particularly useful in certain situations compared to other clustering methods. Here are some cases where spectral clustering might be preferable:

  • Nonlinear or complex datasets: Spectral clustering is especially useful when the data has a nonlinear or complex structure, where other clustering algorithms may have difficulty separating clusters effectively. Due to its ability to capture nonlinear relationships between points, spectral clustering may be more suitable in these cases.
  • Moderately sized datasets: After the spectral decomposition, the actual clustering is performed on a reduced representation of the data (a few eigenvectors), which keeps the final clustering step inexpensive. The dominant costs are building the affinity matrix and computing the eigenvectors, which can become heavy for very large datasets (see the note below).
  • Datasets with variations in cluster density or size: Spectral clustering makes no assumptions about the shape or density of clusters, making it suitable for datasets where clusters have variations in density or size. It can effectively handle clusters of different shapes and varied sizes.
  • Semi-supervised clustering: Spectral clustering can be extended to support semi-supervised clustering by incorporating partial or supervised label information into the clustering process.
  • Datasets with graphs or structured data: Spectral clustering is based on the analysis of the spectra of similarity or dissimilarity matrices between data, making it suitable for structured data, such as graphs, networks, or data with defined intrinsic relationships.

However, it is important to note that spectral clustering can be more computationally expensive than other clustering algorithms, especially for large datasets. Furthermore, the choice of affinity and parameters can significantly influence the spectral clustering results. Therefore, it is advisable to carefully evaluate the characteristics of your dataset and experiment with different clustering approaches to find the best solution for your problem.
