Affinity Propagation Clustering in Machine Learning with scikit-learn

Affinity Propagation header

Affinity Propagation is a clustering algorithm in machine learning used to identify clusters within a data set. It is based on the concept of “similarity” between data instances rather than Euclidean distance. The algorithm tries to find a set of exemplars that best represent the data set, using a similarity matrix to calculate the “liabilities” and “availabilities” between instances. This method is useful in situations where clusters have a graph structure and can be effective even with large amounts of data.

Affinity Propagation (AP)

Affinity Propagation (AP) is a clustering algorithm that uses the concept of similarity between data instances to identify cluster “exemplars” or “centroids”. It was first proposed by Brendan J. Frey and Delbert Dueck in 2007, in their paper titled “Clustering by Passing Messages Between Data Points”. This article was published in the journal Science, one of the most prestigious scientific journals in the world.

Frey and Dueck’s main motivation was to develop a clustering algorithm that could efficiently handle large datasets without the need to specify the number of clusters a priori. Affinity Propagation relies on the concept of “message passing” between data instances to automatically identify clusters and their centroids.

One of the key features of Affinity Propagation is its ability to identify both clusters and centroids simultaneously, without requiring separate iterative algorithms to search for centroids as in some other clustering algorithms. Affinity Propagation has gained popularity in the fields of machine learning and data science due to its flexibility and performance on a wide range of clustering problems. However, due to its computational complexity, it may be less suitable for very large or high-dimensional datasets.

Let’s look at some of its features in more detail:

Similarity: The algorithm is based on a similarity matrix ( S ), where ( S_{ij} ) represents the similarity between instances ( i ) and ( j ) of the data. This similarity can be calculated using various metrics, such as cosine similarity, Jaccard similarity, or other problem-specific similarity metrics.

Responsibility ( r ): Responsibility ( r(i, k) ) represents how much the instance ( i ) should choose the instance ( k ) as its exemplar. It is calculated using the equation:

 r(i, k) = S(i, k) - \max_{k' \neq k} { a(i, k') + S(i, k') }

where ( a(i, k) ) represents the accumulation of responsibilities that instance ( i ) assigns to instance ( k ).

Availability ( a ): Availability ( a(i, k) ) represents how willing the instance ( k ) is to accept the instance ( i ) as its exemplar. It is calculated using the equation:

 a(i, k) = \min { 0, r(k, k) + \sum_{i' \neq i, i' \neq k} \max {0, r(i', k)} }

where ( r(k, k) ) represents the accumulation of responsibilities that the instance ( k ) assigns to itself.

Iterations: The algorithm alternately iterates between the calculation of responsibilities and availability until convergence.

Cluster assignment: After convergence, cluster exemplars are identified by the indices of the rows with non-zero values on the diagonal of the matrix ( r + a ).

This is just an introduction to Affinity Propagation with the basic mathematical formulas. Deep understanding requires a more detailed exploration of the equations and their implications in the context of clustering.

Affinity Propagation with Scikit-Learn

The Affinity Propagation implementation is provided in Scikit-learn as part of the sklearn.cluster module. You can use the AffinityPropagation class to cluster data using this algorithm.

So first let’s prepare a representative dataset on which to apply this method. We can use make_blobs, a scikit-learn function that allows us to generate random points distributed around centroids, that is, it allows us to generate a dataset with clusters already inside.

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate a sample dataset
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the generated data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Generated Data')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

This will generate an example dataset with 4 clusters, each with a normal distribution and a standard deviation of 0.6, and visualize it using a scatterplot. The coloring of the points will correspond to the generated clusters.

Executing we obtain the following dataset:

Affinity Propagation - dataset

Now that we have a suitable dataset let’s see how to apply Affinity Propagation on it.

af = AffinityPropagation(preference=-50, random_state=0)

# Label the data with assigned clusters
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

# Number of clusters found
n_clusters_ = len(cluster_centers_indices)

print(f"Number of clusters found: {n_clusters_}")

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Affinity Propagation Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

In the code, an AffinityPropagation object is created with the preference parameter set to -50 and the random seed random_state set to 0. The Affinity Propagation algorithm is then trained on data X using the fit(X) method.

After training the model, the data instances are labeled with the assigned clusters. The centroid indices of the identified clusters are extracted via af.cluster_centers_indexes_. The cluster labels assigned to each data instance are obtained via af.labels_.

The preference value set to -50 influences the selection of the initial centroids, indirectly influencing the number of identified clusters. Since the preference parameter is negative, a more restrictive selection of initial centroids is favored, which could lead to a lower number of clusters compared to the default configuration.

The number of clusters found is printed on the screen to provide an indication of the clustering result. Viewing the scatter plot provides further understanding of the data distribution and identified clusters.

Executing this results:

Affinity Propagation - result

Evaluation of the results obtained from Affinity Propagation

To evaluate the validity of the clustering obtained with the Affinity Propagation algorithm, we can use the silhouette coefficient. The silhouette coefficient measures how similar each data point is to its assigned cluster compared to other clusters. A high value of the silhouette coefficient indicates that the points are well clustered, while a low value indicates that the points may have been incorrectly assigned to a cluster.

Here’s how to calculate the silhouette coefficient in your example:

from sklearn.metrics import silhouette_score

# Calculate the silhouette score
silhouette_avg = silhouette_score(X, labels)
print(f"Average silhouette score: {silhouette_avg}")

The average silhouette coefficient is a value between -1 and 1, where:

  • A value close to 1 indicates that the samples have been correctly assigned to their clusters and that the clusters are well separated.
  • A value close to 0 indicates that clusters may overlap.
  • A negative value indicates that samples may have been assigned to the wrong clusters.

Executing we get the following result

Average silhouette score: 0.6819938690643478

In our case, the average silhouette coefficient is 0.6819938690643478, which is a pretty good value. It indicates that the samples were grouped quite appropriately into their clusters, with some separation between the clusters. This suggests that the clustering achieved with Affinity Propagation in your example could be considered quite good. However, it is always a good idea to also look at other metrics and visualizations for a more complete assessment of clustering quality.

When to use Affinity Propagation for clustering?

Affinity Propagation is a clustering algorithm with unique characteristics that make it suitable for certain data types and contexts. Here are some situations where you might consider using Affinity Propagation over other clustering methods in machine learning:

  • Unknown number of clusters: Affinity Propagation is useful when the number of clusters is not known a priori and you want the model to automatically determine the optimal number of clusters. Unlike algorithms such as K-Means, it is not necessary to specify the number of clusters a priori.
  • Groups of different sizes: Affinity Propagation can handle groups of different sizes and densities. Since it does not require the assumption of clusters with spherical shape or equal size, it can also be effective on datasets where the clusters have various shapes and sizes.
  • Data with non-linear or non-metric similarity: Affinity Propagation does not require data to be distributed in a Euclidean space and can be applied to data with non-linear or non-metric similarity. This makes it suitable for a wide range of problems and data types.
  • Interpreting Centroids: Affinity Propagation identifies not only clusters, but also the “specimens” or “centroids” within each cluster. This can be useful if you want to interpret landmarks within clusters.
  • Large data: Affinity Propagation can effectively handle large amounts of data, thanks to its reasonable computational complexity.

However, it is important to note that Affinity Propagation may not be the best choice for all data types or clustering problems. It can be sensitive to its parameters and may require tuning to achieve optimal results. Furthermore, it can be computationally expensive on very large datasets. Therefore, it is always advisable to explore and compare different clustering algorithms to find the one that best suits your specific needs.

Leave a Reply