Clustering with DBSCAN in Machine Learning with Python

Post Views: 262

DBSCAN, an acronym for Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm used in machine learning to group data sets based on their density in space. The main goal of DBSCAN is to identify regions of high density separated by regions of low density.

The DBSCAN algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm invented by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It was presented in their paper titled “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”.

The main idea of DBSCAN is to identify clusters of points in a given space, based on the density of the points. Unlike other clustering algorithms such as K-Means, DBSCAN does not require you to specify the number of clusters a priori. Instead, it automatically determines clusters based on the density of points in the data

There are some fundamental concepts underlying this algorithm:

density point
central point
edge point
noise point

DBSCAN defines a density point as a point with a minimum number of points within a certain distance. A point ( p ) is defined as having density if the number of points within a certain distance $\epsilon$ from ( p ) exceeds a minimum value ( MinPts ). This can be expressed as:

$N_{\epsilon}(p) = {q \in D \mid dist(p, q) \leq \epsilon}$

Where $N_{\epsilon}(p)$ is the number of points within the distance $\epsilon$ from ( p ) and ( D ) is the set of data.

A point p is defined as a central point if the number of points within a certain distance (eps) is greater than or equal to a minimum value (MinPts). Mathematically:

$|N_{\epsilon}(p)| \geq MinPts$

An edge point is a point that is not central but is reachable from a central point. In other words, it is a point that is within the vicinity of a central point but does not satisfy the density requirement.

A Noise Point is neither central nor edge.

The DBSCAN algorithm uses these definitions to identify clusters in the data. It starts with a random unvisited point, and if this point is central, all points in its neighborhood are assigned to the same cluster. The algorithm proceeds iteratively until all points have been visited.

A key formula in the algorithm is the definition of “reachability” between two points. Two points (p) and (q) are considered reachable if there exists a sequence of points connecting (p) to (q) where each subsequent point is a central point. This is expressed as:

$\text{Reachability}(p, q) = \begin{cases} \text{True}, & \text{se } q \text{ can be reached from } p \ \text{False}, & \text{else} \end{cases}$

Furthermore, the density of a cluster can be expressed as the sum of the density points at the center points of the cluster:

$\text{Cluster density}(C) = \sum_{p \in C} |N_{\epsilon}(p)|$

These formulas and definitions provide a solid foundation for understanding and implementing the DBSCAN algorithm in data clustering.

The algorithm proceeds like this:

Start with a random unvisited point.
If this point is a central point, all points reachable from it are part of its cluster.
If the point is not central, it is marked as noise.
Repeats the process until all points have been visited.

The main features of DBSCAN are its ability to handle arbitrarily shaped clusters and its resistance to noise. However, it requires the specification of two parameters: eps and MinPts, which can make it sensitive to choosing optimal parameters for certain data sets.

Overall, DBSCAN is a great choice when you have data with varied densities and want to identify clusters of different shapes and sizes.

Use DBSCAN for clustering with Scikit-learn in Python

Like many other clustering algorithms, DBSCAN is also present in Python’s scikit-learn library. You can use the DBSCAN class to perform density-based clustering using this algorithm.

In this example we will use make_moons from scikit-learn to generate data automatically. This function is used to generate a synthetic data set that has the shape of two semicircles, similar to two crescent moons, within a two-dimensional space. This type of dataset is often used for demonstration purposes or to test clustering and classification algorithms.

Let’s see an example code to generate the data to be clustered later:

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate moon-shaped data
X, _ = make_moons(n_samples=1000, noise=0.1)

# Visualize the generated data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], marker='o', edgecolor='k')
plt.title('Generated Moon-Shaped Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Executing we get the following result:

Here is an explanation of the main parameters used with make_moons:

n_samples: The number of data points to generate. In this example, we specified 1000 points.
noise: The standard deviation of the Gaussian noise added to the generated data. The higher the noise value, the greater the dispersion of the data points. In this example, we have set noise to 0.1, which means that a small amount of noise is added to the generated data to make it more realistic.

Then, the make_moons function generates a set of random data by placing two semicircles in a two-dimensional space, and adds some Gaussian noise to make the data a little more realistic. This is especially useful for testing clustering algorithms like DBSCAN in scenarios where the data has nonlinear or complex shapes.

Now let’s apply DBSCAN to the newly generated dataset:

dbscan = DBSCAN(eps=0.10, min_samples=5)

labels = dbscan.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()

Executing we get the following result:

As we can see from the graph, we have the clear definition of the two clusters, with two different colors. Yellow (cluster 1) and Green (cluster 0) which are very clear and there are no points of intermediate color which represent ambiguities between the assignment of the point in question to the two clusters. Furthermore, the points with the blue color (cluster -1) are those colors that are considered external and not belonging to any cluster.

Evaluation of the parameters to pass to DBSCAN

We saw an optimal result with the perfect separation of the points into two very distinct clusters. To obtain these results we used two specific parameter values passed to DBSCAN:

dbscan = DBSCAN(eps=0.10, min_samples=5)

But a priori, when we start from scratch, how can we know the optimal value of these values?

There are methods that allow us to obtain the optimal values of these parameters:

the elbow method
the silhouette method

Let’s start, using the elbow method to find an appropriate value for eps. This method involves plotting the curve of the average distance between points and their nearest neighbor (k-dist graph), and finding the point where the curve starts to bend. This point can be considered as a good value for eps.

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

# Compute the k-nearest neighbors for each point
k = 10  # Number of neighbors to consider
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, indices = nbrs.kneighbors(X)

# Compute the average distance of the k-th nearest neighbor for each point
avg_distances = distances.mean(axis=1)

# Visualize the plot of k-nearest neighbors distance
plt.plot(sorted(avg_distances))
plt.xlabel('Point')
plt.ylabel(f'Average {k}-Nearest Neighbor Distance')
plt.title(f'{k}-Nearest Neighbor Distance Plot')
plt.show()

Running you get:

As the optimal value, I evaluate the value corresponding to the “elbow of the curve” which, when rounded, I consider equal to 0.10. Now that I know the best value of eps (0.10) I can apply the silhouette method to obtain the optimal value of min_sample,

from sklearn.metrics import silhouette_score
import numpy as np

# Optimal value for eps
eps = 0.10

# Range of values for min_samples to explore
min_samples_range = range(2, 10)

# Initialize a list to save silhouette scores
silhouette_scores = []

# Compute silhouette score for each min_samples value
for min_samples in min_samples_range:
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    labels = dbscan.fit_predict(X)
    if len(np.unique(labels)) > 1:  # Ensure there are at least 2 clusters
        silhouette_avg = silhouette_score(X, labels)
        silhouette_scores.append(silhouette_avg)
    else:
        silhouette_scores.append(-1)  # Set a negative score if there's only one cluster

# Find the min_samples value with the maximum silhouette score
best_min_samples = min_samples_range[np.argmax(silhouette_scores)]
best_silhouette_score = max(silhouette_scores)

print(f"The best value for min_samples is {best_min_samples} with a silhouette score of {best_silhouette_score}")

Running you get:

The best value for min_samples is 5 with a silhouette score of 0.26399382975954

These are the two parameters I entered at the beginning. The parameters may vary from dataset to dataset.

DBSCAN vs K-Means

Regarding clustering, the most commonly known and used algorithm is K-Means. It is therefore important to know what the differences are between DBSCAN and K-Means and when it is more appropriate to use one or the other.

DBSCAN is a density-based clustering algorithm that assigns points to clusters based on data density. It requires the specification of two parameters: ( \epsilon ), which represents the maximum distance between points to be considered neighbors, and ( MinPts ), the minimum number of points within ( \epsilon ) to form a cluster. Identifies regions of high density by separating them from regions of low density. It is not necessary to specify the number of clusters in advance. It can identify clusters of arbitrary shape and is not constrained to clusters of spherical shape. It is noise resistant and can identify noise points as separate clusters. It is more efficient than K-means on datasets with varied densities, but may be less effective on large datasets due to computational complexity.

K-means is a clustering algorithm that assigns points to one of ( k ) clusters, where ( k ) is a predefined number of clusters specified by the user. It requires specifying the number of clusters ( k ) a priori. Try to minimize the sum of the squared distances between the points and the centroid of each cluster. It produces spherical clusters, as it minimizes the distance between the points and the centroid. K-means is sensitive to outliers and can be affected by noisy data. It is effective on large datasets, but can be computationally expensive.

When to use K-means:

When the approximate number of clusters is known a priori and you want to produce spherical clusters.
When the data is relatively clean and does not contain many outliers.

When to use DBSCAN:

When the shape and size of the clusters are not known a priori and you want to identify clusters of arbitrary shape.
When your data contains noise or outliers and you want those points to be treated as separate clusters or noise.
When you want to avoid specifying the number of clusters a priori.

In summary, the choice between K-means and DBSCAN depends on the specific characteristics of the data and the objectives of the clustering. It is important to carefully examine the properties of the data, the expected shape of the clusters, the presence of noise and the need to easily interpret the results before selecting the most suitable algorithm. In some cases, it may be useful to run both algorithms and compare the results for example

The DBSCAN algorithm

Use DBSCAN for clustering with Scikit-learn in Python

Evaluation of the parameters to pass to DBSCAN

DBSCAN vs K-Means

Leave a Reply Cancel reply