How to generate specific datasets for clustering with Scikit-learn

Scikit-learn, one of the most popular libraries for machine learning in Python, offers several functions for generating datasets suitable for clustering. These functions create synthetic datasets, which are built artificially with the specific goal of running clustering operations and evaluating the performance of clustering algorithms.

The datasets that can be generated by Scikit-learn

The Scikit-learn library provides a series of functions that generate, with a single call, datasets suitable for clustering studies. Each function produces a distribution of points with particular characteristics, giving rise to natural clusters with a shape specific to that function. Let’s see a list of these functions:

make_blobs: Generates isotropic Gaussian blobs for clustering.
make_moons: Generates two interleaving half circles.
make_circles: Generates a large circle containing a smaller circle.
make_gaussian_quantiles: Generates an isotropic Gaussian sample and labels the points by quantile, i.e. by concentric shells around the mean.
make_s_curve: Generates an S-shaped curve in 3D.
make_swiss_roll: Generates a Swiss roll-shaped dataset in 3D.

These datasets can be used to test clustering algorithms or to perform clustering experiments in controlled environments. An important point is that only some of these functions return cluster membership labels as their second value (y). The other two also return a second array, but it holds the continuous position of each point along the curve rather than a label. The functions divide as follows:

Labeled datasets:

make_blobs
make_moons
make_circles
make_gaussian_quantiles

Unlabeled datasets:

make_s_curve
make_swiss_roll

This allows you to choose the right function based on your clustering or classification needs.
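The difference between the two groups can be seen by inspecting what each function returns. The following is a minimal sketch, with small sample sizes and variable names chosen only for illustration:

```python
from sklearn.datasets import make_blobs, make_s_curve

# Labeled generator: the second return value holds integer cluster labels
X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, random_state=0)

# Manifold generator: the second return value is a float position along the curve
X_curve, t_curve = make_s_curve(n_samples=200, random_state=0)

print(X_blobs.shape, y_blobs.dtype)   # 2D points, integer labels
print(X_curve.shape, t_curve.dtype)   # 3D points, floating-point positions
print(sorted(set(y_blobs.tolist())))  # the distinct cluster labels
```

If you need labels for the 3D curves, you would have to derive them yourself, for example by binning the position values.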

The generation of synthetic datasets for clustering with scikit-learn

Now let’s see how we can generate these synthetic datasets in code. The implementation is very simple and consists of a single function call. Let’s write the code

from sklearn.datasets import make_blobs

# Generate a dataset of 1000 points distributed in 5 clusters
# with standard deviation (cluster_std) set to 1.0
X, y = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)

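This call generates 1000 points split across 5 clusters, where each cluster has a standard deviation (variability) of the points equal to 1.0. The array X holds the coordinates of the points and y holds the cluster to which each point belongs, so the points can be represented in a scatter plot where each color marks a cluster.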

Here is an explanation of the parameters used in the make_blobs function:

n_samples: This parameter specifies the total number of points to generate in the dataset. In our example, n_samples=1000 indicates that we want to generate a dataset with 1000 points.
centers: This parameter indicates the number of clusters to generate in the dataset. In our example, centers=5 specifies that we want to generate 5 distinct clusters.
cluster_std: This parameter controls the standard deviation (variability) of the generated clusters. A higher value of cluster_std means that the points within each cluster will be more dispersed, while a lower value means that the points will be more compact around the center of the cluster. In our example, cluster_std=1.0 indicates that we want clusters with a standard deviation of 1.0.
random_state: This parameter is used to initialize the pseudo-random number generator. Providing a value to random_state ensures that the results are reproducible. If two calls to make_blobs use the same value for random_state, they will generate the same dataset. In our example, random_state=42 is an arbitrary value used for reproducibility of results. You can choose any integer for this parameter.
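The effect of random_state and cluster_std described above can be checked directly. This is a small sketch; the helper mean_spread and the value cluster_std=3.0 are introduced here only for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same random_state -> identical datasets
X_a, y_a = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)
X_b, y_b = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)
print(np.array_equal(X_a, X_b), np.array_equal(y_a, y_b))

# Larger cluster_std -> points spread further from their cluster center
X_wide, y_wide = make_blobs(n_samples=1000, centers=5, cluster_std=3.0, random_state=42)


def mean_spread(X, y):
    """Average per-cluster standard deviation over all features (illustrative helper)."""
    return np.mean([X[y == k].std(axis=0).mean() for k in np.unique(y)])


print(mean_spread(X_a, y_a), mean_spread(X_wide, y_wide))
```

The first print should report that both arrays match, and the second should show a spread close to the requested cluster_std in each case.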

To be able to visualize the distribution of the dataset and the clusters built in it, we can use the matplotlib library. So let’s write the following code

import matplotlib.pyplot as plt

# Plot the generated points
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title('Dataset of Isotropic Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Running the code we get the following representation:

Dataset for clustering - isotropic blobs

The datasets generated by Scikit-learn for clustering were created with the goal of providing users with versatile tools to explore and understand how clustering algorithms work. These datasets are useful in different contexts.

Isotropic Blob Dataset

We have just seen how Scikit-learn’s make_blobs function generates a synthetic dataset composed of clusters of points, commonly called “blobs”. These blobs are randomly distributed in the feature space according to an isotropic Gaussian distribution, meaning that the variance is the same in all directions.

This type of dataset is useful for testing and evaluating clustering algorithms, as it offers precise control over cluster parameters. It is particularly useful for evaluating clustering algorithms that require spherical or isotropic clusters.
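As a rough illustration of such an evaluation, the sketch below runs K-means on a blob dataset and compares the result with the true labels using the adjusted Rand index. The explicit center coordinates, the small cluster_std and the choice of metric are assumptions made for this example, not part of the original code:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated centers (illustrative coordinates)
centers = [(-5, -5), (5, 5), (5, -5)]
X, y_true = make_blobs(n_samples=600, centers=centers, cluster_std=0.5, random_state=42)

# Fit K-means with the known number of clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

# The adjusted Rand index is 1.0 for a perfect match, near 0 for random labels
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

Because the blobs are compact and far apart, the score should be close to 1; bringing the centers closer or raising cluster_std makes the task harder.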

Moon-Shaped Dataset

Scikit-learn’s make_moons function generates a synthetic dataset composed of two interleaving half circles. Here are some main characteristics of a dataset generated with make_moons:

Shape of the clusters: The generated clusters are two interleaving half circles, each one curving into the opening of the other. This makes the dataset particularly useful for testing clustering algorithms that need to handle nonlinear or complex clusters.
Noise: You can specify the level of noise in the dataset via the noise parameter. This allows you to control how much the generated points can deviate from the ideal shape of the semicircles.
Number of points: You can specify the number of points generated via the n_samples parameter.
Distribution: The points within each semicircle are evenly distributed along the curve of the semicircle.

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title('Moon-shaped dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Running the code, we obtain the following dataset:

Compared to the blobs produced by make_blobs, the dataset generated with make_moons has a more complex, non-linear shape. It is particularly useful for testing clustering algorithms in scenarios where clusters may have non-standard or non-linear shapes. For example, make_moons is useful for showing the limits of an algorithm such as k-means, which separates clusters with straight (linear) boundaries between centroids and therefore cannot split the two moons cleanly.
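The comparison can be made concrete by running a centroid-based and a density-based algorithm on the same moons. This is a sketch under assumptions: the lower noise level (0.05), the DBSCAN settings (eps=0.15, min_samples=5) and the adjusted Rand index as a score are choices made for this example and would need tuning on other data:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=1000, noise=0.05, random_state=42)

# K-means splits the plane with a straight boundary between two centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows regions of high density, so it can trace each moon
dbscan_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

With these settings DBSCAN should score clearly higher than K-means, although the gap shrinks as the noise level grows and the two moons start to touch.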

Circle-shaped Dataset

The dataset generated by Scikit-learn’s make_circles function is designed to simulate a set of data that has a circular or ring-shaped structure. This dataset has the following characteristics:

Circular structure: The dataset contains points distributed in concentric circles, similar to a target. This circular structure makes it useful for testing clustering algorithms that need to identify groups of data arranged in a circular or ring-shaped pattern.

Noise: The make_circles function allows you to add a controlled level of noise to the generated data. This helps make the dataset more realistic and suitable for testing the effectiveness of clustering algorithms in the presence of noise in the data.

factor parameter: The factor parameter sets the scale factor between the inner and the outer circle (a value between 0 and 1), and therefore the gap between the two rings. This allows you to create datasets with different levels of separation between data groups, so you can test clustering algorithms on different spatial distributions of data.

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate the circle-shaped dataset
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5)

# Visualize the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.title("Circle-shaped Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

Here is an explanation of the parameters used in the make_circles function:

n_samples: This parameter specifies the number of points to generate in the dataset. In our example, n_samples=1000 indicates that we want to generate a dataset with 1000 points.
noise: This parameter sets the standard deviation of the Gaussian noise added to the data. The higher the value, the more the points scatter around the ideal circles. In our example, noise=0.1 adds a moderate amount of noise, enough to blur the rings without hiding them.
factor: This parameter is the scale factor between the inner and the outer circle and must lie between 0 and 1. A value close to 1 produces an inner circle almost as large as the outer one, so the two rings are close together, while a lower value produces a smaller inner circle that is further away from the outer one. In our example, factor=0.5 means that the inner circle has a radius equal to half the radius of the outer circle.
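The meaning of factor can be verified by generating the circles without noise and measuring the distance of the points from the origin. A minimal sketch, assuming (as in the current implementation) that label 0 marks the outer circle and label 1 the inner one:

```python
import numpy as np
from sklearn.datasets import make_circles

factor = 0.5
# noise=None places every point exactly on its circle
X, y = make_circles(n_samples=1000, noise=None, factor=factor, random_state=42)

# Distance of each point from the origin
radii = np.linalg.norm(X, axis=1)
outer_radius = radii[y == 0].mean()
inner_radius = radii[y == 1].mean()

print(f"outer radius: {outer_radius:.2f}")
print(f"inner radius: {inner_radius:.2f}")
```

The inner radius should equal factor times the outer radius, so raising factor towards 1 moves the two rings closer together.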

Dataset for clustering - circle-shaped labeled

The dataset generated by make_circles is commonly used to test and evaluate clustering algorithms that are capable of identifying and distinguishing groups of data arranged in a circular or ring-shaped manner. For example, algorithms such as agglomerative clustering, K-means or DBSCAN can be tested on this dataset to evaluate their ability to identify and distinguish the concentric circles or ring-shaped structure of the data.

Compared to other datasets generated by similar functions, such as make_blobs or make_moons, the dataset generated by make_circles has a different structure and can be used to test clustering algorithms on circular or ring-shaped data patterns, rather than on globular or crescent-shaped ones.

Gaussian Quantiles Dataset

The dataset generated by make_gaussian_quantiles is composed of samples drawn from a single multivariate Gaussian distribution, which are then split into classes according to their distance from the mean. The main characteristics of this dataset include:

Gaussian distribution: The samples are drawn from an isotropic Gaussian, so the values of each feature follow a normal distribution.
Classes: The dataset contains a specified number of classes that all share the same center. The mean and the covariance of the underlying Gaussian can be set with the mean and cov parameters.
Quantiles: The classes are separated by nested concentric spheres, i.e. by quantiles of the distance from the mean, so that each class holds roughly the same number of samples.

from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt

# Generate a Gaussian dataset split into 2 quantile-based classes
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2)

# Plot the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k')
plt.title('Gaussian Quantiles Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')
plt.grid(True)
plt.show()

Let’s analyze the parameters used to control the generation of the dataset via the make_gaussian_quantiles function:

n_samples: This parameter specifies the total number of samples you want to generate in the dataset. In the case of the example, n_samples=1000 means that the dataset will contain a total of 1000 samples.
n_features: This parameter specifies the number of features or variables for each sample in the dataset. For example, n_features=2 indicates that each sample will have two features.
n_classes: This parameter specifies the number of classes into which the samples are split. With n_classes=2, the function produces two classes: an inner core around the mean and an outer ring surrounding it.

So, in the context of the example, make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2) indicates that we are generating a dataset with 1000 samples, each with two features, divided into two concentric classes.

Dataset for clustering - gaussian quantiles

This type of dataset is often used in clustering to test algorithms and evaluate their performance. Compared to other datasets generated by similar make_ functions in Scikit-learn, such as make_blobs or make_moons, the dataset generated by make_gaussian_quantiles may be more suitable when you want to test algorithms on Gaussian data whose groups are nested inside one another rather than placed side by side. The classes share the same center and touch along their boundary with no gap between them, which makes this a demanding case for algorithms that expect clearly separated clusters.
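To see how the labels are assigned, you can compare the distance of each point from the origin with its class. The sketch below relies on the default mean (the origin) and adds a random_state for reproducibility, which the original example does not set:

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2, random_state=42)

# Distance of each sample from the default mean (the origin)
dist = np.linalg.norm(X, axis=1)

# Both classes receive the same number of samples
print(np.bincount(y))

# Every class-0 point lies closer to the center than any class-1 point
print(dist[y == 0].max(), dist[y == 1].min())
```

The labels therefore describe concentric shells of a single Gaussian cloud, not separate blobs with their own centers.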

S-Curve Dataset (3D and Unlabeled)

The dataset generated by make_s_curve is a synthetic dataset that follows an “S” curve shape in three-dimensional space. It is mainly characterized by the following properties:

“S” Curve Shape: Data is distributed along an “S” curve in three-dimensional space. This gives the dataset a non-linear structure.
Noise: It is possible to introduce noise into the generated data via the noise parameter. This can make the “S” curve more irregular and add variation to the data.
Parameter t: Each point in the dataset has an associated parameter called t, which represents the coordinate along the “S” curve. This parameter can be used to color the points in the plot to visualize the position along the curve.

from sklearn.datasets import make_s_curve
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate the "S" shaped dataset with 1000 points
X, t = make_s_curve(n_samples=1000, noise=0.1, random_state=42)

# Plot the dataset
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap='viridis')
ax.set_title('S-Curve Dataset')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

This type of dataset can be used in clustering to test algorithms that can deal with data with nonlinear structure. Unlike datasets generated by make_ functions such as make_blobs or make_moons, make_s_curve offers a more complex and nonlinear structure, which can be useful for evaluating the ability of clustering algorithms to detect and handle such complexity. For example, density-based clustering algorithms such as DBSCAN could be tested on this type of dataset to see how they perform on nonlinear structures, while manifold learning methods such as t-SNE, which reduce dimensionality rather than cluster, can be used to check how well the curve unfolds in two dimensions.
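Since make_s_curve returns a position t instead of labels, it helps to check what that value represents. The following sketch turns the noise off so that the relationship is exact. It assumes the current implementation, in which the first coordinate is the sine of t:

```python
import numpy as np
from sklearn.datasets import make_s_curve

# noise=0.0 keeps every point exactly on the curve
X, t = make_s_curve(n_samples=1000, noise=0.0, random_state=42)

print(X.shape)           # three coordinates per point
print(t.min(), t.max())  # t spans a continuous interval around zero

# The first coordinate is fully determined by the position t
print(np.allclose(X[:, 0], np.sin(t)))
```

Because t is continuous, it is suited to coloring a plot or checking an embedding, but it cannot be passed as ground-truth labels to a clustering score without first being discretized.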

Swiss Roll Dataset (3D and Unlabeled)

The dataset generated by make_swiss_roll is a synthetic representation of a three-dimensional Swiss roll, which is a common shape used for testing dimensionality reduction and clustering algorithms. This type of dataset has the following main characteristics:

Swiss roll shape: The points in the dataset are distributed along a three-dimensional spiral, resembling the shape of a Swiss roll. This aspect gives the dataset a complex and non-linear geometric structure.
Three-dimensional dimensionality: Each sample in the dataset is represented by three spatial coordinates (X, Y, Z), which represent the position of the point in three-dimensional space.
Optional noise: You can add noise to the dataset via the noise parameter, which can influence the distribution of points and the shape of the spiral.

from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate the Swiss roll dataset with 1000 samples
X, color = make_swiss_roll(n_samples=1000, noise=0.1)

# Extract coordinates
x = X[:, 0]
y = X[:, 1]
z = X[:, 2]

# Plot the dataset
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c=color, cmap=plt.cm.viridis, s=50)
ax.set_title('Swiss Roll Dataset')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

In this example, make_swiss_roll generates a 3D “Swiss roll” dataset with 1000 samples. The noise parameter controls the amount of noise added to the data.

Next, the data is visualized using matplotlib with a 3D visualization. Each point in the dataset has three coordinates (X, Y, Z) and is colored based on the second value returned by make_swiss_roll, which is the position of the point along the spiral. This value is continuous, so it shows how the roll unwinds rather than assigning class labels.

This type of dataset is often used in clustering to test algorithms on data with a complex, nonlinear structure. Compared to other datasets generated by similar make_ functions in Scikit-learn, such as make_blobs or make_moons, the dataset generated by make_swiss_roll has a more intricate, three-dimensional structure, making it suitable for evaluating the effectiveness of clustering algorithms on data with more complex and non-linear shapes. Furthermore, it can also be used to test dimensionality reduction algorithms, as it offers a three-dimensional representation that can be projected into lower-dimensional spaces.
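Both points can be sketched in a few lines: what the second return value means, and how the roll can be projected into two dimensions. PCA is used here only as a simple, assumed example of a projection. Being linear, it flattens the roll without unrolling it, which is exactly why this dataset is a common test for nonlinear methods:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

# noise=0.0 keeps every point exactly on the spiral
X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# In the X-Z plane, the distance from the axis of the roll equals the position t
radius = np.hypot(X[:, 0], X[:, 2])
print(np.allclose(radius, t))

# Linear projection of the 3D points onto 2 components
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)
```

Coloring a scatter plot of X_2d by t is a quick way to see whether a projection keeps neighboring parts of the spiral together.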

3D labeled datasets

The unlabeled datasets are all three-dimensional. Scikit-learn’s make_blobs, make_moons, make_circles and make_gaussian_quantiles, on the other hand, generate two-dimensional (2D) datasets by default, which are convenient for data visualization and analysis. However, nothing prevents you from using some of these functions to generate datasets in more than two dimensions (3D or higher), although representing and visualizing the data becomes more complex.

3D datasets can be generated using make_blobs and make_gaussian_quantiles by specifying the number of features via the n_features parameter. The other two functions, make_moons and make_circles, have no such parameter and always produce 2D data. For example, setting n_features=3 will generate data with three dimensions.

Here’s an example of how you might use make_blobs to generate a 3D dataset:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate a dataset with 3 features (3D)
X, y = make_blobs(n_samples=1000, n_features=3, centers=5, random_state=42)

# Plot the dataset in 3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', edgecolors='k')
ax.set_title('Example of 3D dataset with make_blobs')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
plt.show()
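The same n_features argument works for make_gaussian_quantiles. A short sketch comparing the shapes returned by the labeled generators, with sample sizes and class counts chosen only for illustration:

```python
from sklearn.datasets import make_blobs, make_gaussian_quantiles, make_moons

# n_features is accepted by make_blobs and make_gaussian_quantiles
X_blobs, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=0)
X_gq, _ = make_gaussian_quantiles(n_samples=300, n_features=3, n_classes=3, random_state=0)

# make_moons has no n_features parameter and always returns 2D points
X_moons, _ = make_moons(n_samples=300, random_state=0)

print(X_blobs.shape, X_gq.shape, X_moons.shape)  # (300, 3) (300, 3) (300, 2)
```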