K-Nearest Neighbors (K-NN) with scikit-learn for Classification Problems in Machine Learning


The K-Nearest Neighbor algorithm is a supervised learning method used for classification and regression. The main idea is to classify a data point based on the majority of labels of its k nearest instances in the training set. “Closeness” is often measured using Euclidean distance.

K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm has a fascinating history that traces its roots back to the 1950s with the emergence of pattern recognition theory. Around that time, scientists began to explore statistical methods and classification algorithms to distinguish and identify complex patterns in data.

As the 1960s and 1970s progressed, fundamental concepts in the field of pattern recognition and machine learning were introduced, such as Euclidean distance and instance-based classification. These concepts were essential precursors to the development of KNN.

In the 1980s, with the advent of more powerful computers and growing interest in applying machine learning techniques, the KNN algorithm began to take shape in its modern format. During the 1990s and 2000s, with the explosion of data and improvements in computational capabilities, KNN became increasingly relevant, finding applications in fields such as bioinformatics, speech recognition, and medicine.

Today, KNN remains a critical tool in the machine learning arsenal. While it can be outperformed by more complex algorithms on very large datasets, its conceptual simplicity and relative ease of implementation make it a popular choice, especially for those new to machine learning.

This algorithm is used to solve two types of problems:

  • classification
  • regression

In short, for classification, K-NN finds the k nearest neighbors of a test point, looks at their labels, and assigns the test point the label that appears most frequently among those neighbors. For regression, instead of class labels, the average (or distance-weighted average) of the target values of the k neighbors is used to predict the value of the test point.
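For the regression case, here is a minimal sketch using scikit-learn's KNeighborsRegressor on synthetic data (the data and the choice of k are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic one-dimensional data: a noisy sine curve (illustrative only)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

# The prediction for a new point is the average of its 3 nearest targets
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X, y)

print(knn_reg.predict([[2.5]]))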

It is important to choose an appropriate value of k and a suitable distance metric for your specific problem. K-NN can be sensitive to the scale of the data and may require preprocessing to achieve optimal results.

In the K-Nearest Neighbors (K-NN) algorithm, the distance between points can be calculated using several metrics, but the most common is the Euclidean distance. The formula for the Euclidean distance between two points P(x_1, y_1) and Q(x_2, y_2) in two-dimensional space is:

 d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

Generalizing to an n-dimensional space, the Euclidean distance between two points P and Q is given by:

 d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}

where p_i and q_i are the coordinates of points P and Q, respectively, along the i-th axis.
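For example, a quick sketch of this computation in Python (the two points are purely illustrative):

import numpy as np

# Two illustrative points in a two-dimensional space
P = np.array([1.0, 2.0])
Q = np.array([4.0, 6.0])

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
distance = np.sqrt(np.sum((Q - P) ** 2))
print(distance)  # 5.0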

As for the choice of  k , it is a hyperparameter that can be adjusted to achieve the best performance in the specific application. Too low a value could make the algorithm sensitive to noise, while too high a value could make the algorithm too generic.

In practice, K-NN predictions are obtained by majority voting among the nearest neighbors for classification, or by averaging their values for regression.
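As a hedged sketch of how k might be tuned in practice, a common approach is to evaluate a range of candidate values with cross-validation (the range used here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate values of k with 5-fold cross-validation
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")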

Suggested Book

If you are interested in Machine Learning with Python, I suggest you read this book:

Machine Learning with Python Cookbook

K-Nearest Neighbors in Scikit-Learn

Scikit-learn, one of the most popular machine learning libraries in Python, includes an implementation of KNN in its neighbors module, which contains several classes and functions for building and using proximity-based models on your data. Here is a brief overview of the main components of this module:

  • KNeighborsClassifier: This class implements the K-Nearest Neighbors classifier for classification problems. It allows you to specify the number of neighbors (n_neighbors) and other options for the distance metric and neighbor weighting.
  • KNeighborsRegressor: Similar to the classifier, this class implements the K-Nearest Neighbors regressor for regression problems. It accepts the same parameters as the classifier.
  • NearestNeighbors: This class provides a basis for calculating nearest neighbors. This is useful when you want to access neighbors without necessarily having to train a model, such as for clustering or other proximity-based operations.
  • RadiusNeighborsClassifier and RadiusNeighborsRegressor: These classes implement versions of KNN that use a fixed radius instead of a fixed number of neighbors. This means that all points within a given radius of a query point are considered neighbors.
  • UnsupervisedNeighborsMixin: This mixin class is used as the basis for neighborhood-based clustering algorithms.
  • kneighbors_graph: A function for calculating the adjacency matrix based on nearest neighbors.

These are just some of the main components of the scikit-learn neighbors module. The official scikit-learn documentation provides more in-depth details and examples on using these classes and features.
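As a small, hedged sketch of the unsupervised side of the module, NearestNeighbors and kneighbors_graph can be used to query neighbors directly, without training a classifier (the sample points below are arbitrary):

import numpy as np
from sklearn.neighbors import NearestNeighbors, kneighbors_graph

# A handful of arbitrary points in two dimensions
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])

# Fit an unsupervised nearest-neighbors index and query the 2 closest points
nn = NearestNeighbors(n_neighbors=2)
nn.fit(X)
distances, indices = nn.kneighbors(X)
print(indices)    # indices of each point's 2 nearest neighbors (including itself)
print(distances)  # the corresponding distances

# Sparse connectivity graph linking each point to its 2 nearest neighbors
graph = kneighbors_graph(X, n_neighbors=2, mode='connectivity')
print(graph.toarray())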

Example of a Classification problem with K-Nearest Neighbors

In this example we will use the classic Iris dataset, which contains measurements of the length and width of the sepals and petals of three iris species: Iris-setosa, Iris-versicolor and Iris-virginica. The goal is to correctly predict the iris species from these measurements.

Using the K-Nearest Neighbors (KNN) algorithm to solve this problem has several advantages:

  • Simplicity: KNN is an intuitive and easy to understand classification algorithm. It does not require training a complex model and makes no assumptions about the distribution of the data.
  • Adaptability to data: KNN is particularly useful when there is no a priori information about the data distribution. Because it bases predictions on closeness in the data, it can accommodate complex, nonlinear structures in the data.
  • Interpretability: KNN predictions can be interpreted intuitively. Because it is based on nearest neighbor classification, you can easily see nearby points and understand why a particular instance was classified a certain way.
  • Good performance on small datasets: KNN can perform well on moderate or small sized datasets, as is the case with the Iris dataset. It does not require intensive processing and can provide good results even with a relatively small dataset.

Let’s then implement the example code.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Training the classifier
knn.fit(X_train, y_train)

# Prediction phase on the test set
y_pred = knn.predict(X_test)

# Calculating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

After loading the dataset, we split the data into training and test sets using train_test_split(). Next, we create a KNN classifier with 3 neighbors using KNeighborsClassifier(), train it with the training data using the fit() method, and then make predictions on the test data using the predict() method. Finally, we calculate the accuracy of the model by comparing the predictions with the true labels using accuracy_score().

Accuracy: 1.0

To obtain a graphical representation of our result, we can use the following code:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Create a scatter plot of the first two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['red', 'green', 'blue']), edgecolor='k', s=20)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Classification Result')
plt.show()

Running it, we obtain the following graph:

KNN for classification - scatter plot

The representation we obtained is two-dimensional, but since the KNN model was trained on a dataset with four features, a faithful representation would have to be four-dimensional. The four features in the Iris dataset are sepal length, sepal width, petal length and petal width.

However, because a four-dimensional graph cannot be visualized directly, two-dimensional or three-dimensional plots are typically used to represent the result of a classification model. In the previous plot we chose to display only the first two features (sepal length and width), so the other two features are simply not shown.

This is a simplification that allows a clearer and more intuitive display. However, it is important to keep in mind that we are only visualizing a portion of the feature space and that the plot does not take into account the other two features of the Iris dataset. Let's now look at all the possible pairwise combinations of the features.

KNN for classification - scatter plot all features
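A pair plot like the one above can be produced, for example, with pandas' scatter_matrix (a sketch, assuming pandas is available and the iris object loaded earlier; seaborn's pairplot would work just as well):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Put the four Iris features into a DataFrame for convenience
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Scatter plot of every pair of features, colored by species
scatter_matrix(df, c=iris.target, figsize=(10, 10), marker='o', alpha=0.8)
plt.show()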

Decision Boundary

The “Decision Boundary” is a line, hyperplane or surface in feature space that separates different classes in classification problems. In other words, it is the boundary that the classification model uses to distinguish between different categories of data.

In the context of supervised learning, when we train a classification model, the goal is to find a Decision Boundary that minimizes classification error on the training data. This Decision Boundary can be linear, if the problem is linearly separable, or complex if the problem requires non-linear separation.

For example, consider a binary classification problem where the data are represented by points in a two-dimensional space: the Decision Boundary would be a line that separates the points of one class from those of the other. With more than two classes, or in higher-dimensional feature spaces, the Decision Boundary becomes a set of hyperplanes or surfaces that separate the different classes in the feature space.

A good Decision Boundary is one that generalizes well to unseen data, so an important consideration in training classification models is finding a balance between model complexity and generalization ability. A model that is too simple may not be able to capture the complexity of the data, while a model that is too complex may suffer from overfitting, that is, it may overfit the training data, losing the ability to generalize to new and unseen data.

Returning to our example. Even if our problem is four-dimensional (4 features), we can still visualize the Decision Boundary, but we will have to make some simplifications. One possibility is to project the feature space onto a lower dimensional space, for example using a dimensionality reduction technique such as PCA (Principal Component Analysis).

To project the feature space onto two dimensions, we can employ a dimensionality reduction technique such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA). In this example, we will use PCA to reduce the feature space to 2 dimensions.

Here’s how you can modify the above code to include dimensionality reduction using PCA and display Decision Boundaries in two-dimensional space:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.decomposition import PCA

# Apply PCA to reduce the feature space to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Create a meshgrid to visualize the Decision Boundaries
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Train the KNN classifier using the reduced features
knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_pca, y)

# Predict the class for each point in the meshgrid
Z = knn_pca.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create the scatter plot to visualize the Decision Boundaries
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=ListedColormap(['red', 'green', 'blue']), edgecolor='k', s=20)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('KNN Classification Result with PCA')
plt.show()

In this code, we applied PCA to the Iris dataset to reduce the feature space to two dimensions. We then trained the KNN classifier using the reduced features and visualized the Decision Boundaries in the two-dimensional PCA space. This allows us to view the classification result in a two-dimensional format, even if the original dataset has four features.

KNN for classification - decision boundary in PCA space
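As a small, hedged aside, you can check how much of the original variance the two principal components retain by inspecting the fitted PCA object; for the Iris dataset they capture most of it, which is why the reduced plot remains representative:

# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())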

Two-dimensional projections of the decision boundaries of the four-dimensional model without dimensionality reduction would lead to erroneous conclusions.

If, however, you want to reason directly in two dimensions on the original features, you must necessarily choose 2 of the 4 and build a new KNN model trained exclusively on those two. Take, for example, features 1 and 2 of the Iris dataset.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Loading the Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # We use only the first two features
y = iris.target

# Division of the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Classifier training
knn.fit(X_train, y_train)

# Prediction phase on the test set
y_pred = knn.predict(X_test)

# Calculation of model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Creating a meshgrid to display decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Predicting classes for each point in the meshgrid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create scatterplot to visualize data points and decision regions
plt.contourf(xx, yy, Z, cmap=ListedColormap(['red', 'green', 'blue']), alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['red', 'green', 'blue']), edgecolor='k', s=20)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Classification Result')
plt.show()

Running it, you get the following result:

Accuracy: 0.8
KNN for classification - bidimensional decision boundary

Graphically it might all seem very consistent, with a good correspondence between the green points of the scatter plot and the areas they belong to. But in reality the accuracy dropped from 1.0 (a perfect score on the test set) to 0.8. This is because, during learning, the model lost the information contained in the other two features (3 and 4) which, although not easy for us to visualize, are fundamental for achieving a good level of prediction.

Some Considerations on the K-Nearest Neighbor model

Now that we have an idea of how K-Nearest Neighbor works, it is important to make some useful considerations to better understand this model.

  • Sensitivity to data size and scale: K-NN is sensitive to the scale of the variables, so it is advisable to normalize the data before using the algorithm (a short sketch of this follows the list). Furthermore, in high-dimensional spaces the Euclidean distance can become less significant, and this can affect the performance of K-NN.
  • Choice of parameter k: The choice of the number of neighbors k is critical. A k value that is too small can make the model sensitive to noise, while a k value that is too large can lead to a model that is too generic. The choice of k should be guided by the nature of the problem and may require the use of optimization techniques.
  • Computationally Expensive: The main disadvantage of K-NN is that it can be computationally expensive, especially on large datasets. Calculating the distance between all points can take time, especially when the number of dimensions is large.
  • Sensitive to noise: K-NN can be sensitive to noise and outliers in the data. The presence of noisy data or outliers can negatively affect forecasts.
  • Memory: K-NN requires storing the entire training set, which can become problematic with large amounts of data.
  • Non-parametric and adaptive: K-NN is a non-parametric model, meaning it does not make hard assumptions about the shape of the decision function. This makes it adaptive to different forms of data, but it may require more data for optimal performance.
  • Good for small datasets or complex structures: K-NN can be particularly useful when dealing with small datasets or with complex structures that are not easily modeled by parametric algorithms.
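To illustrate the point about scale, here is a hedged sketch comparing cross-validated accuracy with and without standardization; the Wine dataset is an arbitrary choice of a dataset whose features span very different numeric ranges, and the exact numbers depend on the data:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A dataset whose features span very different numeric ranges
X, y = load_wine(return_X_y=True)

# KNN on the raw, unscaled features
raw_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

# KNN on standardized features (zero mean, unit variance)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_scores = cross_val_score(scaled_knn, X, y, cv=5)

print("Raw features:   ", raw_scores.mean())
print("Scaled features:", scaled_scores.mean())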

In general, K-NN is a simple and intuitive algorithm, but its effectiveness strongly depends on the nature of the problem and the quality of the data. It is often used as a baseline or as part of a more complex model ensemble.
