Support Vector Machines (SVM) for Classification problems in Machine Learning with scikit-learn


Support Vector Machines (SVMs) are a fundamental tool in the field of Machine Learning, particularly useful for tackling classification and regression problems. They are especially effective in situations where the number of feature dimensions is much larger than the number of training examples available.

Support Vector Machines

Support Vector Machines (SVMs) have a fascinating history that begins in the 1960s and 1970s with the pioneering work of Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control and Computation of the USSR Academy of Sciences. In those years, they were developing statistical learning theory, which would lay the conceptual foundation for SVMs.

The real turning point occurred in the 1990s, when Vapnik and his collaborators developed the modern SVM learning algorithm. This marked the beginning of the success of SVMs in the field of machine learning. SVMs quickly gained popularity due to their exceptional performance in classification problems, especially when the number of feature dimensions far exceeded the number of training examples.

A major contribution to SVMs was the introduction of the kernel trick. This technique allowed SVMs to handle nonlinear data by mapping it into a higher-dimensional space, thus opening the way to a wide range of more complex applications.

In essence, Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification and regression problems. Their main goal is to find the optimal hyperplane that separates the classes as well as possible.

Given a set of training points (x_i, y_i), where x_i is the feature vector of the point and y_i is the associated class label (typically -1 or +1 for binary classification problems), the separating hyperplane is defined by:

 w \cdot x + b = 0

Where w is the vector of weights (coefficients) and b is the bias term.

Then the maximum margin must be found. The margin is the distance between the hyperplane and the closest points of each class. The goal of SVMs is to maximize this margin. The margin is calculated as the distance between two parallel hyperplanes (one for each class) closest to the separation plane. The distance between these two hyperplanes is:

 \frac{2}{\lVert w \rVert}

Where \lVert w \rVert represents the Euclidean norm of w.
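
For example, with a hypothetical weight vector w = (3, 4):

 \lVert w \rVert = \sqrt{3^2 + 4^2} = 5, \qquad \frac{2}{\lVert w \rVert} = \frac{2}{5} = 0.4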

To find the optimal hyperplane, we solve the following optimization problem:

 \text{minimize } \frac{1}{2} \lVert w \rVert^2

subject to constraints:

 y_i(w \cdot x_i + b) \geq 1 \quad \text{for } i = 1, 2, \dots, n

These constraints ensure that each training point is beyond the correct margin from the hyperplane.
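
To make the notation concrete, a linear-kernel SVM fitted with scikit-learn exposes the learned w and b through its coef_ and intercept_ attributes. Below is a minimal sketch on hypothetical toy data (the points and the large C value are only illustrative):

import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with labels -1 and +1
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation above
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias term b
print("w =", w, " b =", b)

# The support vectors are the training points lying on the margin
print("Support vectors:\n", clf.support_vectors_)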

Introducing the Soft Margin

In some cases, the data may not be linearly separable. In these cases, we use a modified version of the optimization problem, introducing slack variables \xi_i \geq 0 to allow for an error in the margin:

 y_i(w \cdot x_i + b) \geq 1 - \xi_i

and minimizing:

 \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i

Where C is a regularization parameter that controls the trade-off between maximizing the margin and reducing classification errors.
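
To get a feel for the role of C, one can train the same linear SVM with different regularization values and look at how many support vectors are retained; smaller C tolerates more margin violations, larger C penalizes them more strictly. A minimal sketch on synthetic, partially overlapping data (the dataset and the C values are illustrative):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters, so some slack is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # A larger C typically leaves fewer support vectors (a stricter margin)
    print(f"C={C}: number of support vectors = {clf.n_support_.sum()}")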

The Kernel Trick

To handle nonlinear problems, the data can be mapped into a higher-dimensional space through a feature map \Phi(x). The kernel trick allows the SVM to compute the inner products \Phi(x) \cdot \Phi(x') in that space directly through a kernel function K(x, x'), without ever performing the transformation explicitly. A common example is the RBF (Radial Basis Function) kernel:

 K(x, x') = \exp \left( -\gamma \lVert x - x' \rVert^2 \right)

Where \gamma is a kernel width parameter.
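
As a quick illustration, the RBF kernel value between two points can be computed directly from the formula and compared with scikit-learn's rbf_kernel helper (the points and the value of \gamma are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
x_prime = np.array([[2.0, 0.5]])
gamma = 0.5

# Direct application of K(x, x') = exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))

# The same value computed by scikit-learn
from_sklearn = rbf_kernel(x, x_prime, gamma=gamma)[0, 0]

print(manual, from_sklearn)  # the two values coincide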

In summary, SVMs use the geometry of vector spaces and optimization theory to find the hyperplane that best separates the classes, guaranteeing a maximum margin. The kernel trick allows SVMs to effectively handle even nonlinear problems without having to explicitly perform feature transformation.

Support Vector Machines (SVM) in scikit-learn

Support Vector Machines (SVMs) are implemented in scikit-learn, one of the most popular libraries for Machine Learning in Python. Scikit-learn provides a wide range of machine learning algorithms, including SVMs, making them easily accessible to developers and researchers.

Within scikit-learn, SVMs are implemented through the sklearn.svm module. This module offers several classes for SVM, including:

  • SVC: For classification problems with SVM.
  • NuSVC: For classification problems with SVM, similar to SVC but with a parameter nu that controls the fraction of margin errors and support vectors.
  • SVR: For regression problems with SVM.

These classes offer many options to customize the behavior of SVMs, such as the choice of kernel (linear, polynomial, RBF, etc.), regularization parameters, kernel parameters, and so on.
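
For reference, the three classes can be instantiated as follows; the hyperparameter values shown are only illustrative choices, not recommendations:

from sklearn.svm import SVC, NuSVC, SVR

# Classification with an RBF kernel; C controls regularization, gamma the kernel width
svc = SVC(kernel='rbf', C=1.0, gamma='scale')

# Classification where nu bounds the fraction of margin errors and support vectors
nusvc = NuSVC(kernel='rbf', nu=0.5)

# Regression with an epsilon-insensitive loss
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)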

Example of a Classification problem with Support Vector Machines (SVM)

The classification problem addressed in this example uses the popular Iris dataset. This dataset consists of length and width measurements of the sepals and petals of three iris species: Iris-setosa, Iris-versicolor and Iris-virginica. The goal is to correctly predict the iris species based on these measurements. Let’s load the dataset included in scikit-learn.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

By running this portion of code, you will see the type of measurements contained in the dataset, which is loaded into a pandas DataFrame for convenience.

SVM Support Vector Machine - iris dataset

Now that we have the dataset with the features and the target, we can set up the classification. Let’s split it into two portions: X_train and y_train for training the model, and X_test and y_test for verifying how good the model is. Then we create an SVC model and train it with the available data. At the end of the training phase, the predicted y_pred values are compared with the real y_test values, obtaining the accuracy value.

from sklearn.svm import SVC

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)

# Create a linear-kernel SVM classifier and train it
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Predict the test set and compare with the true labels
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Running the code you get:

Accuracy: 1.0

From this value, we deduce that our model is very reliable (at least as regards the tested values).
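
Since a perfect score on a single 30-example test set can be optimistic, a quick sanity check is to also estimate the accuracy with cross-validation. A short sketch reusing the df and the SVC class already imported above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the whole dataset gives a more robust estimate
scores = cross_val_score(SVC(kernel='linear'), df.drop('target', axis=1), df['target'], cv=5)
print("CV accuracy:", scores.mean())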

If we want to visualize how the values of the dataset are distributed in the feature space we can use the following code:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X = iris.data
y = iris.target

# Scatter plot of the first two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['red', 'green', 'blue']), edgecolor='k', s=20)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Classification Result')
plt.show()

By executing this, you obtain the distribution of the elements of the dataset based on the first two features (out of 4 existing ones).

SVM Support Vector Machine - dataset scatterplot 1

The representation we obtained is two-dimensional, but since the SVC model was trained on a dataset with four features, its correct representation should be four-dimensional. The four features in the Iris dataset include sepal length and width, and petal length and width.

However, because viewing a four-dimensional graph is very complex, two-dimensional or three-dimensional graphs are typically used to visually represent the result of a classification model. In the previous example, we chose to use only the first two features (sepal length and width) for visualization, while the other two features were simply left out of the plot.

This is a simplification that allows a clearer and more intuitive display. However, it is important to keep in mind that we are only visualizing a portion of the feature space and that the visualization does not take into account the other two features of the Iris dataset. Let’s look at all the possible pairwise combinations of the features, shown in the figure below.
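
One simple way to produce this grid of pairwise scatter plots is pandas’ scatter_matrix (seaborn’s pairplot would work just as well). A minimal sketch reusing the df built earlier:

import pandas as pd
import matplotlib.pyplot as plt

# One scatter plot for every pair of the four features, colored by class
pd.plotting.scatter_matrix(df.drop('target', axis=1), c=df['target'],
                           figsize=(10, 10), marker='o', s=20)
plt.show()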

SVM Support Vector Machine - scatter plot of all feature combinations

The Decision Boundary

The “Decision Boundary” is a line, hyperplane or surface in feature space that separates different classes in classification problems. In other words, it is the boundary that the classification model uses to distinguish between different categories of data.

In the context of supervised learning, when we train a classification model, the goal is to find a Decision Boundary that minimizes classification error on the training data. This Decision Boundary can be linear, if the problem is linearly separable, or complex if the problem requires non-linear separation.

For example, consider a binary classification problem where the data are represented by points in a two-dimensional space: here the Decision Boundary is a line that separates the points of one class from those of the other class. In higher-dimensional feature spaces, or when there are more than two classes, the Decision Boundary becomes a hyperplane or a surface (or a combination of them) that separates the different classes.

A good Decision Boundary is one that generalizes well to unseen data, so an important consideration in training classification models is finding a balance between model complexity and generalization ability. A model that is too simple may not be able to capture the complexity of the data, while a model that is too complex may suffer from overfitting, that is, it may overfit the training data, losing the ability to generalize to new and unseen data.

Returning to our example: even though our problem is four-dimensional (4 features), we can still visualize the Decision Boundary, but we have to make some simplifications. One possibility is to project the feature space onto a lower-dimensional space, for example using a dimensionality reduction technique such as PCA (Principal Component Analysis).

To project the feature space onto two dimensions we can employ Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA). In this example, we will use PCA to reduce the dimensionality of the feature space to 2.

Here’s how you can modify the above code to include dimensionality reduction using PCA and display Decision Boundaries in two-dimensional space:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

clf = SVC(kernel='linear')
clf.fit(X_train_pca, y_train)

y_pred = clf.predict(X_test_pca)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Build a grid over the PCA plane and predict the class of each grid point
x_min, x_max = X_train_pca[:, 0].min() - 1, X_train_pca[:, 0].max() + 1
y_min, y_max = X_train_pca[:, 1].min() - 1, X_train_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Color each region of the plane according to the predicted class
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, marker='o', edgecolors='k')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Decision Boundary with PCA')
plt.show()

Running the code, you get:

Accuracy: 0.9666666666666667

And the following graph with the decision boundaries that mark the 3 different areas belonging to the 3 classes.

SVM Support Vector Machine - decision boundary with PCA

Projecting the decision boundaries of the four-dimensional model directly onto two dimensions, without dimensionality reduction, would lead to erroneous conclusions.

If, however, you want to reason directly about the features in a two-dimensional way, you must choose 2 of the four and build a new SVM model trained exclusively on those two. Take for example the first two features of the Iris dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df.iloc[:, :2]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, marker='o', edgecolors='k', label='Train')
plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_test, marker='x', label='Test')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Boundary with SVM')
plt.legend()
plt.show()

Running the code, you get:

Accuracy: 0.9

And the following decision boundary graph.

SVM Support Vector Machine - decision boundary 2 features

When to use SVMs in classification problems?

Support Vector Machines (SVMs) are an appropriate choice for various classification scenarios, but there are specific considerations that may guide the choice between SVM, KNN, and other methods. Here are some points to consider when deciding whether to use SVMs versus KNN or other classification methods:

  • Data size and complexity: SVMs tend to perform well when there are many features (high dimensionality) and the number of training examples is relatively small. In comparison, KNN can become computationally inefficient and less effective with a large number of features or a very large number of data points.
  • Model complexity: SVMs are able to effectively handle problems with complex decision boundaries, even in high-dimensional spaces, thanks to the use of nonlinear kernels. However, KNN tends to be more suitable for problems where the decision boundary is simpler or where the data structure is more “local”, i.e. when similar data points tend to cluster together in feature space.
  • Model interpretability: KNN provides classification based on “closeness” in the training data, which can be more interpretable than the “black-box” nature of SVMs, especially when using complex kernels. So, if model interpretability is a priority, KNN may be preferable.
  • Robustness to noisy data: SVMs tend to be more robust to noisy data than KNN. Since KNN relies on proximity in the training data, it is sensitive to noisy data points or outliers. SVMs, on the other hand, try to maximize the margin between classes, reducing the impact of individual outlier points.
  • Data dimensionality: If you are dealing with a very large number of features, SVMs may be preferable to KNN, as they are less sensitive to the curse of dimensionality (the phenomenon whereby the generalization ability of models decreases as dimensionality increases).

In summary, SVMs are often preferred when dealing with classification problems with a high number of dimensions, a limited number of training examples and complex decision boundaries. However, the choice between SVM, KNN and other classification methods will always depend on the specifics of the problem, the characteristics of the data and the needs of the application.
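
As a rough, purely illustrative comparison (not a benchmark), both classifiers can be evaluated side by side on the Iris data with cross-validation; the chosen kernel and value of k are arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for name, model in [("SVC (RBF kernel)", SVC(kernel='rbf')),
                    ("KNN (k=5)", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")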
