Linear Regression with Elastic Net in Machine Learning with scikit-learn


Elastic Net is a linear regression technique that adds a regularization term combining the L1 penalty (as in Lasso regression) and the L2 penalty (as in ridge regression). It is therefore based on the linear regression model, with the addition of these penalties to improve its performance, especially when the variables are multicollinear or when you want to perform variable selection.

Elastic Net

The Elastic Net was introduced in 2005 by two researchers, Hui Zou and Trevor Hastie, in their article entitled “Regularization and variable selection via the elastic net”, published in the Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Zou and Hastie developed Elastic Net as a solution to address the limitations of Lasso regression and ridge regression, two widely used regression techniques. Both of these techniques had their advantages, but also significant drawbacks: Lasso regression tended to select a small subset of predictor variables, while ridge regression retained all variables but performed no true variable selection.

The Elastic Net combined the features of both methods, introducing a mixed regularization that includes both the L1 penalty and the L2 penalty. This made it possible to obtain the variable selection benefits of Lasso regression together with the stability of ridge regression.

Zou and Hastie’s paper sparked great interest in the statistics and machine learning community, leading to the widespread adoption of Elastic Net in various application areas. Since then, Elastic Net has become a very popular tool for regression and data analysis, used to address a wide range of problems, including those with high-dimensional data and multicollinearity.

Elastic Net is a regression model that combines aspects of ridge regression and Lasso (Least Absolute Shrinkage and Selection Operator) regression to handle multicollinearity and variable selection problems.

Traditional linear regression can suffer from multicollinearity problems when the independent variables are highly correlated with each other. Lasso regression addresses this problem by imposing a penalty on the sum of the absolute values of the coefficients during the training process, which tends to reduce some coefficients to zero, thus performing a kind of variable selection.

However, Lasso regression can be too strict in its variable selection, eliminating too many coefficients and potentially ignoring useful variables.

Elastic Net aims to overcome these limitations by combining the Lasso regression penalty with an additional penalty, corresponding to the L2 norm used in ridge regression. This allows Elastic Net to keep the advantages of Lasso regression in variable selection, while alleviating its tendency to drop too many variables, or to arbitrarily keep only one of them, when there are strong correlations between predictors.

The general form of the objective function for the Elastic Net is:

 \text{minimize} \left( \frac{1}{2n} ||\mathbf{y} - \mathbf{X}\beta||_2^2 + \lambda_1 ||\beta||_1 + \lambda_2 ||\beta||_2^2 \right)

Where:

  • \mathbf{y} is the response vector;
  • \mathbf{X} is the feature matrix;
  • \beta is the vector of predictor coefficients;
  • ||\cdot||_1 is the L1 norm;
  • ||\cdot||_2 is the L2 norm;
  • \lambda_1 and \lambda_2 are the regularization parameters.
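
For reference, the ElasticNet class in scikit-learn (used in the examples below) expresses the same objective through two parameters, alpha and l1_ratio, instead of \lambda_1 and \lambda_2:

 \text{minimize} \left( \frac{1}{2n} ||\mathbf{y} - \mathbf{X}\beta||_2^2 + \alpha \rho ||\beta||_1 + \frac{\alpha (1 - \rho)}{2} ||\beta||_2^2 \right)

where \alpha is the overall regularization strength (alpha) and \rho is the L1/L2 mixing parameter (l1_ratio), so that \lambda_1 = \alpha \rho and \lambda_2 = \alpha (1 - \rho) / 2.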

In summary, Elastic Net offers greater flexibility than Lasso regression and ridge regression, allowing you to effectively handle multicollinearity and variable selection problems.

Elastic Net with scikit-learn for Linear Regression

Elastic Net is integrated into the Python scikit-learn library, which is one of the most used libraries for machine learning. In scikit-learn, you can use the ElasticNet class to train Elastic Net regression models.

In this example, we will use an Elastic Net regression model trained on synthetic data generated with make_regression from scikit-learn. Once the dataset has been generated, it is split into two portions: a training set and a test set. We use the fit method to train the model on the training set and, once it is trained, we make predictions on the test set with predict. To evaluate the performance of the model, the mean squared error is used as a metric; this value is easily obtained from the mean_squared_error function provided by scikit-learn. Here is the code that performs all these tasks:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data for the example
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Elastic Net model
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)

# Train the model on the training data
elastic_net.fit(X_train, y_train)

# Make predictions on the test set
predictions = elastic_net.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

Running the code, you get the MSE value:

Mean Squared Error: 176.0283275056508

The MSE should be as small as possible, but read on its own, in absolute terms, it does not tell us much. We can use a very useful graphical representation that allows us to understand how the predicted values differ from the actual ones across the entire range of the dataset. Here is the code to generate the graph:

import matplotlib.pyplot as plt

# Plot of predictions versus actual values
plt.scatter(y_test, predictions)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predictions")
plt.title("Scatter plot: Actual Values vs Predictions")
plt.show()

Executing the code, you obtain the graph described previously.

Elastic Net linear regression - scatter plot dataset

As we can see, the points do not deviate much from the red diagonal (predicted value = actual value), so in this case our model proved to be a good predictor.

A metric that gives us similar information is R^2 (the coefficient of determination), a common metric used to evaluate how well regression models predict. R^2 measures the proportion of variance in the dependent variable that is explained by the model: a value closer to 1 indicates a better model, while a value closer to 0 indicates a worse one.
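
In formula form, R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, where \hat{y}_i are the predicted values and \bar{y} is the mean of the observed values; this is also the quantity computed by scikit-learn’s r2_score used below.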

from sklearn.metrics import r2_score

r2 = r2_score(y_test, predictions)
print("Coefficient R^2:", r2)

Running the code, we get the value of the metric:

Coefficient R^2: 0.9970546671780763

As we can see, the value is very close to 1, which confirms what we observed in the previous graph.

Example with the diabetes dataset for a Linear Regression with Elastic Net

So far the model has performed very well in predicting artificially generated values. But how will it behave with a dataset of real data, such as the diabetes dataset provided by the scikit-learn library?

The “diabetes” dataset included in the scikit-learn library is an example dataset that contains measurements derived from diabetes patients. It is commonly used for machine learning and data analysis purposes.

Here is a quick description of the characteristics of the “diabetes” dataset:

  • Number of samples: 442
  • Number of attributes/predictors: 10
  • Type of attributes/predictors: Numeric (float)
  • Response variable/target: A quantitative measure of diabetes disease progression one year after baseline.

The 10 attributes/predictors represent:

  • Age: Age of patients.
  • Sex: Sex of patients (encoded numerically; in the scikit-learn version of the dataset this feature, like the others, is mean-centered and scaled).
  • Body mass index (BMI): Body mass index.
  • Average blood pressure (BP): Average blood pressure.
  • S1, S2, S3, S4, S5, S6: Six blood serum measurements.

The target is a quantitative measure of diabetes disease progression one year after baseline.

This dataset is often used for educational examples and to evaluate the performance of regression algorithms in machine learning. Let’s see how our Elastic Net model behaves in this regard.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

# Load the "diabetes" dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Elastic Net model
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)

# Train the model on the training data
elastic_net.fit(X_train, y_train)

# Make predictions on the test set
predictions = elastic_net.predict(X_test)

# Calculate the Mean Squared Error (MSE) and the coefficient R^2
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error (MSE):", mse)
print("Coefficient R^2:", r2)

Running the code, you get the following values:

Mean Squared Error (MSE): 4775.466767154695
Coefficient R^2: 0.09865421116113748

These are certainly not excellent results…

Improving the model: the search for optimal parameters

You might consider adopting an optimal parameter search strategy using random search to explore different combinations of hyperparameters for your Elastic Net model. Here is an example of how we could do this using scikit-learn’s RandomizedSearchCV class:

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the parameter grid to explore
param_grid = {
    'alpha': np.linspace(0.1, 1.0, 10),  # alpha values from 0.1 to 1.0
    'l1_ratio': np.linspace(0.1, 0.9, 9)  # l1_ratio values from 0.1 to 0.9
}

# Initialize the Elastic Net model
elastic_net = ElasticNet(random_state=42)

# Search for optimal parameters using random search
# (the grid has 10 x 9 = 90 combinations, so n_iter=90 evaluates all of them)
random_search = RandomizedSearchCV(estimator=elastic_net, param_distributions=param_grid, n_iter=90, cv=5, scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train)

# Get the best model
best_elastic_net = random_search.best_estimator_

# Make predictions on the test set
predictions = best_elastic_net.predict(X_test)

# Calculate Mean Squared Error (MSE) and Coefficient R^2
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error (MSE):", mse)
print("Coefficient R^2:", r2)
print("Best parameters:", random_search.best_params_)

In this code, we define a grid of parameters to explore for alpha and l1_ratio, and use RandomizedSearchCV to perform a random search over this grid. After the search, we take the best model found (best_estimator_, which RandomizedSearchCV refits on the whole training set) and evaluate its performance on the test set. Finally, we print the MSE, the R^2 and the best parameters found.

Mean Squared Error (MSE): 3792.129166396345
Coefficient R^2: 0.2842543312471031
Best parameters: {'l1_ratio': 0.9, 'alpha': 0.1}

We are still far away…
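
As an alternative to RandomizedSearchCV, scikit-learn also provides ElasticNetCV, which selects alpha by cross-validation along a regularization path for each candidate l1_ratio. Here is a minimal sketch; the l1_ratio grid, n_alphas and cv values are illustrative choices, not settings used above:

from sklearn.linear_model import ElasticNetCV

# Cross-validated selection of alpha for each candidate l1_ratio
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99], n_alphas=100, cv=5, random_state=42)
enet_cv.fit(X_train, y_train)

print("Chosen alpha:", enet_cv.alpha_)
print("Chosen l1_ratio:", enet_cv.l1_ratio_)
print("Test R^2:", enet_cv.score(X_test, y_test))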

Improving the model: data normalization

The results obtained, although improved compared to the previous iteration, are still not satisfactory. However, we can continue to explore additional strategies to try to improve the model’s performance.

Data normalization is a common practice in machine learning that can help improve model performance, especially when the data features have different scales. We can use standardization or min-max normalization to normalize the data.

Here is an example of how we can apply data standardization using scikit-learn’s StandardScaler and then retrain the Elastic Net model:

from sklearn.preprocessing import StandardScaler

# Data standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training the Elastic Net model on standardized data
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.9, random_state=42)
elastic_net.fit(X_train_scaled, y_train)

# Predictions on the test set
predictions = elastic_net.predict(X_test_scaled)

# Calculating MSE and R^2
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error (MSE) with normalization:", mse)
print("Coefficient R^2 with normalization:", r2)

In this code, we apply data standardization using StandardScaler on training and test data. Next, we train the Elastic Net model on the standardized data and calculate the MSE and R^2 coefficient using the model’s predictions.

Mean Squared Error (MSE) with normalization: 2878.8291644017645
Coefficient R^2 with normalization: 0.45663520015111103

Compared to the previous runs, the predictive capacity of the model has improved considerably, but an R^2 of about 0.45 is still too low. Normalizing the data can help the model converge more quickly and can improve overall performance; however, it is important to test its effect and verify whether it actually improves the results.

Improving the Model: Feature Engineering

Feature engineering is an important step in the process of developing a machine learning model. We can explore different transformations of existing features or create new features based on existing ones to try to better capture the relationships between the independent and dependent variables.

Here are some examples of possible feature engineering techniques we might consider for the “diabetes” dataset:

  • Polynomial Features: We can create new features as polynomials of existing features, for example by adding quadratic or cubic features to capture non-linear relationships.
  • Interactions between features: We can create new features as interactions between existing features, for example by multiplying two features together.
  • Feature transformations: We can apply transformations to existing features, such as logarithmic or quadratic, to capture nonlinear relationships or to improve data distribution.
  • Dimension Reduction: We can explore dimension reduction techniques such as PCA (Principal Component Analysis) to reduce the dimensionality of the dataset while retaining most of the information.
  • Encoding categorical variables: If the dataset contains categorical variables, we can explore different encoding techniques, such as one-hot encoding or label encoding, to represent these variables appropriately.
  • Adding external information: If available, we can add external information to the dataset that may be relevant to the problem, such as demographic data or patient information.

Among the possible options, we try dimensionality reduction with the PCA technique combined with polynomial features. To do this, we define a pipeline that combines data normalization with PCA, apply it to the training and test data, and then add polynomial features of degree 2.

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Definition of the pipeline with normalization and PCA
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95))  # Preserve 95% of variance
])

# Application of the pipeline to training and test data
X_train_pca = pipeline.fit_transform(X_train)
X_test_pca = pipeline.transform(X_test)

# Adding polynomial features of degree 2
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_pca)
X_test_poly = poly.transform(X_test_pca)

# Training the Elastic Net model on polynomial features
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.9, random_state=42)
elastic_net.fit(X_train_poly, y_train)

# Predictions on the test set
predictions = elastic_net.predict(X_test_poly)

# Calculating MSE and R^2
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error (MSE) with PCA and polynomial features:", mse)
print("Coefficient R^2 with PCA and polynomial features:", r2)

Mean Squared Error (MSE) with PCA and polynomial features: 2605.224776619983
Coefficient R^2 with PCA and polynomial features: 0.5082766783059008

It’s a further improvement in the model’s performance! Combining data normalization with PCA and polynomial features led to a further reduction in the mean squared error (MSE) and an increase in the R^2 coefficient, indicating that the model is providing better predictions.

We could go on like this. However, at this point (we are still around R^2 = 0.5) you should evaluate the possibility of using different methods and see whether they give better results, before proceeding with increasingly complex optimization processes.
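
One quick way to do such a comparison is to look at the cross-validated R^2 of Elastic Net and of a non-linear model on the same data. The sketch below is only illustrative: the RandomForestRegressor and its settings are assumptions made for the comparison, not part of the original example.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Compare the cross-validated R^2 of the linear (Elastic Net) model and of a
# tree-based model on the raw diabetes features X, y loaded earlier
models = [
    ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.9, random_state=42)),
    ("RandomForest", RandomForestRegressor(n_estimators=200, random_state=42)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean R^2:", scores.mean())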

When to use Elastic Net in Linear Regressions

The previous example shows that choosing the right model can make a real difference in predictive capability, depending on the characteristics of the dataset. Let’s look at some rules that can be useful in this regard.

The choice between Elastic Net and other linear regression methods depends on the specific characteristics of your problem and data. Here are some considerations for when it might be appropriate to choose Elastic Net over other linear regression methods:

  • Handling multicollinearity: If the dataset contains variables that are highly correlated with each other, Elastic Net may be preferable to standard linear regression (OLS), since it combines L1 and L2 regularization to better address multicollinearity (see the sketch after this list).
  • Variable selection: If your goal is to select a subset of predictor variables, Lasso may be preferable as it tends to reduce some coefficients to zero, thus performing automatic variable selection. However, if you want variable selection that is more stable and less prone to sampling errors, Elastic Net may be preferable.
  • Stable Predictions: If your main goal is to obtain stable predictions, Ridge may be preferable as it reduces the variance of the model while maintaining all variables. However, if you have a regression problem where some variables are deemed to be of little relevance and could be excluded, Elastic Net may be a better choice than Ridge.
  • Balance between bias and variance: Elastic Net tries to find a balance between bias (systemic error) and variance (sensitivity to training data). If you need a model with a good compromise between these two sources of error, Elastic Net may be an appropriate choice over other methods.
  • Robustness to violations of linear regression assumptions: Elastic Net is relatively robust to violations of linear regression assumptions, such as normality of residuals and homoscedasticity of errors, compared to standard linear regression.
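
To make the first consideration concrete, here is a small illustrative sketch (the synthetic data and the alpha values are arbitrary assumptions): it builds two nearly identical predictors and compares how Lasso and Elastic Net distribute the coefficients between them.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data with two nearly identical (highly correlated) predictors x1 and x2
rng = np.random.RandomState(42)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)  # almost a copy of x1
x3 = rng.randn(200)
X_corr = np.column_stack([x1, x2, x3])
y_corr = 3 * x1 + 3 * x2 + 0.5 * x3 + rng.randn(200)

# Compare how each model distributes weight across the correlated pair x1/x2:
# Lasso often concentrates it on one of the two, Elastic Net tends to spread it
print("Lasso coefficients:      ", Lasso(alpha=0.5).fit(X_corr, y_corr).coef_)
print("Elastic Net coefficients:", ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_corr, y_corr).coef_)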

Elastic Net is a regularization method that combines both L1 (lasso) regularization and L2 (ridge) regularization. It is particularly useful when there are many predictor variables in the dataset or when these variables are highly correlated with each other (multicollinearity).

However, there is no regression method that is universally optimal for all types of datasets. There are several reasons why Elastic Net may not work well with a specific dataset like “diabetes”:

  • Dataset size: Elastic Net tends to work best when there are many predictor variables relative to the number of observations in the dataset. If the dataset has a limited number of observations relative to the number of variables, regularization methods such as Elastic Net may not be able to adequately capture the structure of the data.
  • Complex nonlinear relationships: Elastic Net is a linear model and may not be able to capture complex nonlinear relationships present in your data. In that case, you may need to use more complex models such as neural networks or decision tree-based models.
  • Little correlation between predictor variables: If the predictor variables in the dataset are not strongly correlated with each other, adding Elastic Net L1 regularization may not lead to a significant reduction in coefficients. In this case, it may be better to use Ridge’s L2 regularization, which tends to work better when the variables are poorly correlated.
  • Non-normalized features: Elastic Net and other regularized regression methods can be sensitive to feature scale. If the features are not normalized or standardized properly, this could negatively affect the performance of the model.

In summary, if Elastic Net does not work well with a certain dataset such as “diabetes”, you may need to explore other options, such as hyperparameter tuning, feature engineering, or using more complex models, to achieve better performance.
