Linear regression with Lasso in Machine Learning with scikit-learn


Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression technique that uses L1 regularization to improve generalization and perform variable selection. By shrinking some coefficients exactly to zero, it selects the most important variables and produces simpler, more interpretable models.

The LASSO (Least Absolute Shrinkage and Selection Operator) regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression was first introduced by Robert Tibshirani in 1996. It was developed as a regularization technique for linear regression, with the main goals of reducing overfitting and performing variable selection.

The concept of Lasso regression emerged as a solution to the variable selection problem, which arises when a regression model has a large number of explanatory variables. In these situations many variables may be irrelevant for predicting the outcome, yet they still influence the fit, leading to good performance on the training data but poor generalization on new data.

Lasso regression addresses this problem by adding an L1 penalty, the sum of the absolute values of the model coefficients, to the training objective. This penalty causes some coefficients to become exactly zero, thus reducing the number of variables used in the model. This automatic variable selection makes Lasso regression particularly useful when you want to identify the most important predictors among a large number of explanatory variables.

In the years since its introduction, Lasso regression has gained significant popularity in the scientific community and within machine learning, becoming one of the most widely used regularized regression methods alongside other techniques such as Ridge regression and Elastic Net.

Lasso regression is based on the minimization of a cost function that includes two terms: a mean squared error (MSE) term and an L1 penalty term.

The objective function of Lasso regression can be expressed as:

 \text{minimize } \frac{1}{2n} ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2_2 + \alpha ||\mathbf{w}||_1

Where:

  •  \mathbf{y} is the vector of observed responses.
  •  \mathbf{X} is the feature matrix.
  •  \mathbf{w} is the vector of coefficients of the model to be learned.
  •  ||\cdot||_2 represents the L2 (Euclidean) norm.
  •  ||\cdot||_1 represents the L1 norm.
  •  \alpha is the regularization parameter that controls the level of penalty.

The term  \frac{1}{2n} ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2_2 represents the mean squared error (MSE) part of the objective, while  \alpha ||\mathbf{w}||_1 represents the L1 penalty part.

The L1 penalty (absolute sum of coefficients) encourages the model coefficients to become exactly zero, thus reducing the complexity of the model and leading to variable selection. This is useful for creating simpler and more interpretable models, as well as helping to prevent the problem of overfitting.
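
To make the formula above concrete, here is a minimal NumPy sketch (not scikit-learn's internal implementation) that evaluates the two terms of the cost function for a given coefficient vector:

import numpy as np

def lasso_objective(X, y, w, alpha):
    # Cost = (1 / 2n) * ||y - Xw||_2^2  +  alpha * ||w||_1
    n = len(y)
    mse_term = np.sum((y - X @ w) ** 2) / (2 * n)
    l1_term = alpha * np.sum(np.abs(w))
    return mse_term + l1_term

The larger alpha is, the more the L1 term dominates, which pushes the minimizer towards sparser coefficient vectors.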

Lasso Regression with Scikit-learn

Lasso regression is implemented in the scikit-learn library, which is one of the most popular libraries for machine learning in Python. In scikit-learn, Lasso regression is available via the Lasso class within the linear_model module.

Let’s see an example that uses synthetic data, i.e. artificially generated data simulating a dataset that follows a linear trend with some background noise added on purpose.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

np.random.seed(0)
n_samples = 100
n_features = 10
X = np.random.randn(n_samples, n_features)
true_coefficients = np.random.randn(n_features)
y = X.dot(true_coefficients) + np.random.normal(0, 0.5, n_samples)

# Split data into training and test sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Create Lasso regression model
alpha = 0.1  # regularization parameter
lasso = Lasso(alpha=alpha)

# Train the model
lasso.fit(X_train, y_train)

# Predict on test data
y_pred = lasso.predict(X_test)

# Calculate mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print model coefficients
print("Coefficients:", lasso.coef_)


Running the code, you get the following results:

Mean Squared Error (MSE): 0.3927541814497199
Coefficients: [ 0.50765085  0.76293123 -0.3276433   0.          0.15763891  0.11216079
  0.42298338 -1.73556608 -0.          0.09392823]

We obtained the value of the Mean Squared Error (MSE), which is the average of the squared errors between the predicted values and the actual values; the lower the MSE, the better the model. We also obtained the coefficients of the Lasso model. These coefficients are the weights assigned to each explanatory variable and indicate how much each variable influences the target variable.

Here’s what these coefficients mean and what they are for:

  • Intercept (bias term): the intercept represents the expected value of the target variable when all the explanatory variables are zero. Note that it is not part of the coef_ array printed above: scikit-learn stores it separately in the intercept_ attribute of the fitted model.
  • Coefficients of the explanatory variables: each value in coef_ corresponds to one explanatory variable and indicates how much that variable affects the target variable. A positive coefficient indicates a positive association between the explanatory variable and the target variable, while a negative coefficient indicates a negative association.
  • Zero coefficients: some coefficients may be exactly zero. This means that the Lasso regression model excluded those variables from the final model. This is one of the advantages of Lasso regression: it can automatically select the most relevant variables and simplify the model by eliminating the less important ones.

Therefore, by examining these coefficients, you can understand which explanatory variables are considered important by the model and to what extent they influence the target variable. This information can be used to interpret the model and draw conclusions about the factors that influence the target variable in the context of your specific problem.
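
Continuing the example above, these points can be checked directly on the fitted model: the intercept lives in the intercept_ attribute (it is not part of coef_), and the excluded variables can be found by looking for the zero entries of coef_ (np.flatnonzero is used here only for convenience).

import numpy as np

# The intercept is stored separately from the feature coefficients
print("Intercept:", lasso.intercept_)

# Indices of the variables kept (non-zero coefficients) and of those excluded
kept = np.flatnonzero(lasso.coef_)
excluded = np.flatnonzero(lasso.coef_ == 0)
print("Selected variables:", kept)
print("Excluded variables:", excluded)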

The coefficients can also be displayed graphically to better see the contribution of each variable.

import matplotlib.pyplot as plt

# Lasso model coefficients
lasso_coefficients = lasso.coef_

# Variable indices
indices = np.arange(len(lasso_coefficients))

# Plot coefficients
plt.figure(figsize=(10, 5))
plt.bar(indices, lasso_coefficients, color='b')
plt.xlabel('Variable Index')
plt.ylabel('Coefficient')
plt.title('Lasso Regression Model Coefficients')

# Add a red line for zero coefficients
plt.axhline(y=0, color='r', linestyle='--')

plt.xticks(indices)
plt.grid(True)
plt.show()

Lasso Regression - coefficients

It can be clearly seen that features 4 and 9 (indexes 3 and 8) have coefficients equal to zero: Lasso has excluded them from the model. In our synthetic dataset they correspond to the following two columns of X_train:

print(X_train[:, 3])
print(X_train[:, 8])
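
As a further check, since we generated the data ourselves, we can place the coefficients learned by Lasso next to the true_coefficients used to build y and see which ones were suppressed:

import numpy as np

# Compare the coefficients learned by Lasso with those used to generate the data
comparison = np.column_stack((true_coefficients, lasso.coef_))
print("true vs. Lasso coefficients:")
print(comparison)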

Returning to the results obtained from our model, we can evaluate the goodness of the prediction graphically using the following code:

import matplotlib.pyplot as plt
import numpy as np

# Plot predicted values vs actual values
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # Red dashed diagonal
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Lasso Regression)")
plt.show()

By executing this you obtain a graph in which the predicted values are compared to the actual values. The points should lie as close as possible to the dashed red diagonal (predicted value = actual value). How tightly the points cluster around this line over its whole extent shows how good the model is at making predictions.

Lasso Regression - scatter plot 01

Real example of Linear Regression with the scikit-learn diabetes dataset

In the previous example we used synthetic data to show how a linear regression works. Now we will move on to a real dataset provided by the scikit-learn library and used to test the models’ ability to predict outcomes: the Diabetes dataset.

This dataset is widely used for evaluating the performance of regression models. It contains diabetes-related information for 442 patients, along with disease progression after one year, measured via a continuous response variable. The dataset contains only 442 instances, with 10 predictor variables. Despite its small size, it is realistic and represents a typical regression problem where you want to predict disease progression from different clinical measurements.

The predictor variables include age, sex, body mass index, average blood pressure, and six blood serum measurements, covering a range of information relevant to diabetes progression. Due to its size and the presence of a continuous response variable, the dataset is well suited to evaluating regression models: you can train different models, such as Linear Regression, Lasso Regression, Ridge Regression and others, and evaluate their performance using cross-validation techniques or simply by splitting the data into a training set and a test set.

We then load the diabetes dataset. To have a detailed description of the dataset we can use diabetes.DESCR.

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes()

# Display dataset description
print("\nDiabetes dataset description:")
print(diabetes.DESCR)

One way to view and manage its content is to use pandas dataframes.

import pandas as pd

# Create a DataFrame with the data and column names
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Add target column to the DataFrame
df['target'] = diabetes.target

# Display the first 5 rows of the DataFrame
df.head()

With the head() function you get the first 5 rows of the dataframe, enough to take a look and understand the content of the dataset and how it is structured.

LASSO regression - diabetes dataset

The last column represents the target. This value in the diabetes dataset is the progression of diabetes disease after one year of treatment, measured by a continuous variable. This variable represents a quantitative measure of disease progression and is used as a response or target variable in regression models. The goal is to use the other explanatory variables in the dataset to predict this target variable, in order to understand which factors influence the progression of diabetes.

Essentially, our goal is to use the information provided by the other variables in the dataset (such as age, sex, body mass index, and blood serum measurements) to predict the progression of the diabetes disease represented by the target. This allows us to better understand the factors that influence the progression of diabetes and can help in the diagnosis and treatment of the disease.

Now let’s apply the Lasso linear regression model. We divide the dataset into training set (80%) and testing set (20%) and then use the first for model learning and the second for evaluating predictions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

X = diabetes.data
y = diabetes.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Lasso regression model
alpha = 0.1  # regularization parameter
lasso = Lasso(alpha=alpha)

# Train the model on the training set
lasso.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Print the model coefficients
print("Coefficients:", lasso.coef_)

Running the code, you get the following results:

Mean Squared Error: 2798.193485169719
Coefficients: [   0.         -152.66477923  552.69777529  303.36515791  -81.36500664
   -0.         -229.25577639    0.          447.91952518   29.64261704]

As you can see, some coefficients are exactly zero: the features of the diabetes dataset corresponding to these null coefficients are considered irrelevant by the model for predicting disease progression.

In our case, the coefficients that have been set to zero are associated with the features at indices 0, 5, and 7. Since feature indices in Python start from zero, the corresponding features are:

  • Feature 0: age
  • Feature 5: s2 (ldl, low-density lipoproteins)
  • Feature 7: s4 (tch, total cholesterol / HDL ratio)

These features may not contribute significantly to predicting disease progression in this dataset, and the Lasso regression model therefore excluded them, reducing the complexity of the model.
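
Instead of reading the indices off the printed array by hand, the mapping between coefficients and feature names can be obtained programmatically, since the dataset exposes the names in diabetes.feature_names:

# Pair each coefficient with its feature name and flag the excluded ones
for name, coef in zip(diabetes.feature_names, lasso.coef_):
    status = "excluded" if coef == 0 else "kept"
    print(f"{name:>4}: {coef:10.4f}  ({status})")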

This behavior is typical of Lasso regression, since L1 regularization induces sparsity, i.e. causes some coefficients to become exactly zero. This makes Lasso regression particularly useful for selecting variables and creating simpler, more interpretable models.
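
To see this sparsity-inducing effect in practice, the following sketch refits the model on the same training set for a few (arbitrarily chosen) values of alpha and counts how many coefficients survive:

import numpy as np
from sklearn.linear_model import Lasso

# Count the non-zero coefficients as the regularization strength increases
for a in [0.01, 0.1, 0.5, 1.0, 2.0]:
    model = Lasso(alpha=a).fit(X_train, y_train)
    n_nonzero = np.count_nonzero(model.coef_)
    print(f"alpha={a}: {n_nonzero} non-zero coefficients out of {model.coef_.size}")

The larger alpha becomes, the fewer variables are kept in the model.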

If we also want a graphical evaluation:

import matplotlib.pyplot as plt
import numpy as np

# Plot of predicted values vs actual values
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # Red dashed diagonal
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Lasso Regression)")
plt.show()

By running this code you get the following graph:

Lasso Regression - scatter plot 02

Evaluating the LASSO Regression Model

We have seen how to graphically evaluate the results of the model’s prediction, observing how far the points move away from the central diagonal. But in this regard, there are several metrics that can be used to evaluate the goodness of the regression model. Some of the common metrics include:

  • Mean Squared Error (MSE): It is the average of the squares of the errors between the predicted values and the actual values. The lower the MSE value, the better the model.
  • Root Mean Square Error (RMSE): It is the square root of the MSE and provides an estimate of the dispersion of errors. As with MSE, the lower the RMSE value, the better the model.
  • Mean Absolute Error (MAE): It is the average of the absolute errors between the predicted values and the actual values. It is less sensitive to outliers than MSE.
  • Coefficient of Determination (R², or R-squared): Indicates the proportion of variance in the dependent variable that is explained by the independent variables in the model. It usually ranges from 0 to 1 (it can even be negative for models that fit worse than predicting the mean), where 1 indicates a perfect fit of the model to the data.

Regarding Lasso regression, usually MSE, RMSE and R-squared are the most commonly used metrics to evaluate the goodness of the model. For example, in the code provided above, we used MSE to evaluate model performance. However, it is always a good idea to use more than one metric to get a more complete assessment of your model’s performance.

Let’s see how to calculate the MSE, RMSE, MAE and R² metrics in the context of the Lasso regression model we created:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared (R^2):", r2)

Running the code, we obtain the following metric values:

Mean Squared Error (MSE): 2798.193485169719
Root Mean Squared Error (RMSE): 52.897953506442185
Mean Absolute Error (MAE): 42.85442771664998
R-squared (R^2): 0.4718547867276227

To evaluate the results obtained, we can interpret each of the metrics in the following way:

  • Mean Squared Error (MSE): The MSE value is approximately 2798.19. Since the MSE measures the average of the squared errors, a lower value is preferable. However, the judgment of MSE depends on the context of the problem and the units of the target variable. In general, we can say that an MSE of this order of magnitude could indicate that the model does not have good accuracy and that errors can be quite high.
  • Root Mean Squared Error (RMSE): The RMSE is approximately 52.90. Since the RMSE is the square root of the MSE, it measures the standard deviation of the model errors. A lower RMSE indicates a better fit of the model to the data. However, here too, the evaluation depends on the context. In general, an RMSE of this order of magnitude would suggest that the model has some discrepancy from the actual data.
  • Mean Absolute Error (MAE): The MAE value is approximately 42.85. The MAE is the average of the absolute errors between the predicted and actual values. A lower MAE is preferable, as it indicates that the model has better accuracy. An MAE of this order of magnitude suggests that, on average, the model makes an error of about 42.85 units in its predicted values.
  • R-squared (R²): The coefficient R² is approximately 0.47. This value represents the proportion of variance in the dependent variable that is explained by the independent variables in the model. A value closer to 1 indicates a better model. However, an R² value of approximately 0.47 suggests that the model explains only part of the variability in the target data.

Overall, based on these metrics, we can conclude that the Lasso regression model may not be very accurate or robust for predicting disease progression in the considered diabetic dataset. You may need to explore other modeling techniques or tune model parameters to improve performance.
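
One simple improvement is to let cross-validation choose the regularization parameter instead of fixing alpha = 0.1 by hand. scikit-learn provides the LassoCV class for this purpose; a minimal sketch on the same train/test split:

from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# Select alpha by 5-fold cross-validation on the training set
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)

print("Best alpha:", lasso_cv.alpha_)
print("Test MSE:", mean_squared_error(y_test, lasso_cv.predict(X_test)))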

When to use Lasso for Linear Regression problems

The Lasso method is particularly useful in several contexts of linear regression problems. Here are some cases where the Lasso method may be an appropriate choice:

  • Selection of variables: When you are dealing with a large number of explanatory variables and you want to identify the most important ones for predicting the outcome, the Lasso method can be used to automatically select the relevant variables. The L1 penalty applied by Lasso regression causes some coefficients to become exactly zero, thus eliminating less informative variables and simplifying the model.
  • Dimensionality Control: In situations where you want to reduce the complexity of the model and prevent overfitting, the Lasso method can be used to reduce the dimensionality of the model. By reducing the number of variables used in the model, the possibility of overfitting is reduced and the model can be generalized to new data.
  • Interpretation of Coefficients: Because the Lasso method tends to drive some model coefficients to zero, the resulting model is often more interpretable. This is especially useful when you want a clear understanding of the contribution of the explanatory variables to the target variable.
  • Prediction and model performance: Despite the regularization introduced by the L1 penalty, the Lasso method can still provide good prediction performance, especially in situations where variable selection is important and the number of observations is limited compared to the number of variables.

Overall, the Lasso method is a great choice when you want to select variables, control model dimensionality, and get clearer interpretations of model coefficients, while maintaining good forecasting performance.
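
As a closing illustration of the variable-selection use case, a Lasso estimator can also be wrapped in scikit-learn's SelectFromModel to keep only the features with non-zero coefficients and pass the reduced matrix on to another model. A minimal sketch, with an illustrative alpha value:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Use Lasso as a feature selector: a tiny threshold keeps every non-zero coefficient
selector = SelectFromModel(Lasso(alpha=0.1), threshold=1e-10)
selector.fit(X_train, y_train)

print("Selected feature mask:", selector.get_support())
X_train_selected = selector.transform(X_train)
print("Reduced training set shape:", X_train_selected.shape)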
