Ridge Regression for Linear Regression with scikit-learn in Machine Learning


Ridge Regression is a supervised learning technique that adds a regularization term, called the “ridge penalty”, to the objective function. This helps prevent over-sensitivity to the training data, reducing overfitting. The regularization is controlled by a parameter λ, which balances reducing model complexity against minimizing the training error.

Ridge Regression

Ridge Regression was first introduced by Hoerl and Kennard in 1970 as a method for dealing with multicollinearity problems in linear regression models. Initially, it was proposed as a technique to improve coefficient estimates in cases where independent variables are highly correlated with each other, causing instability in coefficient estimates and amplification of variance.

The name “Ridge Regression” comes from the fact that the method adds a “ridge” to the correlation matrix to improve the stability of the coefficient estimates. This ridge is achieved through the addition of a regularization term, also known as the L2 penalty, which consists of the sum of the squares of the coefficients multiplied by a regularization parameter λ (also called alpha in some implementations such as in Scikit-learn).

Ridge Regression has become an important technique in the field of statistics and machine learning, particularly for addressing problems of multicollinearity and overfitting in linear models. It is one of the most common regularization techniques along with Lasso Regression (which uses an L1 penalty), and together they form the basis of regularized linear regression.

In Ridge Regression the objective function to be minimized is given by:

 J(\theta) = \sum_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)}))^2 + \lambda \sum_{j=1}^{n} \theta_j^2

Where:

  • m is the number of examples in the training dataset
  • n is the number of features
  • \theta is the vector of model coefficients
  • h_{\theta}(x) is the hypothesis function
  • \lambda is the regularization parameter (a non-negative value)

The regularization is represented by the \lambda \sum_{j=1}^{n} \theta_j^2 term, which penalizes large values of the \theta coefficients, thus limiting the complexity of the model.

The \lambda parameter controls the strength of the regularization: larger values shrink the coefficients more aggressively, while \lambda = 0 reduces the objective to ordinary least squares.
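To make the role of λ concrete, here is a minimal NumPy sketch of the closed-form ridge solution  \theta = (X^T X + \lambda I)^{-1} X^T y . The function name ridge_closed_form and the example numbers are made up purely for illustration; in practice you would use scikit-learn's implementation, shown in the next section.

import numpy as np

def ridge_closed_form(X, y, lam):
    # Closed-form ridge solution: theta = (X^T X + lam * I)^(-1) X^T y.
    # The lam * I term is the "ridge" added to X^T X; the intercept is ignored here for simplicity.
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Tiny made-up example
X_demo = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y_demo = np.array([3.0, 2.5, 5.0, 7.0])
print(ridge_closed_form(X_demo, y_demo, lam=0.0))   # lam = 0 reduces to ordinary least squares
print(ridge_closed_form(X_demo, y_demo, lam=10.0))  # a larger lam shrinks the coefficients toward zero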

Ridge Regression with scikit-learn

Ridge Regression is provided by scikit-learn through the Ridge class for linear regression. This class implements linear regression with Ridge regularization using the L2 penalty. You can use the Ridge class to train Ridge Regression models and make predictions on new data. Here is an example of how to use Ridge Regression with scikit-learn:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate random data for the example
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

make_regression is a Scikit-learn function that generates random data for regression problems. It is useful for creating sample datasets for testing, experimentation, and proof of concept purposes.

In our example, we are using make_regression to generate a sample dataset.

  • n_samples: specifies the number of samples to generate in the dataset. In this case, we generated 100 samples.
  • n_features: specifies the number of features (independent variables) to generate. Here we have generated only one feature.
  • noise: specifies the standard deviation of the Gaussian noise added to the target. The higher the value, the noisier the data. In this case, we set the noise to 10.
  • random_state: allows you to set the seed for random data generation, ensuring that the generated data is reproducible. In this case, we set the seed to 42 for reproducibility reasons.

make_regression returns two numpy arrays, X and y. X contains the randomly generated features, while y contains the corresponding target values (dependent variables).

In this case, we generated a dataset with a single feature (n_features=1) and an associated target.
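If you want to verify what make_regression produced before going further, a quick optional inspection might look like this:

# Optional: inspect the generated arrays
print(X.shape)  # (100, 1): 100 samples, 1 feature
print(y.shape)  # (100,): one target value per sample
print(X[:3])    # first three feature values
print(y[:3])    # first three target values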

Now we can split the data into training and test sets, train a Ridge Regression model on the training data, and evaluate the model’s performance on the test data.

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Ridge Regression model
ridge_regression = Ridge(alpha=1.0)  # alpha is the regularization parameter (equivalent to lambda)
ridge_regression.fit(X_train, y_train)

# Make predictions on the test data
y_pred = ridge_regression.predict(X_test)

# Plot the results
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.title('Ridge Regression')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.show()

Executing the code, the following graph is obtained:

Ridge Regression - scatter plot
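In the code above, alpha=1.0 was chosen arbitrarily. If you prefer a data-driven choice of the regularization strength, scikit-learn also provides the RidgeCV class, which selects alpha by cross-validation. The sketch below reuses X_train and y_train from the previous snippet; the grid of candidate alphas is arbitrary and only meant as an illustration.

from sklearn.linear_model import RidgeCV

# Candidate regularization strengths (arbitrary grid, for illustration only)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV fits a Ridge model for each candidate alpha and keeps the one
# that performs best under cross-validation
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train, y_train)

print("Selected alpha:", ridge_cv.alpha_)
print("Coefficient:", ridge_cv.coef_)
print("Intercept:", ridge_cv.intercept_)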

Evaluation of the validity of the model

To evaluate the validity of the Ridge Regression model created in the previous code, we can use several metrics. Here are some common ones we can calculate:

  • Mean Squared Error (MSE): The mean squared error measures the average of the squared errors between the values predicted by the model and the actual values in the test set. A lower MSE indicates a better fit of the model to the test data.
  • Coefficient of Determination  R^2 : The coefficient of determination provides a measure of how well the model fits the data. A value closer to 1 indicates a better model, while a value closer to 0 indicates that the model is not able to explain the variability in the data well.
  • Residuals Graph: We can also display the residuals, which are the differences between the actual values and the predicted values. A residuals plot should show a random distribution around zero with no obvious pattern. If we see a pattern in the residuals, it may indicate that the model is unable to fully capture the structure of the data.

These are just some of the metrics that can be used to evaluate the validity of the model. It is important to use more than one metric to get a complete view of your model’s performance. Let’s start with the mean squared error (MSE) and the coefficient of determination of our model.

from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate the coefficient of determination (R²)
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

Running the code, we get the following evaluation values:

Mean Squared Error: 105.78604284136125
R-squared: 0.9364639057612545

The results obtained provide some important information on the quality of the model:

Mean Squared Error (MSE): 105.79

This value indicates the average of the squared errors between the model predictions and the actual values in the test dataset. A lower MSE indicates that the model produces more accurate predictions. In our case, an MSE of about 105.79 means that, on average, the squared difference between the model’s predictions and the true values is about 105.79.

Coefficient of Determination (R-squared): 0.936

This value represents the proportion of variance in the response data that is explained by the model. In other words, it is a measure of how well the model fits the data. R-squared generally ranges from 0 to 1, where a value closer to 1 indicates a model that explains more of the variance in the data. In our case, an R-squared of about 0.94 indicates that the model explains about 94% of the variance in the response data, which is a very good result.
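To see exactly what these two numbers measure, we can also recompute them directly from their definitions; the following sketch should reproduce the values returned by scikit-learn up to floating-point rounding.

import numpy as np

# Mean squared error: average of the squared differences between actual and predicted values
mse_manual = np.mean((y_test - y_pred) ** 2)

# R-squared: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual MSE:", mse_manual)
print("Manual R-squared:", r2_manual)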

Now let’s move on to the evaluation with the residual graph. To do this we add the following code:

residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.show()

Executing the code, we obtain the following residual plot:

Ridge Regression - residual plot

The purpose of the residual plot is to check whether the residuals show signs of heteroskedasticity or homoscedasticity.

Heteroskedasticity: Occurs when the variance of the prediction errors is not constant across all levels of the predictor variable. In practice, this means that prediction errors tend to be larger or smaller depending on the values of the independent variables. Heteroskedasticity can negatively affect model accuracy and interpretation: the coefficient estimates become less efficient and the usual error estimates may be unreliable. To address heteroskedasticity, you may need to transform variables or use robust estimation techniques that take the variation in variance into account.

Homoscedasticity: Occurs when the variance of prediction errors is constant across all levels of the predictor variable. In this case, the prediction errors are uniformly distributed across all levels of the independent variables. The presence of homoscedasticity is desirable because it makes model parameter estimates and predictions more reliable. If the model exhibits homoscedasticity, it is not necessary to make specific corrections.

In our case, most of the points in the residual plot lie above the reference line, which could indicate the presence of heteroskedasticity in the data. This suggests that the variance of the prediction errors is not constant. You may therefore want to examine the data further and consider methods to address heteroskedasticity in order to improve the validity of the model.
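As a rough numerical complement to the visual inspection, we can compare the spread of the residuals for low and high predicted values. This is only an informal sketch (a formal test such as Breusch-Pagan would be more rigorous); roughly equal variances would be consistent with homoscedasticity.

import numpy as np

# Informal check: compare residual variance for the lower and upper halves of the predicted values.
# A large difference between the two variances hints at heteroskedasticity.
order = np.argsort(y_pred)
half = len(order) // 2
low_half = residuals[order[:half]]
high_half = residuals[order[half:]]

print("Residual variance (low predictions): ", np.var(low_half))
print("Residual variance (high predictions):", np.var(high_half))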

When to use Ridge Regression?

Ridge Regression is a linear regression technique that incorporates an L2 regularization term to reduce overfitting and improve model generalization. Here are some situations where Ridge Regression might be preferable to other linear regression methods provided by Scikit-learn:

  • Multicollinearity of Predictors: If the dataset contains highly correlated predictor variables (multicollinearity), Ridge Regression may be preferable because it helps stabilize the coefficient estimates, avoiding the instability that can occur in standard linear regression (see the sketch after this list).
  • Controlling overfitting: When you want to control overfitting and prevent excessive model complexity, Ridge Regression is an appropriate choice due to the L2 regularization term that limits the magnitude of the coefficients.
  • High-dimensional datasets: In the presence of datasets with a large number of (high-dimensional) predictor variables, Ridge Regression may be preferable because it helps manage the trade-off between model complexity and fit to the data.
  • Robust Coefficient Estimates: Ridge Regression produces more stable and robust coefficient estimates than standard linear regression, especially when the dataset is noisy or contains data with high variance.
  • No feature selection: If you do not need to perform automatic feature selection and prefer to keep all features in the model, Ridge Regression may be a reasonable choice to manage model complexity.
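As an illustration of the multicollinearity point in the list above, the following sketch compares ordinary LinearRegression with Ridge on synthetic data containing two nearly identical features; the data and the alpha value are made up purely for demonstration.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)

# Two almost perfectly correlated features (made-up data)
x1 = rng.randn(50)
x2 = x1 + 0.01 * rng.randn(50)      # x2 is nearly a copy of x1
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + rng.randn(50)     # the target depends on the shared signal

ols = LinearRegression().fit(X_corr, y_corr)
ridge = Ridge(alpha=1.0).fit(X_corr, y_corr)

# OLS tends to split the weight between the correlated features in an unstable way,
# while Ridge shrinks the coefficients toward similar, smaller values
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)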

However, it is important to note that the choice of regression method depends on the specifics of the problem, the nature of the data, and the objectives of the analysis. Other regression techniques such as standard linear regression, Lasso regression, or other forms of regularization may be more suitable in certain contexts. Therefore, it is advisable to carefully examine the characteristics of the dataset and test multiple models to determine which is most appropriate for your specific situation.
