Gradient Boosting in Machine Learning with Python


The Gradient Boosting algorithm is a machine learning technique that relies on the sequential building of weak models, often decision trees, to build a stronger model capable of tackling regression and classification problems. The main goal of gradient boosting is to reduce model error by combining the weaknesses of individual models.


The Gradient Boosting Algorithm

Here’s how the Gradient Boosting algorithm works in general:

  1. Model initialization: You start with a simple model, known as the “base model” or “weak learner”. In regression problems, this might be a single constant value (for example, the mean of the targets in the training set). In binary classification problems, it could be the log-odds of the class probabilities.
  2. Residual calculation: The residuals between the current model’s predictions and the true target values are calculated. These residuals represent the error that remains in the model.
  3. Creation of a new weak learner: A new weak learner is trained to predict the residuals calculated in the previous step. This model tries to capture what the previous model’s predictions missed.
  4. Combined model update: The predictions of the new model are multiplied by a learning rate and then added to the predictions of the previous model. This step updates the combined model, moving it closer to the correct predictions.
  5. Iterations: Steps 2-4 are repeated for a set number of iterations, or until the error stops decreasing significantly.
  6. Final model: The final model is a weighted combination of all the weak learners, each one trained to correct the residual errors left by the models before it.

The key idea of gradient boosting is that each new weak learner focuses on the mistakes made by the previous models. Combining these weak models step by step progressively improves the performance of the overall model.
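To make these steps concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared-error loss, using shallow decision trees as weak learners. The function names gradient_boost_fit and gradient_boost_predict are illustrative, not part of any library:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # Step 1: initialize with a constant model (the mean of the targets)
    f0 = np.mean(y)
    predictions = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        # Step 2: for squared-error loss, the residuals are the negative
        # gradient of the loss with respect to the current predictions
        residuals = y - predictions
        # Step 3: fit a new weak learner (a shallow tree) to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: update the combined model, scaled by the learning rate
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Step 6: the final model is the initial constant plus the scaled
    # contributions of all the weak learners
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)

In practice you would not write this loop by hand: the libraries mentioned below implement it with many optimizations and refinements.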

In Python, you can use libraries like scikit-learn, XGBoost, LightGBM, and CatBoost to implement the Gradient Boosting algorithm. These libraries offer optimized implementations and let you customize several parameters to tailor the model to your specific needs.
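As a quick sketch of how similar their entry points are (assuming the xgboost, lightgbm, and catboost packages are installed; only scikit-learn is used in the rest of this article), the equivalent boosted-tree classifiers can be instantiated like this:

# Equivalent boosted-tree classifiers in four libraries; all expose a
# scikit-learn-style fit()/predict() interface. The xgboost, lightgbm,
# and catboost packages must be installed separately.
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    "scikit-learn": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "XGBoost": XGBClassifier(n_estimators=100, learning_rate=0.1),
    "LightGBM": LGBMClassifier(n_estimators=100, learning_rate=0.1),
    "CatBoost": CatBoostClassifier(iterations=100, learning_rate=0.1, verbose=0),
}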

A bit of history

Gradient Boosting is a machine learning technique that relies on combining weak models (often decision trees) in a sequential way to create a stronger model. It is one of the most powerful and effective techniques for regression and classification. Here is an overview of the history of gradient boosting:

90s: The idea of boosting was introduced by Robert Schapire in 1990. Building on it, Yoav Freund and Robert Schapire developed the “Adaptive Boosting” (AdaBoost) algorithm in 1995, one of the first practical boosting algorithms. AdaBoost focuses on classification problems and builds a strong model by combining weak models, each of which is iteratively trained to focus on the difficult examples.

Late 90s – Early 2000s: During this period, the boosting approach was further developed and improved. In 2001, Jerome Friedman introduced the Gradient Boosting Machine (GBM) algorithm, which extended the concept of boosting to regression problems as well. The GBM approach is based on optimizing a differentiable loss function through gradient descent.

2000s: The idea of Gradient Boosting continued to evolve as variations and improvements were introduced. In the early 2000s, work by Jerome Friedman, Trevor Hastie, and Robert Tibshirani consolidated the use of small decision trees as base models, in what became known as “Gradient Boosting Regression Trees” (GBRT). This made the approach even more powerful and flexible.

Over the following years, other variations and implementations of gradient boosting were developed. For example, XGBoost (eXtreme Gradient Boosting), released by Tianqi Chen in 2014, further improved the performance and execution speed of the approach. Later, other frameworks and libraries such as LightGBM and CatBoost were developed, offering even better performance and new features.

Today, Gradient Boosting and its variants are widely used in machine learning and data science practice. These techniques are recognized for their ability to handle complex data, reduce overfitting, improve generalization, and obtain accurate predictions on a broad range of regression and classification problems.

Suggested Book

If you are interested in Machine Learning with Python, I suggest you read this book:

Machine Learning with Python Cookbook

Gradient Boosting with the scikit-learn library

In Python, you can use the scikit-learn library to implement the Gradient Boosting algorithm and create models based on it. This algorithm can be applied to two different Machine Learning tasks:

  • Classification
  • Regression

Classification with Gradient Boosting

Here is an example of how you can use scikit-learn to build and train a Classification Gradient Boosting model using the Breast Cancer Wisconsin dataset:

Step 1: Import the necessary libraries

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In this section, we are importing the libraries needed to build and train the Classification Gradient Boosting model.

Step 2: Load the dataset and split the data

# Load the Breast Cancer Wisconsin dataset as an example
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here we are loading the Breast Cancer Wisconsin dataset using the load_breast_cancer() function and splitting the data into training and test sets using the train_test_split() function.

Step 3: Create and train the Gradient Boosting model

# Create the Gradient Boosting Classifier model
clf = GradientBoostingClassifier(random_state=42)

# Train the model on the training set
clf.fit(X_train, y_train)

In this step, we are creating a GradientBoostingClassifier object and training it on the training set using the fit() method.

Step 4: Make predictions and calculate accuracy

# Make predictions on the test set
predictions = clf.predict(X_test)

# Calculate the accuracy of the predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

In this step, we are making predictions on the test set using the trained model’s predict() method and then calculating the accuracy of the predictions using the accuracy_score() function. Running the code prints the accuracy value.

Accuracy: 0.956140350877193

You can also use visualizations to better assess how well the model performs.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix
import seaborn as sns

# Make probability predictions on the test set
probabilities = clf.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
roc_auc = auc(fpr, tpr)

# View the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of the Gradient Boosting Classifier')
plt.legend(loc='lower right')
plt.show()

# View the confusion matrix
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=breast_cancer.target_names, yticklabels=breast_cancer.target_names)
plt.xlabel('Predictions')
plt.ylabel('True Values')
plt.title('Confusion Matrix of the Gradient Boosting Classifier')
plt.show()

# View the importance of features
feature_importances = clf.feature_importances_
feature_names = breast_cancer.feature_names

plt.barh(range(len(feature_importances)), feature_importances, align='center')
plt.yticks(range(len(feature_names)), feature_names)
plt.xlabel('Importance of Features')
plt.title('Importance of Features in the Gradient Boosting Classifier')
plt.show()

When executed, this code produces three different visualizations. The first is the ROC curve: a graphical representation of the classifier’s performance that plots the true positive rate against the false positive rate as the decision threshold varies.

Gradient Boosting - ROC curve

The second is the confusion matrix: a heatmap showing the number of correct and incorrect predictions for each class. This can help you better understand where the model succeeds and where it fails.

Gradient Boosting - Confusion Matrix

The third is feature importance: a horizontal bar chart showing the relative importance of each feature in the Gradient Boosting Classifier model.

Gradient Boosting - Importance Features

Together, these steps form a complete example of how to use scikit-learn to build and train a Gradient Boosting model for a classification problem. You can further customize the model using the GradientBoostingClassifier hyperparameters to tailor it to your needs, as in the sketch below.
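For example, here is one possible configuration of the classifier’s main hyperparameters; the values are illustrative, not tuned:

# An illustrative (untuned) configuration of the classifier, assuming the
# same X_train, X_test, y_train, y_test split as above.
clf_tuned = GradientBoostingClassifier(
    n_estimators=200,     # number of boosting stages (weak learners)
    learning_rate=0.05,   # shrinks the contribution of each tree
    max_depth=3,          # depth of each weak learner
    subsample=0.8,        # fraction of samples used to fit each tree
    random_state=42
)
clf_tuned.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf_tuned.predict(X_test)))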

Python Data Analytics

If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book:

Python Data Analytics 3rd Ed

Fabio Nelli

Regression with Gradient Boosting

Here is an example of how you can use scikit-learn to build and train a Gradient Boosting Regression model using the Diabetes dataset:

Step 1: Import the necessary libraries

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

In this section, we are importing the libraries needed to build and train the Regression Gradient Boosting model.

Step 2: Load the dataset and split the data

# Load the Diabetes dataset as an example
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here we are loading the Diabetes dataset using the load_diabetes() function and splitting the data into training and test sets using the train_test_split() function.

Step 3: Create and train the Regression Gradient Boosting model

# Create the Regression Gradient Boosting model
reg = GradientBoostingRegressor(random_state=42)

# Train the model on the training set
reg.fit(X_train, y_train)

In this step, we are creating a GradientBoostingRegressor object and training it on the training set using the fit() method.

Step 4: Make predictions and calculate the mean squared error

# Make predictions on the test set
predictions = reg.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

In this step, we are making predictions on the test set using the trained model’s predict() method and then calculating the mean squared error of the predictions using the mean_squared_error() function. Running the code gives the following MSE value.

Mean Squared Error: 2898.4366729135227

As with classification, you can use visualizations to evaluate how well a newly developed Gradient Boosting regression model performs.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Display the scatterplot of actual versus predicted values
plt.scatter(y_test, predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Comparison between Actual Values and Predictions')
plt.show()

# View the importance of features
feature_importances = reg.feature_importances_
feature_names = diabetes.feature_names

plt.barh(range(len(feature_importances)), feature_importances, align='center')
plt.yticks(np.arange(len(feature_names)), feature_names)
plt.xlabel('Importance of Features')
plt.title('Importance of Features in Gradient Boosting Regressor')
plt.show()

# View the error distribution
errors = y_test - predictions
sns.histplot(errors, kde=True)
plt.xlabel('Errors')
plt.ylabel('Frequency')
plt.title('Error Distribution in the Gradient Boosting Regressor')
plt.show()

Running this code produces three different visualizations. The first is a scatterplot showing the relationship between the actual values and the model’s predictions. Ideally, the dots should align along the diagonal, indicating good predictions.

Gradient Boosting - Comparison chart

The second is feature importance: a horizontal bar chart showing the relative importance of each feature in the Gradient Boosting Regressor model.

Gradient Boosting - Importance features regression

The last is the error distribution: a histogram showing how the prediction errors are distributed, which helps you understand the model’s accuracy across different predictions.

Gradient Boosting - Error distribution

Together, these steps form a complete example of how to use scikit-learn to build and train a Gradient Boosting model for a regression problem. You can further customize the model using the hyperparameters of the GradientBoostingRegressor to tailor it to your needs, as in the sketch below.
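As an illustration, the following configuration also enables early stopping through n_iter_no_change, which halts training when the score on an internal validation split stops improving; the values are illustrative, not tuned:

# An illustrative (untuned) configuration of the regressor, assuming the
# same X_train, X_test, y_train, y_test split as above.
reg_tuned = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting stages
    learning_rate=0.05,       # shrinks the contribution of each tree
    max_depth=2,              # depth of each weak learner
    subsample=0.8,            # fraction of samples used per tree
    n_iter_no_change=10,      # stop early if the validation score stalls
    validation_fraction=0.1,  # share of training data held out for early stopping
    random_state=42
)
reg_tuned.fit(X_train, y_train)
print("Tuned MSE:", mean_squared_error(y_test, reg_tuned.predict(X_test)))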

Some datasets for practicing classification problems with scikit-learn

If you want to get some Machine Learning practice with classification problems, there are ready-made datasets to work on. Here are some datasets, other than the classic Iris dataset, that you can use for practical examples of classification with Gradient Boosting in scikit-learn:

  1. Breast Cancer Wisconsin (Diagnostic) Dataset: This dataset contains features extracted from images of fine needle aspirates of breast masses, and the goal is to classify whether a tumor is benign or malignant.
    • Load the dataset: from sklearn.datasets import load_breast_cancer
  2. Wine Dataset: This dataset contains chemical measurements of wines from three different cultivars. The goal is to classify the variety of wine.
    • Load the dataset: from sklearn.datasets import load_wine
  3. Digits Dataset: This dataset contains images of handwritten digits, and the goal is to classify which digit is represented.
    • Load the dataset: from sklearn.datasets import load_digits
  4. Heart Disease UCI Dataset: This dataset contains clinical information for patients, and the goal is to classify whether or not a patient has heart disease.
  5. Bank Marketing Dataset: This dataset contains information on bank marketing campaigns, and the objective is to classify whether or not a customer will subscribe to a term deposit.
  6. Titanic Dataset: This dataset contains information about the passengers of the Titanic, and the goal is to classify whether or not a passenger survived.

To use one of the first three datasets, import it with the corresponding load_* function provided by sklearn.datasets. The Heart Disease, Bank Marketing, and Titanic datasets are not bundled with scikit-learn and can be obtained from external sources such as the UCI Machine Learning Repository or OpenML (for example, via fetch_openml()). Be sure to read the documentation associated with each dataset to understand the features, the output variable, and how to prepare the data for Gradient Boosting model training.
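As a quick sketch, here is the earlier classification workflow applied to the Wine dataset from the list above; any of the built-in datasets can be swapped in:

# Train a Gradient Boosting classifier on the Wine dataset, reusing the
# same workflow shown earlier in this article.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))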

Some datasets for practicing regression problems with scikit-learn

Here are some datasets you can use for practical examples with Regression Gradient Boosting using scikit-learn:

  1. Boston Housing Dataset: This dataset contains data on homes in the Boston area, and the goal is to predict the median home value. (Note that load_boston was deprecated and removed in scikit-learn 1.2, so it is unavailable in recent versions.)
    • Load the dataset: from sklearn.datasets import load_boston
  2. Diabetes Dataset: This dataset contains medical measurements related to diabetes and the goal is to predict disease progression one year later.
    • Load dataset: from sklearn.datasets import load_diabetes
  3. California Housing Dataset: This dataset contains real estate data in California and the goal is to predict the median home value in different areas.
  4. Energy Efficiency Dataset: This dataset contains information on the energy performance of buildings and the objective is to predict energy efficiency.
  5. Concrete Compressive Strength Dataset: This dataset contains data on the compressive strength of concrete and the goal is to predict the compressive strength.
  6. Combined Cycle Power Plant Dataset: This dataset contains data on energy production in a power plant, and the goal is to predict the net hourly electrical energy output.

To use the Diabetes dataset, import it with load_diabetes(); the California Housing dataset is loaded with fetch_california_housing(), which downloads the data on first use. The remaining datasets are not bundled with scikit-learn and can be obtained from external sources such as the UCI Machine Learning Repository. Be sure to read the documentation associated with each dataset to understand the features, the output variable, and how to prepare the data for training the Gradient Boosting Regression model.
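As a quick sketch, here is the earlier regression workflow applied to the California Housing dataset from the list above:

# Train a Gradient Boosting regressor on the California Housing dataset,
# reusing the same workflow shown earlier in this article. Note that
# fetch_california_housing() downloads the data on first use.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

reg = GradientBoostingRegressor(random_state=42)
reg.fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, reg.predict(X_test)))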
