Decision Trees as a learning model in Machine Learning

Decision Trees

A Decision Tree is a machine learning model that represents a series of logical decisions based on attribute values. It takes the form of a tree structure used to make decisions or predictions from input data.

A decision tree consists of nodes and branches. Nodes represent decisions or tests on an attribute, while branches represent the possible outcomes of a decision or test. Decision trees are used for both classification and regression problems.
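
To make this concrete, here is a minimal sketch of the nested if/else logic a trained decision tree effectively encodes. The feature names and thresholds below are hypothetical, chosen only for illustration; a real tree learns them from data.

# A decision tree is equivalent to nested if/else rules learned from data.
# Features and thresholds here are hypothetical, chosen only to illustrate.
def predict_play_tennis(outlook: str, humidity: float, wind: str) -> str:
    if outlook == "sunny":            # test at the root node
        if humidity > 75.0:           # test at an internal node
            return "no"               # leaf node: a class label
        return "yes"
    elif outlook == "overcast":
        return "yes"
    else:                             # rainy
        return "no" if wind == "strong" else "yes"

print(predict_play_tennis("sunny", 80.0, "weak"))  # -> "no"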

Here’s how the process of creating a Decision Tree works:

  1. Attribute selection: The decision tree building algorithm chooses the best attribute to use as the root node. This choice is made based on criteria such as entropy, information gain, or Gini impurity (a sketch of these measures follows this list). The chosen attribute is used to divide the dataset into smaller subsets.
  2. Data splitting: The data is split according to the values of the chosen attribute. Each attribute value generates a branch in the tree, and the data is assigned to the corresponding branch.
  3. Iteration: The process of selecting the attribute and splitting the data is repeated for each subset of data in each internal node. This process continues until a stopping criterion is met, such as a maximum tree depth or a minimum number of samples in a node.
  4. Leaf Nodes: Once the splitting process reaches the leaf nodes, i.e. the final nodes of the tree, classification labels (in the case of a classification problem) or prediction values (in the case of a regression problem) are assigned.
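
The impurity criteria mentioned in step 1 are straightforward to compute. Here is a minimal sketch, using NumPy, of Gini impurity, entropy, and the information gain of a candidate split; the toy labels below are invented purely for illustration.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Entropy of the parent minus the weighted entropy of the child subsets
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy example: a candidate binary split of ten class labels (invented data)
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print("Gini impurity of parent:", gini(parent))
print("Information gain of split:", information_gain(parent, [left, right]))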

The advantages of decision trees include ease of interpretation, as decisions are intuitively represented as logical flows, and the ability to handle both numerical and categorical data. However, decision trees can be prone to overfitting, especially when they are deep and complex.

Decision trees can be further improved using techniques such as pruning, which reduces the complexity of the tree to improve generalization, and ensemble learning, where multiple decision trees are combined to form stronger models, such as Random Forest and Gradient Boosting.
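
As a concrete illustration of pruning, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of its tree estimators. The minimal sketch below compares an unpruned tree with a pruned one on the Breast Cancer Wisconsin dataset; the ccp_alpha value is an arbitrary choice for illustration and would normally be tuned, for example with cost_complexity_pruning_path or cross-validation.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unpruned tree versus a cost-complexity-pruned tree
# (ccp_alpha=0.01 is an arbitrary value chosen for illustration)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_train, y_train)

print("Unpruned depth:", full_tree.get_depth(), "- test accuracy:", full_tree.score(X_test, y_test))
print("Pruned depth:", pruned_tree.get_depth(), "- test accuracy:", pruned_tree.score(X_test, y_test))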

A bit of history

Decision trees have a long and interesting history in machine learning and artificial intelligence. Their evolution has led to the development of more advanced techniques, such as Random Forests and Gradient Boosting. Here is an overview of the history of decision trees in machine learning:

1960s: The idea behind decision trees has roots in the 1960s. Hunt, Marin, and Stone introduced the Concept Learning System (CLS) in 1966, with the goal of creating algorithms that could learn to make decisions through data-driven logical rules.

1970s: In the 1970s, the concept of decision trees was developed further by Michie and Chambers, who advanced the idea of building decision trees for classification problems.

1980s: In the 1980s, the concept of decision trees continued to evolve as new machine learning techniques were introduced. ID3 (Iterative Dichotomiser 3), developed by Ross Quinlan in 1986, is one of the first machine learning algorithms based on decision trees. ID3 used the concept of information gain to select the best attributes for splitting.

1990s: In the 1990s, decision tree methods were further improved with new algorithms and techniques. C4.5, introduced by Ross Quinlan in 1993, improved and extended ID3, adding support for continuous attributes and missing values and introducing tree pruning to improve generalization.

2000s and beyond: In the 2000s, interest in decision trees grew further. CART (Classification and Regression Trees), introduced by Breiman, Friedman, Olshen, and Stone in 1984, became widely adopted for both classification and regression problems. In addition, ensemble learning techniques based on decision trees were introduced, such as Random Forests (2001) by Leo Breiman and Gradient Boosting (2001) by Jerome Friedman.

Today, decision trees and their variants are widely used in machine learning and data science. They are valued for their ease of interpretation, their ability to handle heterogeneous data, and their applicability to both classification and regression problems. Decision trees are often combined into ensemble methods such as Random Forests and Gradient Boosting to further improve model performance.

Using Decision Trees with scikit-learn

Decision Trees can be used for two Machine Learning problems:

  • Classification
  • Regression

In Python, you can use the scikit-learn library to build and train decision trees: the DecisionTreeClassifier class for classification and the DecisionTreeRegressor class for regression.

Classification with Decision Trees with scikit-learn

Here is an example of how you can use scikit-learn to create and train a Decision Tree Classifier using the Breast Cancer Wisconsin dataset.

Step 1: Import the necessary libraries

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In this step, we are importing the libraries needed to build and train the Decision Tree Classifier.

Step 2: Load the dataset and split the data

# Load the Breast Cancer Wisconsin dataset as an example
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here we are loading the Breast Cancer Wisconsin dataset using the load_breast_cancer() function and splitting the data into training and test sets using the train_test_split() function.

Step 3: Create and train the Decision Tree classifier

# Create the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier on the training set
clf.fit(X_train, y_train)

In this step, we are creating a DecisionTreeClassifier object and training it on the training set using the fit() method.

Step 4: Make predictions and calculate accuracy

# Make predictions on the test set
predictions = clf.predict(X_test)

# Calculate the accuracy of the predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

In this step, we are making predictions on the test set using the predict() method of the trained classifier and then calculating the accuracy of the predictions using the accuracy_score() function.

These steps combined form a complete example of how to use scikit-learn to create and train a Decision Tree Classifier for a classification problem. You can further customize the model using the Decision Tree Classifier hyperparameters to tailor it to your needs.
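
For example, the following sketch (reusing X_train, X_test, y_train, and y_test from Step 2) constrains the tree with a few common hyperparameters. The specific values are arbitrary choices for illustration and would normally be selected via cross-validation, e.g. with GridSearchCV.

# Constrain the tree to reduce overfitting; the values are illustrative, not tuned
tuned_clf = DecisionTreeClassifier(
    criterion="entropy",     # split-quality measure ("gini" is the default)
    max_depth=4,             # limit the depth of the tree
    min_samples_leaf=5,      # require at least 5 samples in each leaf
    random_state=42,
)
tuned_clf.fit(X_train, y_train)
print("Test accuracy:", tuned_clf.score(X_test, y_test))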

Decision Tree Regression with scikit-learn

Here is an example of how you can use scikit-learn to create and train a Decision Tree Regressor using the California Housing dataset:

Step 1: Import the necessary libraries

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In this step, we are importing the libraries needed to build and train the Decision Tree Regressor.

Step 2: Load the dataset and split the data

# Load the California Housing dataset as an example
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here we are loading the California Housing dataset using the fetch_california_housing() function and splitting the data into training and test sets using the train_test_split() function.

Step 3: Create and train the Decision Tree regressor

# Create the Decision Tree regressor
reg = DecisionTreeRegressor(random_state=42)

# Train the regressor on the training set
reg.fit(X_train, y_train)

In this step, we are creating a DecisionTreeRegressor object and training it on the training set using the fit() method.

Step 4: Make predictions and calculate the mean squared error

# Make predictions on the test set
predictions = reg.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)

In this step, we are making predictions on the test set using the trained regressor’s predict() method and then calculating the mean squared error of the predictions using the mean_squared_error() function.

These steps combined form a complete example of how to use scikit-learn to create and train a Decision Tree Regressor for a regression problem. You can further customize the model using the Decision Tree Regressor hyperparameters to tailor it to your needs.
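
As with the classifier, the regressor can be constrained through its hyperparameters. The sketch below (reusing reg and the train/test split from the previous steps) compares the test error of the unconstrained tree with a depth-limited one; max_depth=6 is an arbitrary value chosen for illustration.

# Compare the unconstrained regressor with a depth-limited one
shallow_reg = DecisionTreeRegressor(max_depth=6, random_state=42)  # illustrative depth limit
shallow_reg.fit(X_train, y_train)

mse_full = mean_squared_error(y_test, reg.predict(X_test))
mse_shallow = mean_squared_error(y_test, shallow_reg.predict(X_test))
print("Full tree MSE:", mse_full)
print("Depth-limited tree MSE:", mse_shallow)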
