Supervised Learning in Machine Learning


Supervised Learning is a machine learning paradigm in which a model is trained on a labeled dataset. Each example in the training set consists of an input paired with its associated output, where the output is the “correct answer”, or label, provided by the supervisor.


The organization of data in Supervised Learning

The main goal of Supervised Learning is to learn a mapping between inputs and outputs so that the model can make accurate predictions on new data for which the correct output is not known.

  • Inputs (Features): Inputs, also known as features or independent variables, represent the information the model uses to make predictions. For example, in a house price prediction problem, features might include the number of bedrooms, square footage, location, and so on.
  • Outputs (Labels or Dependent Variables): The outputs are the correct answers associated with the inputs. For example, in the house price problem, the output would be the actual price of the house.
  • Training Set: This is the dataset used to train the model. Each example in the training set pairs an input with its correct label. The quality and representativeness of this set are crucial to the model’s performance.
  • Test Set: After training, the model is evaluated on a separate dataset called the test set, made up of examples the model has never seen. Its labels are withheld from the model and used only to measure how well the model generalizes to new data.

Suggested Book

If you are interested in Machine Learning with Python, I suggest you read this book:

Machine Learning with Python Cookbook

An example: let’s prepare a dataset with Python

Data preparation is a crucial phase in the Supervised Learning process. We’ll use a classification example with Python and the scikit-learn library to illustrate some of the most common data preparation tasks. In this example, imagine we are working with the Iris flower dataset, a very common classification dataset.

Suppose we have an initial dataset like the following:

import pandas as pd

# Creating a sample DataFrame
data = {'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
        'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
        'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
        'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2],
        'species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa']}

df = pd.DataFrame(data)

In this example, we are tackling a classification problem: we want to predict the species of the flower (the target variable “species”) from features such as the length and width of the sepals and petals.

Here are some common data preparation tasks in Python:

1. Data Exploration:

# Displays the first few rows of the DataFrame
print(df.head())

# Count the occurrences for each species
print(df['species'].value_counts())

# Statistical description of the DataFrame
print(df.describe())

Executing it, you get:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
species
setosa    5
Name: count, dtype: int64
       sepal_length  sepal_width  petal_length  petal_width
count      5.000000     5.000000      5.000000          5.0
mean       4.860000     3.280000      1.400000          0.2
std        0.207364     0.258844      0.070711          0.0
min        4.600000     3.000000      1.300000          0.2
25%        4.700000     3.100000      1.400000          0.2
50%        4.900000     3.200000      1.400000          0.2
75%        5.000000     3.500000      1.400000          0.2
max        5.100000     3.600000      1.500000          0.2

2. Transforming Labels into Numbers:

from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
le = LabelEncoder()

# Transform the 'species' variable into numbers
df['species_encoded'] = le.fit_transform(df['species'])
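
As a quick sanity check, you can inspect the mapping the encoder has learned. In this toy dataset there is only one class, so every row is encoded as 0:

# The classes learned by the encoder, ordered by their numeric codes
print(le.classes_)

# Compare the original labels with their encoded values
print(df[['species', 'species_encoded']])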

3. Splitting Data into Training and Test Sets:

from sklearn.model_selection import train_test_split

# Splits data into training and test sets
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species_encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
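
It is worth checking the result of the split. With only 5 rows and test_size=0.2, a single example ends up in the test set:

# Verify the sizes of the resulting sets
print(X_train.shape, X_test.shape)   # (4, 4) and (1, 4)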

4. Standardization or Normalization of Features:

from sklearn.preprocessing import StandardScaler

# Initialize the standardizer
scaler = StandardScaler()

# Standardize features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
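
Note that the scaler is fitted only on the training set and then applied to the test set; this avoids leaking information from the test data into the training process. You can verify the effect:

# Each standardized training feature has mean ~0; non-constant
# features also have standard deviation ~1
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))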

These are just a few examples of data preparation activities. Depending on your specific problem, you may face other challenges, such as dealing with missing values, creating new features, or handling outliers. The scikit-learn library provides many useful functions and tools to perform these tasks and prepare the data for model training.
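
For instance, missing values can be handled with scikit-learn’s SimpleImputer. Here is a minimal sketch, where the missing value is introduced artificially just for illustration:

import numpy as np
from sklearn.impute import SimpleImputer

# A copy of the features with an artificial missing value
X_missing = X.copy()
X_missing.iloc[0, 0] = np.nan

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)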

Regression and classification in supervised learning

In the context of supervised learning, where algorithms learn from labeled data, two key problems emerge: regression and classification.

Regression is geared towards predicting continuous values. Imagine trying to estimate the price of a house based on its characteristics, or predicting the output of a factory by considering input variables such as labor force and raw material supply. In this case, the output variables are continuous, and algorithms such as Linear Regression or Support Vector Regression can be used to model and predict these values.

On the other hand, classification focuses on predicting the category or class to which a given input belongs. A concrete example is the categorization of emails as spam or not spam, or the diagnosis of a disease based on specific symptoms. Here, the output variables are categorical or class-based, and algorithms such as Decision Trees or Support Vector Machines can be employed to correctly classify the data.

In both contexts, training the algorithm involves presenting it with labeled data, allowing it to learn the relationship between the inputs and the associated labels. The trained algorithm can then make predictions on new data, helping to solve complex problems across different industries. The choice between regression and classification depends on the nature of the problem and the type of output desired, with each approach offering a powerful and flexible framework for analyzing and predicting data.
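
To make the contrast concrete, here is a minimal sketch of both tasks; the data is synthetic and purely illustrative:

from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g. a price) from one feature
X_reg = [[50], [60], [80], [100]]   # e.g. square meters
y_reg = [150, 180, 240, 300]        # e.g. price in thousands
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[70]]))          # a continuous prediction

# Classification: predict a discrete class from one feature
X_clf = [[1], [2], [8], [9]]
y_clf = [0, 0, 1, 1]                # two class labels
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict([[7]]))           # a class label (0 or 1)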

Common Supervised Learning Algorithms

Supervised learning algorithms are well suited to solving classification problems, where the goal is to assign a label or class to an observation, and regression problems, where the goal is to predict a continuous numerical value. These algorithms all require labeled data to train the model. Each example in the training set must have an input associated with a known output or label.

  • Linear Regression: Used to predict continuous values, such as the price of a house.
  • Logistic Regression: Suitable for binary classification problems, such as predicting whether an email is spam or not spam.
  • Support Vector Machines (SVM): Used for classification or regression problems, trying to find the best hyperplane of separation between classes.
  • Decision Trees and Random Forests: Excellent for classification and regression, based on cascading decisions.
  • K-Nearest Neighbors (K-NN): Classifies an input according to the majority label among its nearest “neighbors”.

During the training phase, the model is exposed to a labeled dataset, and the algorithms try to optimize the model parameters so that the difference between the predictions and the correct outputs is minimized. Algorithms receive feedback on the quality of their predictions through a loss function or an evaluation criterion. The goal is to reduce this loss during the training process.
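
As a concrete illustration, here is a minimal sketch that trains one of the algorithms above (K-Nearest Neighbors) on the full Iris dataset bundled with scikit-learn and measures its accuracy on held-out data:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the full Iris dataset (150 labeled examples, 3 species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a K-NN classifier and evaluate it on the test set
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))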

Python Data Analytics

If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book:

Python Data Analytics 3rd Ed

Fabio Nelli

Phases of the Supervised Learning Process

The points we saw in the previous sections can be better understood by considering the phases of the supervised learning process.

  1. Data Collection: Acquisition and preparation of a representative and meaningful data set.
  2. Model Selection: Choose an appropriate Supervised Learning algorithm for the problem.
  3. Model Training: Use the training set to teach the model to make predictions.
  4. Model Evaluation: Test the model on a separate dataset to evaluate its performance.
  5. Prediction: Use the trained model to make predictions on new data.

All the algorithms listed above share a common requirement: a labeled dataset, both during the learning phase and during the testing phase. They also all follow the process steps listed above, as the sketch below illustrates.
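
Here is a minimal sketch that maps each of the five phases onto code, using a Logistic Regression classifier on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Data Collection: load a representative labeled dataset
X, y = load_iris(return_X_y=True)

# 2. Model Selection: choose an algorithm suited to the problem
model = LogisticRegression(max_iter=200)

# 3. Model Training: teach the model on the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# 4. Model Evaluation: test performance on a separate dataset
print(model.score(X_test, y_test))

# 5. Prediction: classify a new, unlabeled measurement
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))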

Supervised learning is used in a wide range of practical applications, such as image recognition, machine translation, medical diagnosis and much more. The key to its success lies in its ability to learn complex patterns and relationships in data, allowing computers to perform complex tasks with high predictive performance.
