Logistic regression is a type of regression model used for binary classification problems, where the goal is to predict which of two classes an instance belongs to. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities that vary between 0 and 1. This is achieved by using a logistic (or sigmoid) function to transform the linear output into probabilities.
Logistic Regression
Logistic regression has an interesting history and was developed primarily to address binary classification problems. The use of the logistic function to model the probability of class membership was proposed by Joseph Berkson in 1944. During the 1950s and 1960s, logistic regression was widely used in epidemiology to model the incidence of diseases based on risk factors, and it was further developed and popularized in statistical analysis, especially in medical and biological applications.
The Logistic Function (Sigmoid):
The logistic function is defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}

where z is a real-valued input (in logistic regression, the linear combination of the independent variables).

The logistic function maps any real value into the range between 0 and 1, producing a characteristic “S” shape. This function is critical in logistic regression because it converts the linear model output into probabilities.
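As a minimal sketch (plain NumPy, independent of any library’s API), the sigmoid can be implemented and evaluated directly:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# The characteristic "S" shape: large negative inputs approach 0,
# large positive inputs approach 1, and sigmoid(0) equals 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approximately [0.0067, 0.5, 0.9933]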
The Logistic Regression Model:
Logistic regression models the probability of class membership by applying the logistic function to a linear combination of the independent variables:

P(y = 1 \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}

Where:

P(y = 1 \mid x) is the conditional probability that the instance belongs to class 1 given the vector of independent variables x. w are the weights associated with the independent variables. b is the bias term.

The probability of belonging to class 0 is simply P(y = 0 \mid x) = 1 - P(y = 1 \mid x).
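As a quick illustration, the model probability can be computed directly (the weights and bias here are arbitrary values chosen for the example, not fitted parameters):

import numpy as np

def predict_proba(x, w, b):
    # P(y=1|x) = sigmoid(w . x + b)
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])  # hypothetical weights
b = 0.1                    # hypothetical bias
x = np.array([2.0, 1.5])   # a single instance with two features
p1 = predict_proba(x, w, b)
print(p1, 1.0 - p1)        # P(y=1|x) and P(y=0|x)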
Model training:
Training the model involves estimating the weights w and the bias b, typically by maximizing the likelihood of the observed data, which is equivalent to minimizing the binary cross-entropy (log loss). Since there is no closed-form solution, iterative optimizers such as gradient descent are used.
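As a minimal sketch of this idea, here is a plain batch gradient descent on the log loss (for illustration only; scikit-learn’s LogisticRegression uses more sophisticated solvers by default):

import numpy as np

def train_logistic(X, y, lr=0.1, epochs=1000):
    # Fit weights w and bias b by batch gradient descent on the log loss
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y=1|x) for all samples
        grad_w = X.T @ (p - y) / n_samples      # gradient of the log loss w.r.t. w
        grad_b = np.mean(p - y)                 # gradient of the log loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b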
Validation and Testing
After training, the model is validated and tested on unseen data to evaluate how well its performance generalizes.
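A common way to estimate generalization in scikit-learn is k-fold cross-validation; the brief sketch below uses the same binarized iris data as the full example that follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
y = (y != 0).astype(int)  # binarize: 0 = setosa, 1 = non-setosa

# 5-fold cross-validation: each fold is held out once for validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())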
Extensions
Logistic regression can be extended for multiclass classification problems using techniques such as multinomial logistic regression.
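For instance, scikit-learn’s LogisticRegression can fit all three iris classes directly (a brief sketch; recent versions apply a multinomial/softmax formulation by default):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # all three classes this time
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy over the three classes
print(clf.predict_proba(X_test[:1]))  # one probability per class, summing to 1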
In summary, logistic regression is an effective and interpretable model for binary classification, used in a variety of domains such as medicine, finance, and sentiment analysis.
If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book.
Fabio Nelli
Logistic Regression Example with Python
Let’s work through a logistic regression example using Python and the scikit-learn module. We will use the iris dataset provided by scikit-learn, a classification dataset with three classes of flowers; here we binarize it into two classes (setosa vs. non-setosa) to obtain a binary problem.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # take only the first two features (sepal length and width) for plotting
y = (iris.target != 0).astype(int) # binarize: 0 = setosa, 1 = non-setosa
# Divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features (important for logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Build and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
# View the decision boundary (in standardized feature space,
# since the model was trained on standardized features)
X_std = scaler.transform(X)
x_min, x_max = X_std[:, 0].min() - 1, X_std[:, 0].max() + 1
y_min, y_max = X_std[:, 1].min() - 1, X_std[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision regions and the data points
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)
plt.scatter(X_std[:, 0], X_std[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdYlBu)
plt.xlabel('Sepal Length (standardized)')
plt.ylabel('Sepal Width (standardized)')
plt.title('Logistic Regression Decision Boundary')
plt.show()
In this example, the logistic regression model is trained on a binarized version of the iris dataset (setosa vs. non-setosa) and the decision boundary is displayed over the first two features (sepal length and width). Model accuracy and other performance evaluation metrics are computed. By running the code you obtain the Decision Boundary:
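It can also be instructive to inspect the predicted probabilities themselves, since probabilities are what the model produces before thresholding (an optional addition, continuing the example above):

# Predicted probabilities for the first five test samples:
# column 0 = P(y=0|x) (setosa), column 1 = P(y=1|x) (non-setosa)
probabilities = model.predict_proba(X_test[:5])
print(probabilities)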
To view the model evaluation metrics, you can add the following code:
import seaborn as sns
# View model performance
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Setosa (0)', 'Non-setosa (1)'], yticklabels=['Setosa (0)', 'Non-setosa (1)'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print('Classification Report:')
print(classification_rep)
Executing the code, you get the following result: