Logistic Regression with Python

Logistic regression is a type of regression model used for binary classification problems, where the goal is to predict which of two classes an instance belongs to. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities that vary between 0 and 1. This is achieved by using a logistic (or sigmoid) function to transform the linear output into probabilities.

Logistic Regression

Logistic regression has an interesting history and was developed primarily to address binary classification problems. The concept of using the logistic function to model the probability of class membership was proposed by Joseph Berkson in 1944. During the 1950s and 1960s, the logistic regression approach was widely used in epidemiology to model the incidence of diseases based on risk variables. The method has been further developed and popularized in the context of statistical analyses, especially in medical and biological applications.

The Logistic Function (Sigmoid):

The logistic function is defined as:

$f(z) = \frac{1}{1 + e^{-z}}$

where $z$ is the linear combination of the weights $w$ and the independent variables $x$ (i.e. $z = w^Tx$, to which a bias term $b$ is usually added).

The logistic function maps any real value into the range between 0 and 1, producing a characteristic “S” shape. This function is critical in logistic regression because it converts the linear model output into a probability.
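As a quick illustration of the formula above, here is a minimal sketch of the logistic function in NumPy. The name `sigmoid` and the sample values in `z_values` are chosen here only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: large negative, zero, and large positive values
z_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probabilities = sigmoid(z_values)

for z, p in zip(z_values, probabilities):
    print(f"z = {z:+.1f} -> f(z) = {p:.4f}")
```

Negative inputs give values below 0.5, an input of 0 gives exactly 0.5, and positive inputs give values above 0.5, which is the “S” shape described above.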

The Logistic Regression Model:

Logistic regression models the probability of class membership using the logistic function. The relationship between the independent variables and the probability of belonging to a class is expressed as:

$P(Y=1|X) = \frac{1}{1 + e^{-(w^Tx + b)}}$

Where:

$P(Y=1|X)$ is the conditional probability that the instance belongs to class 1 given the vector of independent variables $X$.
$w$ is the vector of weights associated with the independent variables.
$b$ is the bias term.

The probability of belonging to class 0 is simply $1 - P(Y=1|X)$.
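To make the model equation concrete, the following sketch evaluates it for a single instance. The values of `w`, `b`, and `x` are hypothetical, picked only so the arithmetic is easy to follow. They do not come from a trained model.

```python
import numpy as np

# Hypothetical values chosen only for illustration
w = np.array([0.8, -0.4])   # weights, one per independent variable
b = -0.2                    # bias term
x = np.array([1.5, 2.0])    # a single instance with two features

z = w @ x + b                         # linear part: w^T x + b
p_class1 = 1.0 / (1.0 + np.exp(-z))   # P(Y=1|X)
p_class0 = 1.0 - p_class1             # P(Y=0|X)

# A common convention is to predict class 1 when P(Y=1|X) >= 0.5
predicted_class = int(p_class1 >= 0.5)

print(f"z = {z:.2f}")
print(f"P(Y=1|X) = {p_class1:.4f}, P(Y=0|X) = {p_class0:.4f}")
print(f"Predicted class: {predicted_class}")
```

Here $z$ is slightly positive, so the probability of class 1 is slightly above 0.5 and the instance is assigned to class 1.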

Model training:

Training the model involves estimating the weights $w$ and the bias term $b$ so that the model fits the training data. This can be done by maximizing the log-likelihood of the training data or, equivalently, by minimizing a cost function such as the log loss (cross-entropy).

Logistic regression determines a decision boundary that separates the classes in the space of independent variables. This boundary is determined by the weights and the bias of the model: it is the set of points where $w^Tx + b = 0$, that is, where the predicted probability equals 0.5. Logistic regression also offers an intuitive interpretation of its results. The weights indicate the relative contribution of each independent variable in determining the probability of belonging to a class, provided the variables are on comparable scales.
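The training idea can be sketched with plain gradient descent on the log loss. This is a minimal illustration, not what scikit-learn does internally (its solvers are more sophisticated and add regularization by default). The function names, the learning rate, the iteration count, and the synthetic two-cluster dataset are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Average negative log-likelihood (cross-entropy) of labels y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iterations=1000):
    """Estimate w and b by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iterations):
        p = sigmoid(X @ w + b)
        error = p - y                      # gradient of the loss w.r.t. z
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * error.mean()
    return w, b

# Synthetic data (assumption): two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),  # class 0
    rng.normal(loc=+1.0, scale=1.0, size=(50, 2)),  # class 1
])
y_toy = np.concatenate([np.zeros(50), np.ones(50)])

w, b = fit_logistic_gd(X_toy, y_toy)

initial_loss = log_loss(y_toy, sigmoid(np.zeros(len(y_toy))))  # w = 0, b = 0
final_loss = log_loss(y_toy, sigmoid(X_toy @ w + b))
accuracy = np.mean((sigmoid(X_toy @ w + b) >= 0.5) == y_toy)

print(f"Log loss before training: {initial_loss:.4f}")
print(f"Log loss after training:  {final_loss:.4f}")
print(f"Training accuracy: {accuracy:.2f}")
print(f"Weights: {w}, bias: {b:.4f}")
```

With all parameters at zero every instance gets probability 0.5, so the initial log loss is $\ln 2$. Training lowers it, and the learned `w` and `b` define the line $w^Tx + b = 0$ that separates the two clusters.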

Validation and Testing

After training, the model is evaluated on held-out data that it has not seen during training, in order to estimate how well it generalizes to new data.
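One common way to do this is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on a binary version of the iris dataset (setosa versus the rest, first two features only). The choice of 5 folds and of a pipeline with `StandardScaler` is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, :2]                 # sepal length and sepal width
y = (iris.target != 0).astype(int)   # 0 = setosa, 1 = any other species

# Putting the scaler inside the pipeline means it is refit on each
# training fold, so no information leaks from the validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.3f}")
```

Each fold is used once as validation data while the model is trained on the remaining folds, which gives a more stable estimate than a single train/test split.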

Extensions

Logistic regression can be extended for multiclass classification problems using techniques such as multinomial logistic regression.
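As a hedged sketch of the multiclass case, the snippet below fits `LogisticRegression` on all three iris classes. In recent scikit-learn versions the default solver fits a multinomial (softmax) model when there are more than two classes. Older versions may behave differently, so treat that as an assumption about the installed version. The split parameters and `max_iter=1000` are also choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Use all four features and all three classes this time
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# One probability per class for each test instance; each row sums to 1
proba = clf.predict_proba(X_test)
test_accuracy = clf.score(X_test, y_test)

print("Probability matrix shape:", proba.shape)
print("First test instance probabilities:", np.round(proba[0], 3))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Instead of a single probability for class 1, the model now returns one probability per class, and the predicted class is the one with the highest probability.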

In summary, logistic regression is an effective and interpretable model for binary classification, used in a variety of domains such as medicine, finance, and sentiment analysis.

If you want to delve deeper into the topic and discover more about the world of Data Science with Python, I recommend you read my book:

Python Data Analytics 3rd Ed

Fabio Nelli

Logistic Regression Example with Python

Let’s work through a logistic regression example using Python and the scikit-learn library. We will use iris, a sample dataset provided by scikit-learn that contains three classes of flowers. Since logistic regression is presented here as a binary classifier, the three classes are reduced to two: setosa versus the other two species.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Take only the first two features for display
y = (iris.target != 0).astype(int)  # Consider only two classes (0 or non-0)

# Divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (important for logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Build and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# View the decision boundary in the standardized feature space used for training
X_std = scaler.transform(X)  # apply the same scaling the model was trained with
x_min, x_max = X_std[:, 0].min() - 1, X_std[:, 0].max() + 1
y_min, y_max = X_std[:, 1].min() - 1, X_std[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# View the result
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)
plt.scatter(X_std[:, 0], X_std[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdYlBu)
plt.xlabel('Sepal Length (standardized)')
plt.ylabel('Sepal Width (standardized)')
plt.title('Logistic Regression Decision Boundary')
plt.show()

In this example, the logistic regression model is trained to distinguish setosa (class 0) from the other two species, versicolor and virginica (merged into class 1), and the decision boundary is displayed on the first two features (sepal length and width). Model accuracy and the other evaluation metrics are computed here and printed in the next snippet. Running the code produces the decision boundary plot:

To view the model evaluation metrics:

import seaborn as sns

# View model performance
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
# confusion_matrix orders rows and columns by sorted label: 0 (setosa) first, then 1 (non-0)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['0', 'Non-0'], yticklabels=['0', 'Non-0'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print('Classification Report:')
print(classification_rep)

Running this code produces the following result: