This article introduces Logistic Regression in the Machine Learning world, along with its importance and applications.
Also, let’s take a look at Machine Learning: Types of Machine Learning.
Logistic Regression is another Machine Learning method for predicting an output. It is mainly used for classification tasks. It determines the relationship between a dependent binary variable and one or more independent variables, and is a standard tool in predictive analytics.
A binary variable is categorical and takes only two values: Yes or No, True or False, Right or Wrong, On or Off, etc.
Logistic Regression, also called the Logit Model, returns the probability of an event occurring, such as Yes or No, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable outcome is always between 0 and 1.
In Logistic Regression, the logit transformation is applied to the odds: the probability of success divided by the probability of failure. This is referred to as the log odds, and the logistic regression model is expressed by these formulas:
pi = 1/(1 + exp(-(Beta_0 + Beta_1*X_1 + … + Beta_k*X_k)))

Logit(pi) = ln(pi/(1-pi)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k
Logit(pi) is the log odds of the dependent or response variable in this logistic regression equation, while the X's are the independent variables. The Beta values, the coefficients of the model, are typically calculated using maximum likelihood estimation (MLE).
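The two formulas above are inverses of each other: the logistic (sigmoid) function turns log odds into a probability, and the logit function turns a probability back into log odds. A minimal NumPy sketch (function names are my own, not from the article):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps log odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Logit function: maps a probability p to log odds ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = logit(0.8)       # log odds of a 0.8 probability, about 1.386
print(sigmoid(z))    # recovers 0.8: the two functions are inverses
```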
The method tests different beta values over several iterations to find the best fit for the log odds. Each iteration evaluates the log-likelihood function, and Logistic Regression tries to optimize this function to discover the best parameter estimates. Once the most effective coefficient (or coefficients, if there are several independent variables) is identified, the conditional probability for each observation can be calculated, logged, and summed to give the overall log-likelihood of the data.
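The iterative fitting described above can be sketched with plain gradient ascent on the log-likelihood. This is an illustration only, on synthetic data; production libraries typically use Newton-type solvers rather than this simple loop:

```python
import numpy as np

# Synthetic binary data: y depends on x plus some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(float)

Xb = np.hstack([np.ones((200, 1)), X])     # add an intercept column
beta = np.zeros(2)                         # [intercept, slope], start at 0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ beta))   # current predicted probabilities
    # Gradient of the log-likelihood w.r.t. beta is X^T (y - p); take a
    # small step uphill each iteration.
    beta += 0.1 * Xb.T @ (y - p) / len(y)

print(beta)  # the slope should come out clearly positive
```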
In the case of binary classification, a probability lower than 0.5 is predicted to be 0, and any probability of 0.5 or higher is predicted to be 1. Once the model has been fitted, it is a good idea to determine how well it can predict the dependent variable, which is known as goodness of fit.
Why Logistic Regression rather than Linear Regression?
The most basic difference between Linear Regression and Logistic Regression is that Linear Regression uses a numeric dependent variable, whereas Logistic Regression has a binary (yes/no) dependent variable.
Key reasons why Logistic Regression is preferred over Linear Regression when dealing with categorical outcomes:
- Binary Classification: Logistic Regression is specifically designed for binary classification tasks, where the goal is to predict whether an instance belongs to one of two classes (e.g., yes or no, true or false). It models the probability of the binary outcome using a logistic function, also known as the sigmoid function, which maps the continuous input to a probability value between 0 and 1.
- Probability Interpretation: Logistic Regression provides probability estimates directly interpreted as the likelihood of an instance belonging to a particular class. This is useful in scenarios where we must make decisions based on the probability of an event occurring rather than a simple yes/no prediction.
- Non-Linear Relationship: In Logistic Regression, the relationship between the predictors and the probability of the outcome is non-linear (S-shaped), even though the model is linear in the log odds. Linear Regression, on the other hand, assumes a linear relationship between the predictors and the outcome itself, which may not be appropriate for categorical outcomes.
- Robustness to Outliers: Logistic Regression is generally more robust to outliers than linear Regression. Outliers in the continuous outcome variable can heavily influence the estimated coefficients in linear Regression, leading to biased predictions. Logistic Regression, focusing on class probabilities, is less affected by extreme values.
- Log Odds Interpretation: Logistic Regression provides coefficients representing the log odds or logit of the outcome variable. This log odds interpretation allows for understanding the direction and magnitude of the effects of predictors on the probability of the outcome. In linear Regression, coefficients directly represent the change in the mean of the continuous outcome for a unit change in the predictor.
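The log odds interpretation in the last point can be made concrete: exponentiating a fitted coefficient gives an odds ratio, the multiplicative change in the odds of the outcome per unit increase in that predictor. A hedged sketch on synthetic data (the true coefficient of 1.2 is chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate data whose true log odds are 0.5 + 1.2 * x.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 1))
log_odds = 0.5 + 1.2 * X[:, 0]
y = rng.random(500) < 1 / (1 + np.exp(-log_odds))  # Bernoulli outcomes

model = LogisticRegression().fit(X, y)
coef = model.coef_[0, 0]
# exp(coefficient) is the odds ratio: each unit increase in x multiplies
# the odds of y = 1 by roughly exp(1.2), about 3.3, for this data.
print(np.exp(coef))
```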
Linear Regression vs Logistic Regression
Linear Regression can be used to understand the relationship between a continuous dependent variable and one or more independent variables. When there is just one independent variable, it is called simple linear regression; with more than one independent variable, it is termed multiple linear regression. Every type of linear regression attempts to draw a line of best fit through the data points, usually determined by the least squares method.
Like Linear Regression, Logistic Regression can also assess the relationship between a dependent variable and several independent variables. However, it is used to predict a categorical variable rather than a continuous one; a categorical value can be true or false, 1 or 0. The unit of measurement also differs from Linear Regression, because Logistic Regression produces a probability; the logit function then transforms the S-shaped curve into a straight line.
Both models are employed when regression analysis is used to make predictions, but Linear Regression is usually simpler to comprehend. It also doesn't require as large a sample size as Logistic Regression, which needs enough observations to represent responses across all categories. With a bigger, more proportionate sample, the model is more likely to be statistically powerful enough to detect a significant effect.
Types of Logistic Regression
Logistic Regression comes in three types, based on the nature of the categorical dependent variable:
- Binary Logistic Regression: The dependent variable is dichotomous, or binary (0, 1), meaning it has only two possible outcomes, such as yes or no. E.g., whether a movie will start late (yes/no), whether a person has a disease (yes/no), or whether an email is spam or not.
- Multinomial Logistic Regression: The dependent variable has three or more possible outcomes with no natural order. The model relates each possible outcome to a reference category, e.g., predicting a car's colour (red, blue, black) based on its shape and design.
- Ordinal Logistic Regression: The dependent variable has three or more possible outcomes with a natural order or sequence. It finds the relationship between the possible outcomes, e.g., predicting the severity of a disease (mild, moderate, severe) based on various predictors.
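The binary and multinomial cases above can both be handled by scikit-learn's `LogisticRegression`, which fits a multinomial model when the target has three or more classes. A small sketch with three made-up, unordered severity labels on synthetic clusters (ordinal regression, which respects the class order, needs a dedicated model and is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three well-separated synthetic clusters, one per class label.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 3, 6)])
y = np.repeat(["mild", "moderate", "severe"], 50)  # three unordered labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)                    # the three class labels, sorted
print(clf.predict([[0, 0], [6, 6]]))  # points near a cluster get its label
```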
Applications of Logistic Regression
- Disease Diagnosis: Predicting whether a patient has a specific disease based on symptoms, medical history, or test results.
- Fraud Detection: Identifying fraudulent transactions or activities based on patterns and characteristics.
- Credit Scoring: Predicting the likelihood of default or credit risk for loan applicants.
- Product Preference Analysis: Predicting customer preferences for specific products or features.
- Education Research: Predicting factors influencing student outcomes, such as graduation or dropout rates.
- Disease Risk Factors: Analyzing the influence of various factors on the risk of developing a disease or condition.
Logistic Regression is a powerful statistical technique widely used for binary classification and probability estimation tasks. It is specifically designed for situations where the outcome variable is categorical or binary, taking on one of two classes. This article presented a comprehensive overview of Logistic Regression, its underlying principles, practical considerations, and real-world applications.