Mathematics Behind Logistic Regression

What is Logistic Regression?

Logistic regression is a simple classification algorithm where the output or the dependent variable is categorical. For example:

  • To classify an email as spam or not spam
  • To predict whether a patient has cancer or not

Logistic regression uses a logistic function for this purpose, hence the name. The algorithm can be thought of as a regression problem even though it performs classification, because instead of just giving the class label, logistic regression can tell us the probability of a data point belonging to each of the classes. We will see the details in the coming sections.

Geometric Intuition

We all know how to train a Logistic Regression model and how to make predictions using the trained model. But what are the core mathematical concepts behind this algorithm? How is it trained? Let's understand the mathematics geometrically.

Consider the distribution of some data in a two-dimensional space. The data has two independent attributes, X1 and X2, so the position of each data point in this space depends on the values of X1 and X2. Suppose the data points belong to two classes, red and green, and a line is drawn that separates them. We can then say that this line is a model that can classify the data points as red or green. Logistic Regression is all about finding this optimum line or plane that separates, or classifies, the data points into different classes. This also means the data should be almost linearly separable for a line or plane to classify it well.

How do we find this line given a set of points? And given this line, how do we get the class corresponding to each data point without visualizing it?
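Before working through the math, here is a minimal sketch of what these two questions look like in code. It uses scikit-learn on randomly generated two-class data (all values are made up for illustration), and the coefficient names a, b, c match the notation introduced below.

```python
# A quick preview: fit a line to 2-D two-class data and use it for prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two roughly linearly separable clouds: "red" (class 0) and "green" (class 1).
X_red = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))
X_green = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X = np.vstack([X_red, X_green])      # columns are X1 and X2
y = np.array([0] * 50 + [1] * 50)    # class labels

model = LogisticRegression().fit(X, y)

# The learned separating line is a*x1 + b*x2 + c = 0.
a, b = model.coef_[0]
c = model.intercept_[0]
print("line coefficients:", a, b, c)

# Class and probability for a new point, without any visualization.
point = np.array([[1.5, 1.0]])
print("predicted class:", model.predict(point)[0])
print("P(class 1):", model.predict_proba(point)[0, 1])
```

The rest of this post explains, step by step, what a call like fit is doing under the hood.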

Predictions from the model

For the time being, let's assume we already have the perfect line separating, or classifying, a set of data points. Now, how do we get the class corresponding to each data point?

Let's set up some notation for this line.

Let’s take the equation of this line as y=ax1+bx2+c where c is the intercept. Let’s ignore c for now and the equation becomes y=ax1+bx2.

In this equation, (a, b) can be thought of as a vector perpendicular (normal) to the line y. We consider that the positive-class points lie on the side of the line toward which this normal vector points. Now, given any point x at (x1, x2), the signed distance of this point from the line will be

d = (a, b) . (x1, x2) / ||(a, b)||

Here (a, b) . (x1, x2) is the dot product between the data point and the normal vector, and ||(a, b)|| is the L2 norm of the normal vector, i.e., its Euclidean length (the distance of its tip from the origin). So if (x1, x2) lies on the side toward which (a, b) points, the dot product is positive, d is positive, and (x1, x2) is assigned to class 1, the positive class.

Consider a point x = (x1, x2) lying on the side toward which (a, b) points. Its distance d1 from the line is measured in the direction of (a, b), so the dot product and d1 are positive and the point is assigned the positive class. Similarly, for a point x' on the opposite side, the distance d2 from the line has a negative sign, so x' belongs to the negative class.
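As a small illustration, this signed-distance computation is just a dot product and a norm in NumPy; the normal vector (a, b) and the two points below are arbitrary made-up values.

```python
import numpy as np

normal = np.array([1.0, 2.0])          # a hypothetical normal vector (a, b)
x = np.array([2.0, 1.0])               # a point on the side the normal points toward
x_prime = np.array([-1.0, -3.0])       # a point on the opposite side

def signed_distance(point, normal):
    # d = (a, b) . (x1, x2) / ||(a, b)||
    return np.dot(normal, point) / np.linalg.norm(normal)

d1 = signed_distance(x, normal)        # positive -> positive class
d2 = signed_distance(x_prime, normal)  # negative -> negative class
print(d1, d2)
```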

Now, instead of just giving the class labels as output, we can give a probabilistic interpretation of the output. For that, we apply a sigmoid function to this signed distance. The sigmoid is an "S"-shaped function and is also known as the Logistic Function.

To get the probabilistic output, we simply take the sigmoid of the signed distance. The output of the sigmoid function can be interpreted as the probability of the point belonging to class 1, the positive class. If d is positive, sigmoid(d) is greater than 0.5, so the probability of the point belonging to the positive class is greater than 0.5. If d is negative, sigmoid(d) is less than 0.5, so the probability of the point belonging to class 1 is low.
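Here is a short sketch of the sigmoid (logistic) function and of how it maps a signed distance to a probability; the distance values are arbitrary examples.

```python
import numpy as np

def sigmoid(d):
    # The logistic function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-d))

print(sigmoid(0.0))    # 0.5   -> the point lies exactly on the line
print(sigmoid(2.5))    # > 0.5 -> likely the positive class (class 1)
print(sigmoid(-2.5))   # < 0.5 -> likely the negative class (class 0)
```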

In conclusion, we use the logistic function to predict the probability of the output belonging to the positive class. That is why this algorithm is called Regression even though it is a classification algorithm.

Finding the Optimum Separator

How to find the optimum line or plane separating the classes?

Let's assume that the line separating the classes is y = ax1 + bx2; here the data is two-dimensional. If the data is n-dimensional, the hyperplane separating the classes will be y = a1x1 + a2x2 + a3x3 + … + anxn. For simplicity, we will stick with the two-dimensional case.
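In code, the n-dimensional case is still just a dot product between the coefficient vector and the data point; the 5-dimensional values below are arbitrary.

```python
import numpy as np

a = np.array([0.5, -1.2, 0.3, 2.0, -0.7])   # arbitrary coefficients a1..a5
x = np.array([1.0, 0.4, -2.0, 0.5, 1.5])    # an arbitrary data point x1..x5

y = np.dot(a, x)    # y = a1*x1 + a2*x2 + ... + a5*x5
print(y)
```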

To get the optimum line, we need to find the optimum coefficients. Or in other words, we need to find the perpendicular vector for the best line. How do we find it?

Our aim is to classify as many data points correctly as possible. Consider a data point x. If the actual output y_true corresponding to x is 1, then our model's output should be 1; if we are using the probability score p as the output, then p must be close to 1. If y_true is 0, the probability p predicted by the model should be close to 0.

In other words, we need to maximize p for class 1 and (1 - p) for class 0.

Substituting the values of probability, for a single data point we can write the quantity to maximize as

p^(y_true) * (1 - p)^(1 - y_true)

and, for the whole dataset, as the product of this quantity over all n data points (the likelihood).

So our aim is to find the coefficients that maximize this score (or, equivalently, minimize its negative). This is done using the gradient descent algorithm.

One practical problem with optimizing this score directly is that it is a product of many probabilities, each between 0 and 1, so it becomes numerically tiny and is awkward to work with during gradient descent. To avoid this problem, we apply a monotonic function to our score function: the log function. Because the log is monotonic, it does not change where the optimum lies; it only turns the product into a sum, and the resulting objective (after negation) is convex, so gradient descent does not get stuck in a bad local minimum. The modified quantity to maximize becomes

sum over all n points of [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

Negating this and averaging over the n data points, we can write the final quantity to minimize as

J = -(1/n) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

where p_i is the predicted probability of the i-th data point belonging to class 1, the positive class. This is the equation for log loss in a two-class classification problem, and the optimum coefficients are the ones that minimize it.
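Below is a minimal sketch, in plain NumPy, of computing this log loss and minimizing it with simple full-batch gradient descent. The synthetic data, learning rate, and iteration count are illustrative assumptions rather than the post's exact procedure, and the intercept c is ignored, as in the earlier simplification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    # J = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) )
    eps = 1e-12                      # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up, roughly separable 2-D data (columns are x1 and x2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)                      # the coefficients (a, b) we want to learn
lr = 0.1                             # learning rate (an illustrative choice)

for _ in range(1000):
    p = sigmoid(X @ w)               # predicted probabilities for all points
    grad = X.T @ (p - y) / len(y)    # gradient of the log loss w.r.t. (a, b)
    w -= lr * grad                   # one gradient descent step

print("learned coefficients (a, b):", w)
print("final log loss:", log_loss(y, sigmoid(X @ w)))
```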

Now, after getting the optimum coefficients (a*, b*), we can find the output for a data point (x1, x2) in terms of probability as

p = sigmoid( (a*, b*) . (x1, x2) )

where (a*, b*) . (x1, x2) is the signed distance of x measured from the line (up to the positive scaling factor ||(a*, b*)||, which does not change the sign). Now, if the value of p is greater than 0.5, we classify the point into class 1; otherwise, into class 0.
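To close the loop, here is a tiny sketch of this prediction step; the coefficients (a*, b*) below are hypothetical stand-ins for whatever values training produced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w_opt = np.array([1.2, 0.8])            # hypothetical learned coefficients (a*, b*)
new_point = np.array([1.5, 1.0])        # a made-up query point (x1, x2)

p = sigmoid(np.dot(w_opt, new_point))   # probability of the positive class
predicted_class = 1 if p > 0.5 else 0
print(p, predicted_class)
```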

I hope this post helped you gain a basic understanding of the mathematics behind logistic regression. You can connect with me on LinkedIn for any queries.
