how logistic regression works

April 1, 2025


sketching out logistic regression for my interview, because i need to get the fundamentals down.

the problem: we want to predict binary outcomes (0 or 1), but linear regression won't work because it predicts continuous, unbounded values.

how do we get a model that outputs probabilities between 0 and 1? and how can we make a decision boundary to produce binary outcomes?

the answer is a sigmoid function.

$p(x) = \frac{1}{1 + e^{-z}}$ where $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$

this bounds the output between 0 and 1, and we can create a decision boundary at $p(x) = 0.5$, which is where $z = 0$.
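a quick numpy sketch of the sigmoid and the 0.5 cutoff (the variable names are mine, nothing here comes from a library API):

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued z into (0, 1)
    return 1 / (1 + np.exp(-z))

# z = 0 lands exactly on p = 0.5; positive z gives p > 0.5, negative z gives p < 0.5
for z in [-5, -1, 0, 1, 5]:
    p = sigmoid(z)
    print(f"z = {z:>2}, p = {p:.3f}, predicted class = {int(p >= 0.5)}")
```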

but what are we modeling here? the probabilities?

the key insight is that logistic regression doesn't directly model probabilities in a linear way - it models the log-odds.

Why log-odds? Because:

  1. probability constraints: probability must be between 0 and 1, which isn't compatible with a linear model that produces unbounded values

  2. log-odds transformation: when we take $\log\left(\frac{p}{1-p}\right)$, we transform the bounded 0-1 range into an unbounded range ($-\infty$ to $+\infty$)

  3. linear relationship: This allows us to model log-odds as a linear function of features:

    log(p1p)=z=β0+β1X1+β2X2+...+βnXn\log\left(\frac{p}{1-p}\right) = z = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n

The magic happens in this transformation. Consider:

  • If $p = 0.5$, log-odds $= 0$
  • If $p > 0.5$, log-odds $> 0$
  • If $p < 0.5$, log-odds $< 0$
  • As $p$ approaches $1$, log-odds approaches $+\infty$
  • As $p$ approaches $0$, log-odds approaches $-\infty$
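sanity-checking those bullet points with a few numbers:

```python
import numpy as np

def log_odds(p):
    # the logit transform: maps (0, 1) onto the whole real line
    return np.log(p / (1 - p))

for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(f"p = {p:<5}  log-odds = {log_odds(p):+.3f}")
# 0.5 maps to 0, values above 0.5 map to positive numbers, values below to
# negative numbers, and the ends run off toward +/- infinity
```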

So we're essentially saying:

  1. we want to model probability pp
  2. but we can't directly use linear regression on pp (bounded)
  3. so we transform pp to log-odds (unbounded)
  4. model log-odds linearly
  5. transform back to probability using sigmoid
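here's a minimal round trip with made-up coefficients and features, just to show the mechanics (none of these numbers come from a fitted model):

```python
import numpy as np

beta = np.array([-1.0, 0.8, 0.5])   # made-up: beta_0 (intercept), beta_1, beta_2
x = np.array([1.0, 2.0, -1.0])      # made-up: [1 for the intercept, X_1, X_2]

z = beta @ x                         # step 4: linear model in log-odds space
p = 1 / (1 + np.exp(-z))             # step 5: sigmoid back to a probability

print(z)                             # 0.1  (the log-odds)
print(p)                             # ~0.525
print(np.log(p / (1 - p)))           # ~0.1, logit recovers z: steps 3 and 5 are inverses
```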

this is why the coefficients in logistic regression represent changes in log-odds, and we can exponentiate them ($e^{\beta}$) to get odds ratios.
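for example (made-up number): if $\beta_1 = 0.7$, then $e^{0.7} \approx 2.01$, meaning a one-unit increase in $X_1$ roughly doubles the odds of the positive class, holding the other features fixed.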

but how do we estimate the coefficients ($\beta_0, \beta_1, ..., \beta_n$) that maximize the probability of observing our training data?

to do that, we need maximum likelihood estimation (MLE).

we use MLE to find the coefficients that make our observed data most likely:

first we need the likelihood function. to calculate it, we take, for each data point, the probability the model assigns to its actual outcome, and multiply them all together.

for binary classification, the likelihood is

$L(\beta) = \prod p(x)^y \cdot (1-p(x))^{(1-y)}$

where $y$ is the true label (0 or 1)

to make optimization easier, we take the log, which turns the product into a sum. this is the log-likelihood:

$\log(L(\beta)) = \sum \left[ y \cdot \log(p(x)) + (1-y) \cdot \log(1-p(x)) \right]$
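a tiny sketch of both forms on invented labels and predicted probabilities, just to see the product become a sum:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # made-up true labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4])    # made-up model probabilities p(x)

# likelihood: product over points of the probability assigned to the actual outcome
likelihood = np.prod(p**y * (1 - p)**(1 - y))

# log-likelihood: same quantity, products become sums
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood, np.exp(log_likelihood))   # both ~0.181, as expected
```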

and unlike linear regression, which has a closed-form solution, logistic regression has no closed form, so we fit the coefficients with iterative methods like gradient descent (equivalently, gradient ascent on the log-likelihood).

the goal is to find $\beta$ values that maximize this log-likelihood, essentially finding the most probable model given the data.
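and a bare-bones gradient ascent loop to tie it all together. the toy data, learning rate, and iteration count are arbitrary choices on my part, not anything canonical:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: one feature, classes roughly split around x = 2
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

X_b = np.hstack([np.ones((len(X), 1)), X])   # prepend a column of 1s for beta_0
beta = np.zeros(X_b.shape[1])

lr = 0.1
for _ in range(5000):
    p = sigmoid(X_b @ beta)
    # gradient of the log-likelihood w.r.t. beta is X^T (y - p);
    # stepping along it is gradient ascent on the log-likelihood
    beta += lr * X_b.T @ (y - p) / len(y)

print("coefficients:", beta)
print("fitted probabilities:", sigmoid(X_b @ beta).round(3))
```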


more resources