sketching out logistic regression for my interview, because i need to get the fundamentals down.
the problem is we want to predict binary outcomes (0, 1), but we can't do that with linear regression, which predicts continuous values.
how do we get a model that outputs probabilities between 0 and 1? and how can we make a decision boundary to produce binary outcomes?
the answer is a sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$.

this bounds the output between 0 and 1, and we can create a decision boundary at $\sigma(z) = 0.5$, which is where $z = 0$.
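a minimal numpy sketch of this (the sample inputs and the 0.5 threshold are just illustrative):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z)
print(probs)  # ≈ [0.007, 0.269, 0.5, 0.731, 0.993]

# decision boundary: predict 1 when sigmoid(z) >= 0.5, i.e. when z >= 0
preds = (probs >= 0.5).astype(int)
print(preds)  # [0 0 1 1 1]
```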
but what are we modeling here? the probabilities?
the key insight is that logistic regression doesn't directly model probabilities in a linear way - it models the log-odds.
Why log-odds? Because:

- probability constraints: probability must be between 0 and 1, which isn't compatible with linear modeling (that produces unbounded values)
- log-odds transformation: when we take $\log\frac{p}{1-p}$, we transform the bounded 0-1 range into an unbounded range ($-\infty$ to $+\infty$)
- linear relationship: this allows us to model log-odds as a linear function of features: $\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
The magic happens in this transformation. Consider:
- If $p = 0.5$, log-odds $= \log\frac{0.5}{0.5} = 0$
- If $p = 0.9$, log-odds $= \log\frac{0.9}{0.1} \approx 2.2$
- If $p = 0.1$, log-odds $= \log\frac{0.1}{0.9} \approx -2.2$
- As $p$ approaches $1$, log-odds approaches $+\infty$
- As $p$ approaches $0$, log-odds approaches $-\infty$
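a quick numeric check of this transformation (the probability values here are just illustrative):

```python
import numpy as np

def log_odds(p):
    # logit: maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))

for p in [0.001, 0.1, 0.25, 0.5, 0.75, 0.9, 0.999]:
    print(f"p = {p:5}: log-odds = {log_odds(p):+.2f}")
# p near 0 -> large negative, p = 0.5 -> 0, p near 1 -> large positive
```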
So we're essentially saying:
- we want to model the probability $p$
- but we can't directly use linear regression on $p$ (it's bounded)
- so we transform $p$ to log-odds (unbounded)
- model log-odds linearly
- transform back to probability using sigmoid
this is why the coefficients in logistic regression represent changes in log-odds, and we can exponentiate them ($e^{\beta_i}$) to get odds ratios.
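a sketch of this interpretation, assuming scikit-learn and synthetic data (the feature, seed, and true coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: one feature whose true log-odds coefficient is 2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))
y = (rng.random(500) < p_true).astype(int)

model = LogisticRegression().fit(X, y)
beta = model.coef_[0][0]
print(f"coefficient (log-odds change per unit of x): {beta:.2f}")
print(f"odds ratio (e^beta): {np.exp(beta):.2f}")
# a one-unit increase in x multiplies the odds of y=1 by roughly e^beta
```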
but how do we estimate the coefficients ($\beta$) that maximize the probability of observing our training data?
to do that, we use maximum likelihood estimation (MLE) to find the coefficients that make our observed data most likely:
first we need the likelihood function. to calculate it, we take the probability the model assigns to each data point's actual outcome.
for binary classification, the likelihood is

$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

where $y_i$ is the true label (0 or 1) and $p_i = \sigma(\beta^T x_i)$ is the predicted probability.
to make optimization easier, we take the log, which converts the multiplication into addition. this is known as the log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
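a minimal sketch of computing this in numpy (the clipping and the example probabilities are my own additions, the clipping just keeps the log finite):

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    # sum over data points of y*log(p) + (1-y)*log(1-p)
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p_good = np.array([0.9, 0.1, 0.8, 0.7])  # confident and mostly right
p_bad = np.array([0.5, 0.5, 0.5, 0.5])   # uninformative
print(log_likelihood(y, p_good))  # ≈ -0.79, higher (closer to 0) is better
print(log_likelihood(y, p_bad))   # ≈ -2.77
```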
the goal is to find the $\beta$ values that maximize this log-likelihood, essentially finding the most probable model given the data. unlike linear regression, there's no closed-form solution here, so logistic regression uses iterative methods like gradient descent.
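a rough gradient-ascent sketch of that fitting loop, using the fact that the gradient of the log-likelihood is $X^T(y - p)$ (the learning rate, iteration count, and synthetic data are illustrative assumptions, not a tuned implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # gradient ascent on the log-likelihood; gradient is X^T (y - p)
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)               # current predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)  # step along the average gradient
    return beta

# tiny sanity check on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = (rng.random(1000) < sigmoid(0.5 + 2.0 * X[:, 0])).astype(int)
print(fit_logistic(X, y))  # should land near [0.5, 2.0]
```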
more resources