Logistic regression as maximum Bernoulli likelihood

This post supplements the classification slides. Please see the slides for the setup.

The Bernoulli regression model is

\[\def\Ber{\mathrm{Ber}} Y\mid X=x \sim \Ber(s(\beta_0 + \beta^\top x)),\]

where \(s(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function, and \(\beta_0\in\reals\) and \(\beta\in\reals^d\) are the model parameters. In other words, the PMF of the output label (conditional on the input) is

\[\Pr\{Y=y\mid X=x\} = s(\beta_0 + \beta^\top x)^y(1- s(\beta_0 + \beta^\top x))^{1-y}.\]
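
As a quick sanity check, the PMF above can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the function names `sigmoid` and `label_pmf` and the numeric values of `x`, `beta0`, and `beta` are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """s(z) = 1 / (1 + exp(-z)), branched so that exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def label_pmf(y, x, beta0, beta):
    """Pr{Y = y | X = x} for y in {0, 1} under the Bernoulli regression model."""
    p = sigmoid(beta0 + sum(b * xj for b, xj in zip(beta, x)))
    return p ** y * (1.0 - p) ** (1 - y)

# Hypothetical input and parameters, chosen only for illustration.
x, beta0, beta = [1.0, 2.0], 0.1, [0.5, -0.3]
p1 = label_pmf(1, x, beta0, beta)  # equals s(beta0 + beta^T x)
p0 = label_pmf(0, x, beta0, beta)  # equals 1 - s(beta0 + beta^T x)
```

The two values `p1` and `p0` sum to one, as a PMF on \(\{0,1\}\) must.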

We fit the model to training data by estimating the parameters with maximum likelihood.

First, we define the likelihood function, which takes candidate parameters as input and outputs the probability of observing the training data under the model with those parameters. The logistic regression model only describes the labels given the inputs, so we condition on the inputs: the probability of observing the labels in the training data \(\{(X_1,Y_1),\dots,(X_n,Y_n)\}\) under parameters \((\beta_0,\beta)\) is

\[\begin{aligned} L(\beta_0,\beta) &\triangleq \Pr\{Y_1,\dots,Y_n\mid X_1,\dots,X_n;\beta_0,\beta\} \\ &= \prod_{i=1}^n\Pr\{Y_i\mid X_i;\beta_0,\beta\} &\text{(samples are independent)} \\ &= \prod_{i=1}^ns(\beta_0 + \beta^\top X_i)^{Y_i}(1- s(\beta_0 + \beta^\top X_i))^{1-Y_i} &\text{(logistic regression model)}. \end{aligned}\]
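
To make the product concrete, here is a small sketch that evaluates the likelihood on a toy dataset. The data, the parameter values, and the function name `likelihood` are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(beta0, beta, X, Y):
    """L(beta0, beta): product over samples of s(z_i)^Y_i * (1 - s(z_i))^(1 - Y_i)."""
    L = 1.0
    for x, y in zip(X, Y):
        p = sigmoid(beta0 + sum(b * xj for b, xj in zip(beta, x)))
        # For y in {0, 1}, p^y * (1 - p)^(1 - y) is p when y = 1 and 1 - p when y = 0.
        L *= p if y == 1 else 1.0 - p
    return L

# Hypothetical toy data (not from the slides): one feature, four samples.
X = [[-2.0], [-1.0], [1.0], [2.0]]
Y = [0, 0, 1, 1]

good = likelihood(0.0, [1.0], X, Y)   # slope agrees with the labels
bad = likelihood(0.0, [-1.0], X, Y)   # slope has the wrong sign
```

On this toy data `good` is larger than `bad`, which is the sense in which the likelihood ranks parameters by how well they explain the labels.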

Second, we maximize the likelihood to find the parameters that best fit the data. In practice, we usually maximize the log of the likelihood (called the log-likelihood) or minimize the negative of the log-likelihood (called the negative log-likelihood). For logistic regression, the log-likelihood is

\[\begin{aligned} &\log L(\beta_0,\beta) \\ &\quad= \sum_{i=1}^n\left[Y_i\log s(\beta_0 + \beta^\top X_i) + (1-Y_i)\log(1- s(\beta_0 + \beta^\top X_i))\right] &\text{(properties of log)}\\ &\quad= \sum_{i=1}^n\left[{\textstyle Y_i\log\frac{s(\beta_0 + \beta^\top X_i)}{1-s(\beta_0 + \beta^\top X_i)}} + \log(1- s(\beta_0 + \beta^\top X_i))\right]. \end{aligned}\]
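
One practical reason to work with the log-likelihood is numerical: a product of many probabilities underflows in floating point, while the sum of their logs does not. The sketch below illustrates this on hypothetical data. The helper names `log_sigmoid` and `log_likelihood` are assumptions for the example, and it uses the identity \(1 - s(z) = s(-z)\).

```python
import math

def log_sigmoid(z):
    """log s(z) = -log(1 + exp(-z)), branched so that exp never overflows."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(beta0, beta, X, Y):
    """Sum over samples of Y_i log s(z_i) + (1 - Y_i) log(1 - s(z_i))."""
    total = 0.0
    for x, y in zip(X, Y):
        z = beta0 + sum(b * xj for b, xj in zip(beta, x))
        # 1 - s(z) = s(-z), so log(1 - s(z)) = log_sigmoid(-z).
        total += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return total

# Hypothetical data: 2000 samples; with all parameters zero, each factor is 1/2.
X = [[1.0]] * 2000
Y = [1] * 2000

product = 1.0
for _ in range(2000):
    product *= 0.5  # the likelihood itself underflows to 0.0

log_lik = log_likelihood(0.0, [0.0], X, Y)  # stays finite: 2000 * log(1/2)
```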

We leave it as an algebra exercise to check that

\[\textstyle \log\frac{s(\beta_0 + \beta^\top X_i)}{1-s(\beta_0 + \beta^\top X_i)} = \beta_0 + \beta^\top X_i.\]
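
For readers who want to check their work, here is one way to carry out the algebra. The shorthand \(z = \beta_0 + \beta^\top X_i\) is introduced here only for brevity.

\[\begin{aligned} 1 - s(z) &= 1 - \frac{1}{1+e^{-z}} = \frac{e^{-z}}{1+e^{-z}}, \\ \frac{s(z)}{1-s(z)} &= \frac{1}{1+e^{-z}}\cdot\frac{1+e^{-z}}{e^{-z}} = e^{z}, \\ \log\frac{s(z)}{1-s(z)} &= z. \end{aligned}\]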

This simplification leads us to the more familiar form of the logistic regression log-likelihood:

\[\log L(\beta_0,\beta) = \sum_{i=1}^n\left[Y_i(\beta_0 + \beta^\top X_i) + \log(1-s(\beta_0 + \beta^\top X_i))\right].\]
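
The familiar form is also convenient to compute. Since \(\log(1-s(z)) = -\log(1+e^{z})\), each term can be evaluated without forming the sigmoid at all. The sketch below compares the familiar form with the direct Bernoulli form. The toy data, parameter values, and function names are hypothetical.

```python
import math

def log1p_exp(z):
    """log(1 + exp(z)), branched so that exp never overflows."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def linear(beta0, beta, x):
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def log_likelihood_familiar(beta0, beta, X, Y):
    """Sum over samples of Y_i * z_i + log(1 - s(z_i)), using log(1 - s(z)) = -log(1 + exp(z))."""
    return sum(y * linear(beta0, beta, x) - log1p_exp(linear(beta0, beta, x))
               for x, y in zip(X, Y))

def log_likelihood_direct(beta0, beta, X, Y):
    """Sum over samples of Y_i log s(z_i) + (1 - Y_i) log(1 - s(z_i))."""
    total = 0.0
    for x, y in zip(X, Y):
        p = 1.0 / (1.0 + math.exp(-linear(beta0, beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical toy data and parameters, chosen only for illustration.
X = [[-2.0, 0.5], [-1.0, 1.5], [1.0, -0.5], [2.0, 1.0]]
Y = [0, 1, 0, 1]
familiar = log_likelihood_familiar(0.3, [0.8, -0.4], X, Y)
direct = log_likelihood_direct(0.3, [0.8, -0.4], X, Y)
```

The two values agree up to floating-point rounding, which is a numerical check of the simplification above.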

Posted on August 09, 2024 from Ann Arbor, MI.