Modeling probabilities
Maximum likelihood
Logistic regression
Osteoporosis is a bone disease leading to brittle bones.
Bones lose calcium.
Older women are more likely to have it than other demographics.
Lack of heavy lifting is a risk factor.
It is more prevalent in underweight people than overweight people.
Imagine a patient, an older woman, comes into the medical office.
X-ray is typically definitive.
Of course, X-ray is also expensive.
If the woman has a low risk for osteoporosis, maybe it is not worth it.
Let's run a simplified cost analysis.

Two possible states:
Patient has osteoporosis or does not have osteoporosis.
Two actions:
Send for X-ray or not.
Imagine the following values for the net cost of a given action:
| No X-ray | Yes X-ray | $P(\cdot)$ | |
|---|---|---|---|
| Okay | 0 | $400 | $1-p$ |
| Osteo | $3600 | 0 | $p$ |
| Expected Value | $3600$p$ | $400$(1-p)$ |
Rule: send for X-ray if $400(1-p)\leq 3600p \Rightarrow \mathbf{p \geq 1/10}$.
Our data set contains $n=864$ patients and 6 covariates.
Response: whether the individual has osteoporosis.
Covariates: various risk factors.
The covariates include:
Weight
Age
Height
Years taking estrogen supplement
Calcium supplement?
Recent bone fracture?
We randomly split into a training set and a test set. We will now look at the training set of size $n_{\text{train}} = 604$.
For the moment, suppose we only had access to whether or not an individual in the training set has osteoporosis, with no other information.
Suppose our data set could be modeled as $n$ i.i.d. random variables.
What random variable would we use to model whether or not the $i$-th individual has osteoporosis?
Do we know the parameter that governs this random variable?
How would we estimate it based on our training set?
How would we conduct inference for it?
From the training data:
228 individuals have osteoporosis.
376 individuals do not have osteoporosis.
For now, we will ignore the cost analysis and focus on correctly classifying whether or not an individual has osteoporosis.
If I have no other information beyond the fact that the person is from the population of interest, what should I predict in order to maximize accuracy, the percentage of correct classifications?
Baseline model: always predict that a person does not have osteoporosis.
Correct on all 165 individuals in the test set without osteoporosis.
Incorrect on 95 individuals in the test set with osteoporosis.
Obtains 63% accuracy in the test set with no real work.
But we have all sorts of other information.
We would certainly hope to do better in predicting osteoporosis based on other covariates.
Let's return to the cost analysis. The decision rule was to send for X-ray if $400(1-p)\leq 3600p \Rightarrow \mathbf{p \geq 1/10}$.
From my training set, $\hat{p} = 0.39$.
If I have no other information for each individual, what should my decision be? Send everyone to get an X-ray.
In the test set, 165 out of 260 individuals do not have osteoporosis.
Average cost of $(165/260)400 = 253.84$.
Again, might we be able to use covariate information to reduce costs?
Our first instinct might be to fit a regression model
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_p x_{ip} \]Multiple regression attempts to model the conditional expectation of $y_i$ given $x_i$.
If linearity held, then
\[ E(y_i \mid x_i) = \beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip} \]What does the expectation of a binary variable equal?
\begin{align*} E(y_i \mid x_i) &= 1 \times P(y_i=1 \mid x_i) + 0 \times P(y_i=0 \mid x_i) \\ &= P(y_i=1 \mid x_i) \end{align*}In the binary-response setting, the multiple regression model would say that
\[ p(x_i) = P(Y_i=1 \mid x_i) = \beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip} \]That is, the probability of success is linear in the covariates.
The regression line returned by R would then be providing estimates for the probability of
success:
To visualize, let's consider a simple regression example with a binary response.
The y variable here is an indicator of whether or not an individual in the data set has
osteoporosis, while x is the individual's weight.
The line, believe it or not, is the line of best fit.
Even if the probabilities were linear in the covariates, there are other issues when using linear regression to fit binary outcomes. The stronger linear model cannot hold.
$\varepsilon_i$ terms cannot be normally distributed if the outcome is binary. That would imply continuous responses.
If $y$ is Bernoulli with probability of success given $x$ equal to $p(x)$, then
\[ \operatorname{var}(y \mid x) = p(x)\{1-p(x)\} \]Homoskedasticity also cannot hold. The variance of the responses varies with $x_i$.
While these issues could be remediated with HC standard errors and with the bootstrap, given the fundamental issues with linearity when predicting probabilities, we take a different approach.
Directly fitting a linear model to binary data fails to account for the constraint that $0\leq p(x)\leq 1$.
Instead of fitting a linear function for $p(x)$, we can fit a linear function of some transformation of $p(x)$ that is unbounded.
This is called finding a link function.
For regression with binary responses, by far the most common choice is the logit link:
\[ \operatorname{logit}(p(x)) = \log\left(\frac{p(x)}{1-p(x)}\right) \]$\operatorname{logit}(p(x))$ is also known as the log-odds, since $p(x)/(1-p(x))$ is the odds of observing a 1 given $x$.
We now imagine that a linear model holds on the logit scale:
\[ \log\left(\frac{p(x_i)}{1-p(x_i)}\right) = \beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip} \]Solving for $p(x_i)$ yields
\[ \textstyle p(x_i) = \frac{\exp\{\beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip}\}} {1+\exp\{\beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip}\}} \]$\beta_j$ are unknown parameters of the model, governing the true probability of a 1 given $x$.
Given our data set, we seek to estimate $\beta_j$.
Properties:
$0 < \operatorname{logistic}(\theta) < 1$
$\operatorname{logistic}(0)=\frac{1}{2}$
$\operatorname{logistic}(\theta)$ is an increasing function of $\theta$
Now let $\theta = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p$.
If $\beta_j > 0$, then the function is increasing in $x_j$.
If $\beta_j < 0$, then the function is decreasing in $x_j$.
We will use the observed data to estimate the parameters $\beta_0,\dots,\beta_p$. That is, we will use data to estimate
\[ \textstyle p(x) = \frac{\exp\{\beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip}\}} {1+\exp\{\beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip}\}} \]by means of
\[ \textstyle \hat{p}(x) = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_p x_{ip}\}} {1+\exp\{\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_p x_{ip}\}} \]The process of fitting the function $\hat{p}(x)$ is known as logistic regression.
glm(Osteo ~ Weight, family = "binomial", data = train)
Coefficients:
(Intercept) Weight
4.52637 -0.03183
\[
\log\left(\frac{\hat{p}(\mathtt{weight})}{1-\hat{p}(\mathtt{weight})}\right) = 4.52 - 0.0318 \times
\mathtt{weight}
\]
\[
\hat{p}(\mathtt{weight}) = \frac{\exp\{4.52 - 0.0318 \times \mathtt{weight}\}}
{1 + \exp\{4.52 - 0.0318 \times \mathtt{weight}\}}
\]
Here is a plot of the estimated $\hat{p}(x)$. Again, points are jittered to avoid overplotting.
How did R come up with this equation?
In linear regression, R finds the line that minimizes the in-sample SSE.
With binary data, that strategy does not respect the fact that we are predicting probabilities.
Instead, R comes up with this equation through maximum likelihood estimation.
Suppose I knew the values of $\beta_0,\beta_1,\dots,\beta_p$.
At any point $x_i = (1,x_{i1},\dots,x_{ip})^\top$, what is the probability of observing a one given $x_i$? What about a zero?
\[ \textstyle P(y_i=1 \mid x_i) = \frac{\exp\{x_i^\top\beta\}}{1+\exp\{x_i^\top\beta\}}, \qquad P(y_i=0 \mid x_i) = \frac{1}{1+\exp\{x_i^\top\beta\}} \]I can write this concisely as
\[ \textstyle p(y_i \mid x_i) = \left(\frac{\exp\{x_i^\top\beta\}}{1+\exp\{x_i^\top\beta\}}\right)^{y_i} \left(\frac{1}{1+\exp\{x_i^\top\beta\}}\right)^{1-y_i} \]Do you see why? Evaluate by setting $y_i=1$ and $y_i=0$.
In my data set, I have $n$ pairs $(x_1,y_1),\dots,(x_n,y_n)$. We assume independence between observations.
Given $x_1,x_2,\dots,x_n$, what is the probability of observing $y_1$ and $y_2$ and $\cdots$ and $y_n$?
\[ p(y_1 \cap y_2 \cap \cdots \cap y_n \mid x_1,\dots,x_n) = \prod_{i=1}^n p(y_i \mid x_i) \] \begin{align*} &p(y_1 \cap y_2 \cap \cdots \cap y_n \mid x_1,\dots,x_n) = \\ &\quad\prod_{i=1}^n \left(\frac{\exp\{x_i^\top\beta\}}{1+\exp\{x_i^\top\beta\}}\right)^{y_i} \left(\frac{1}{1+\exp\{x_i^\top\beta\}}\right)^{1-y_i} \end{align*}Plugging in the values of $y_i$ and $x_i$ from my data gives
\[ \textstyle P(y \mid x) = \prod_{i=1}^n \left(\frac{\exp\{x_i^\top\beta\}}{1+\exp\{x_i^\top\beta\}}\right)^{y_i} \left(\frac{1}{1+\exp\{x_i^\top\beta\}}\right)^{1-y_i} \]I know $\{y_i\}$ and $\{x_i\}$ from my data, but $\beta_0,\dots,\beta_p$ are unknown.
What values for $\beta = (\beta_0,\dots,\beta_p)^\top$ maximize the likelihood of observing the data?
\[ \textstyle \hat{\beta} = \underset{\beta}{\operatorname{arg\,max}} \; \prod_{i=1}^n \left(\frac{\exp\{x_i^\top\beta\}}{1+\exp\{x_i^\top\beta\}}\right)^{y_i} \left(\frac{1}{1+\exp\{x_i^\top\beta\}}\right)^{1-y_i} \]There is no closed-form solution, but we can solve it numerically.
The maximizers, often called the MLEs for maximum likelihood estimates, are
\[ \hat{\beta} = (\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_p)^\top \]Many familiar estimators can be motivated as maximum likelihood estimators under specific distributional assumptions.
If $y_1,\dots,y_n$ are i.i.d. Bernoulli with unknown probability of success $p$, then $\hat{p}$, the fraction of ones in the data, is the MLE.
If $y_1,\dots,y_n$ are i.i.d. Normal with unknown mean $\mu$ and unknown standard deviation $\sigma$, then the MLE for $\mu$ is $\bar{y}$.
Under the stronger linear model,
with $\varepsilon_i$ i.i.d. with mean zero and standard deviation $\sigma_\varepsilon$, the MLEs for $(\beta_0,\beta_1,\dots,\beta_p)$ are the OLS coefficients,
\[ (\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_p) = (X^\top X)^{-1}X^\top y \]Define $L(\beta \mid y, x)$ as
\[ \textstyle L(\beta \mid y, x) = \prod_{i=1}^n \left(\frac{\exp\{\beta_0+\sum_{j=1}^p\beta_jx_{ij}\}} {1+\exp\{\beta_0+\sum_{j=1}^p\beta_jx_{ij}\}}\right)^{y_i} \left(\frac{1}{1+\exp\{\beta_0+\sum_{j=1}^p\beta_jx_{ij}\}}\right)^{1-y_i} \]We find the coefficients $\{\hat{\beta}_j\}$ through the following optimization problem:
\[\textstyle \argmax_\beta\; L(\beta \mid y, x) \]Or, equivalently due to monotonicity of the logarithm,
\[\textstyle \argmin_\beta\; -\log\left(L(\beta \mid y, x)\right) \]In practice, most maximum likelihood estimators are actually found by minimizing the negative log likelihood.
Define $p_\beta(x_i)$ as what the probability would be given $x_i$ for a fixed value of $\beta$:
\[ \textstyle p_\beta(x_i) = \frac{\exp\{\beta_0+\sum_{j=1}^p\beta_jx_{ij}\}} {1+\exp\{\beta_0+\sum_{j=1}^p\beta_jx_{ij}\}} \]Taking the negative of the log likelihood,
\begin{align*} &-\log\left(L(\beta \mid y, x)\right) \\ &\textstyle\quad=\sum_{i=1}^n \{-y_i\log(p_\beta(x_i))-(1-y_i)\log(1-p_\beta(x_i))\} \end{align*}The log loss for a predicted probability $\theta_i$ and observed outcome $y_i$ is
\[ -y_i\log(\theta_i) - (1-y_i)\log(1-\theta_i) \]Useful to break this down into cases:
\[ \begin{cases} -\log(\theta_i) & y_i = 1\\ -\log(1-\theta_i) & y_i = 0 \end{cases} \]When the outcome is one, loss is smallest when $\theta_i = 1$ and tends to infinity as $\theta_i$ goes to zero.
When the outcome is zero, loss is smallest when $\theta_i = 0$ and tends to infinity as $\theta_i$ goes to one.

Linear regression and logistic regression both choose $\hat{\beta}$ by minimizing the sum of losses, but they differ in their choice of loss function.
Linear regression minimizes the sum of squared error losses.
Logistic regression minimizes the sum of log losses.
Here is the logistic regression fit for our osteoporosis data using all available covariates.
glm(Osteo ~ ., family = "binomial", data = train)
Coefficients:
(Intercept) Weight Age Height
2.29400 -0.03068 0.06857 -0.03749
Years.estrogen Calcium Fracture
-0.02088 -0.04248 0.96610
\[
\hat{p}(x) = \frac{\exp\{\hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j x_j\}}
{1+\exp\{\hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j x_j\}}
\]
\[
\log\left(\frac{\hat{p}(x)}{1-\hat{p}(x)}\right) = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j x_j
\]
Our interpretation for any of the estimated slope coefficients $\hat{\beta}_j$, $j=1,\dots,p$, from logistic regression depends upon the other variables within the model.
Ceteris Paribus
All else equal, two individuals who differ in variable $x_j$ by 1 unit are estimated to differ in their log-odds of success by $\hat{\beta}_j$, or equivalently by an odds ratio of $\exp\{\hat{\beta}_j\}$.
That is, $\hat{\beta}_j$ is the estimated difference in log-odds for two individuals who differ in covariate $x_j$ by 1 unit, but have the same values for all of the other predictor variables.
Unfortunately, the model is not linear on the probability scale.
The resulting difference in predicted probability depends on the particular values of the covariates.
At the very least, we can interpret the sign.
Positive $\hat{\beta}_j \Rightarrow$ all else fixed, higher values of $x_j$ have higher predicted probability.
Negative $\hat{\beta}_j \Rightarrow$ all else fixed, higher values of $x_j$ have lower predicted probability.
The summary table from logistic regression looks very similar to that of linear regression.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.294001 2.625111 0.874 0.382190
Weight -0.030680 0.004189 -7.324 2.40e-13 ***
Age 0.068569 0.011672 5.874 4.24e-09 ***
Height -0.037486 0.038888 -0.964 0.335066
Years.estrogen -0.020881 0.012125 -1.722 0.085050 .
Calcium -0.042481 0.268710 -0.158 0.874384
Fracture 0.966103 0.265801 3.635 0.000278 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic regression and linear regression are two members of a broader class of models known as generalized linear models.
This class involves
Distributional assumption on the outcome variable
Link function: a transformation $g$ of $E(y \mid x)$ that is modeled as linear in $x$
Examples:
The function glm allows us to fit generalized linear models.