Non-Gaussian/normal error terms
Asymptotically valid inference under non-normality
The bootstrap
In the first part of this course, we
Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models
Discussed diagnostics to assess whether these assumptions seem reasonable
Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?
What if the true trend doesn't seem linear? (Coming soon)
What happens if the data is actually heteroskedastic? (Lecture 14)
What if normality seems violated? (Today)
Consider our model for heteroskedasticity:
\begin{align*} y = X\beta + \varepsilon;\;\; E(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}where $\varepsilon_1,\ldots,\varepsilon_n$ are independent, and $W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals).
Under this model, is the OLS estimate $\hat{\beta}$ unbiased for $\beta$?
If we know $W$, should we still use ordinary least squares to estimate $\beta$? If not, what estimator should we use instead?
What was the justification for using this estimator instead of ordinary least squares?
If we don't know $W$, can we come up with new standard errors for $\hat{\beta}_{OLS}$ that allow us to conduct inference? What were these called?
We'll now assess the impact of departures from normality on inference
If $\varepsilon$ isn't actually normally distributed, how do my hypothesis tests and confidence intervals perform?
Do the methods for inference perform better as the sample size increases?
Is there an alternative approach to inference when normality is violated?
Under the stronger linear model,
\begin{align*} y = X\beta + \varepsilon;\;\; \varepsilon \sim MVN(0, \sigma^2_\varepsilon I), \end{align*}we obtained the result that
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} &\sim t_{n-p-1}, \end{align*}where $\hat{\beta}_j$ is the OLS slope coefficient, and $\text{se}(\hat{\beta}_j)$ is the conventional
standard error estimate returned by R.
What happens when $\varepsilon$ is not multivariate normal?
When $\varepsilon$ isn't normally distributed, we can't guarantee that $\hat{\beta}_j$ is normally distributed for all sample sizes $n$
Hypothesis tests using the $t_{n-p-1}$ to compute $p$-values can have Type I error rates larger than or smaller than $\alpha$
Confidence intervals using quantiles from a $t_{n-p-1}$ might have actual coverage that is larger than or smaller than $100(1-\alpha)\%$.
As long as $n$ is large, $\widehat{\beta}_j$ turns out to be approximately normally distributed.
Recall the central limit theorem from your introductory statistics class: Suppose $y_1,\ldots,y_n$ are $iid$ with $E(y_i) = \mu$, $\text{Var}(y_i) = \sigma^2 < \infty$. Then, regardless of the distribution of $y_i$, as $n\rightarrow \infty$
\begin{align*} \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} &\overset{d}{\rightarrow} N(0, 1). \end{align*}Consider our model for heteroskedasticity:
\begin{align*} y = X\beta + \varepsilon;\;\; E(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}where $\varepsilon_1,\ldots,\varepsilon_n$ are independent, and $W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals). Let $\hat{\beta}$ be the OLS slope coefficients, and recall that
\begin{align*} \text{Var}(\hat{\beta}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1} \end{align*}We will now consider the distribution of
\begin{align*} (1)\; \frac{\hat{\beta}_j - \beta_j}{\text{SD}(\hat{\beta}_j)};\;\; (2)\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)};\;\; (3)\; \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}, \end{align*}where $\text{SD}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}$, $\text{se}(\cdot)$ is the conventional
standard error returned by R, and $\text{se}_{HC2}(\cdot)$ is the heteroskedasticity-consistent
standard error.
Asymptotic Normality
Under our model for heteroskedasticity and under suitable conditions on $\varepsilon$ and $X$, as $n\rightarrow \infty$
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{SD}(\hat{\beta}_j)} &\overset{d}{\rightarrow} N(0,1);\;\;\;\;\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \overset{d}{\rightarrow} N(0,1) \end{align*}Asymptotic Normality
If homoskedasticity holds $(W=I)$, then we further have that as $n\rightarrow \infty$
\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} &\overset{d}{\rightarrow} N(0,1) \end{align*}Under heteroskedasticity, $\frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}$ will still be asymptotically normally distributed with mean zero, but with a variance that does not equal 1.
Because $\hat{\beta}_j$ is asymptotically normal, confidence intervals and hypothesis tests using the $t$ distribution will also be asymptotically valid as $n\rightarrow \infty$
Under homoskedasticity: we can use the usual standard errors from R
Under heteroskedasticity: need to replace the usual standard errors with $HC$ standard errors.
In practice: for $n$ reasonably large, methods for inference using the $t$ will be roughly correct so long as we use a valid standard error.
Confidence intervals will have coverage close to $100(1-\alpha)\%$ for $n$ "large enough"
Hypothesis tests will have Type I error rate close to $\alpha$ for $n$ "large enough"
How large is "large enough" depends on distribution of $\varepsilon$. More skewness, need larger $n$. Symmetric, smaller $n$ required.
While asymptotic normality provides an assurance that things won't go too far off the rails when $n$ is large enough, we generally don't know how suitable the approximation will be for any particular data set.
Since we don't know the distribution of $\varepsilon$, we can't use simulation to assess how close to normal the distribution of $\hat{\beta}$ will be
Is there a better way to proceed? Can we try to use the data itself to approximate the distribution of $t_{stat}$, rather than relying on the central limit theorem?
Yes, through a method called the bootstrap.
We'll now build intuition for the bootstrap through a simple example without covariates (we will consider inference for the population mean). Suppose that $y_i$, $i=1,\ldots,50$, are $iid$ draws from the following discrete distribution
| $c$ | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| $P(y_i = c)$ | 0.75 | 0.15 | 0.05 | 0.03 | 0.01 | 0.01 |
| $F_y(c)$ | 0.75 | 0.90 | 0.95 | 0.98 | 0.99 | 1 |
$F_y(c)$ is the cumulative distribution function, $F_y(c) = P(y_i \leq c)$
$\text{E}[y_i] = \text{E}[\bar{y}] = \mu = 0.43$
Issue: right-skewness in $y_i$ affects the distribution of $\bar{y}$ for even moderate $n$
For regression with an intercept only, we base our confidence interval on multipliers from the $t_{n-1}$ distribution. The validity of these intervals hinges upon the fact that, if $\bar{y}$ is "close" to normally distributed:
\begin{align*} P\left(t_{0.025, n-1} \leq \frac{\bar{y} - \mu}{sd/\sqrt{n}} \leq t_{0.975, n-1}\right) &\approx 0.95 \end{align*}(Of course, this generalizes from 0.05 to any $\alpha$). Re-arranging:
\begin{align*} P\left(\bar{y} - t_{0.975, n-1}\frac{sd(y)}{\sqrt{n}} \leq \mu \leq \bar{y} - t_{0.025, n-1}\frac{sd(y)}{\sqrt{n}}\right) &\approx 0.95 \end{align*}This (and recognizing that $t_{0.025, n-1} = -t_{0.975, n-1}$ by symmetry of the $t$ distribution about zero) is where the form of our confidence interval comes from!
When we make a $t_{n-1}$-interval, we're saying that the $t_{n-1}$ can provide approximations to the 0.025 and 0.975 quantiles for $t_{stat} = (\bar{y} - \mu)/(sd/\sqrt{n})$. How well does this work here?
What if we could replace our intervals based on the $t$ distribution which were of the form
\begin{align*} \left[\bar{y} - t_{0.975, n-1}\frac{sd(y)}{\sqrt{n}}, \bar{y} - t_{0.025, n-1}\frac{sd(y)}{\sqrt{n}}\right] \end{align*}with intervals based on the actual distribution of $t_{stat}$ in the example at hand:
\begin{align*} \left[\bar{y} - q_{0.975}(t_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}where $q_{0.975}(t_{stat})$ and $q_{0.025}(t_{stat})$ are quantiles from the true distribution of $t_{stat}$ across data sets? What would the coverage be?
If I had access to the true distribution for $y_i$, $F_y$, I could construct a confidence interval by doing the following $B$ times. For $b=1,\ldots,B$:
Take $n$ $iid$ samples from the distribution of $y_i$, $F_y$. Store the resulting sample as $\{y_1^*, y_2^*,\ldots,y_n^*\}$.
Find $\bar{y}^*$ and $sd(y^*)$, the mean and standard deviation of $\{y_1^*, y_2^*,\ldots,y_n^*\}$
Form $t^*_{stat,b} = \frac{\bar{y}^* - \mu}{sd(y^*)/\sqrt{n}}$. Store this value!
After this is done, find the quantiles $q_{.975}(t^*_{stat})$ and $q_{.025}(t^*_{stat})$ based on our $B$ simulated values, and use these to form confidence intervals.
The coverage would be 0.95! Let's see this in R...
(Note: this is called a Monte-Carlo approximation to the quantiles of $t_{stat}$)
While these intervals would be great, these aren't actionable since we don't know the true distribution of $t_{stat}$. Why? We don't know the population distribution for $y_i$!
If we knew the true distribution of $y_i$, we'd know $\mu$ as a consequence, so there'd be no point in constructing these intervals to begin with
We don't know the true distribution for $y_i$. Can we estimate it?
In our example: what's our best guess for $P(Y\leq 0)$ based on our data set of size $n$?
$\text{Prop}(Y\leq 0)$! The proportion of zeroes in our data set
Same for $P(Y\leq 1), \ldots, P(Y\leq 5)$
In practice, $F_y(c)$, the true distribution function, is unknown to us. What if we replace it using the proportions in our data?
| $c$ | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| $\hat{P}(y_i=c)$ | prop.table(table(yobs)) |
|||||
Red line: $P(y_i=c)$. Barplot: $\hat{P}(y_i=c)$
The empirical distribution function $\hat{F}_y$ based upon a sample $y = \{y_1,\ldots,y_n\}$ evaluated at a point $c$ counts how many observations were at or below that value $c$:
\begin{align*} \hat{F}_y(c) &= \text{Prop}(y_i \leq c)\\ &= \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{y_i \leq c\} \end{align*}If $y_1,\ldots,y_n$ are $iid$ from a distribution with a cumulative distribution function $F_y$, then $\hat{F}_y$ estimates $F_y$.
\begin{align*} E(\hat{F}_y(c)) &= F_y(c) = P(y_i \leq c),\\ \hat{F}_y(c) &\overset{p}{\rightarrow} F_y(c) \text{ as } n\rightarrow \infty \end{align*}Our approach: let's use $\hat{F}_y$ as our best guess for $F_y$.
A bootstrap sample from a data set is an $iid$ sample of size $n$ from $\hat{F}_y$, the empirical distribution of $y$.
We treat the observed distribution in our data set as though it were the true distribution, and we take an $iid$ sample from this distribution.
For any type of variable (continuous or discrete), I can draw an $iid$ sample of size $n$ from $\hat{F}_y$,
the empirical distribution of a vector y, by running:
yboot = sample(y, n, replace = T)
That is, sampling $n$ times with replacement from a vector $y$ gives an $iid$ sample from the empirical distribution of $y$
Let $\{y_1^*, y_2^*,\ldots,y_n^*\}$ denote the resulting $iid$ sample of size $n$ from $\hat{F}_y$.
Suppose it were the case that the true distribution function were $\hat{F}_y$, and suppose I took an $iid$ sample, $y_1^*, y_2^*,\ldots,y_n^*$, from a distribution with CDF $\hat{F}_y$. Then, given $y_1,\ldots,y_n$,
$\text{E}[y_i^*] = \text{E}[\bar{y}^*] = \bar{y}$ (the observed sample mean in my original sample $\{y_1,\ldots,y_n\}$)
Sample statistics from our observed sample become population parameters when bootstrapping.
Across bootstrap resamples...
$\hat{F}_y$ becomes the true population distribution for $y_i^*$
Any population parameter within the bootstrap framework is derived from $\hat{F}_y$.
In real life, in my single sample $y_1,\ldots,y_n$
$\hat{F}_y$ estimates $F_y$
$\bar{y}$ estimates $\mu$
$sd(y)$ estimates $\sigma$
If I then take a bootstrap sample of $y_1,\ldots,y_n$, for any single bootstrap sample $y_1^*,\ldots,y_n^*$
The empirical distribution of $y_i^*$, $\hat{F}_{y^*}$, estimates $\hat{F}_y$
$\bar{y}^*$ estimates $\bar{y}$
$sd(y^*)$ estimates $sd(y)$
Importantly
Can I generate multiple data sets with distribution $F_y$? NO
Can I generate multiple data sets with distribution $\hat{F}_y$? YES
For any bootstrap $y_1^*,\ldots,y_n^*$, I could form the value:
\begin{align*} t_{stat}^* &= \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}} \end{align*}where $sd(y^*)$ is the sample standard deviation of $y_1^*,\ldots,y_n^*$.
Average of the observed bootstrap sample, minus the expectation across bootstrap samples from $\{y_1,\ldots,y_n\}$, over the standard error based on the bootstrap resamples.
The benefit: unlike $F_y$, I have access to $\hat{F}_y$: it's defined based on my sample!
I can actually generate multiple data sets according to the distribution $\hat{F}_y$, just by running
sample(x, n, replace = T) multiple times.
I can then actually calculate the quantiles for $t_{stat}^*$
I wanted to form intervals based on the actual distribution of $t_{stat}$ in the example at hand:
\begin{align*} \left[\bar{y} - q_{0.975}(t_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}where $q_{0.975}(t_{stat})$ and $q_{0.025}(t_{stat})$ were quantiles from the true distribution of $t_{stat}$.
Well, we can't do that. But what if we formed intervals of the form:
\begin{align*} \left[\bar{y} - q_{0.975}(t^*_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t^*_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}where $q_{0.025}(t^*_{stat})$ and $q_{0.975}(t^*_{stat})$ are the 0.025 and 0.975 quantiles based on $t^*_{stat}$.
These are our estimates for the true quantiles found by sampling from $\hat{F}_y$ instead of $F_y$.
For a given data set, $\{y_1,\ldots,y_n\}$, with observed mean $\bar{y}$, I do the following $B$ times. For $b=1,\ldots,B$:
Sample $n$ times with replacement from the values $\{y_1,y_2,\ldots,y_n\}$. Store the resulting sample as $\{y_1^*, y_2^*,\ldots,y_n^*\}$.
Find $\bar{y}^*$ and $sd(y^*)$, the mean and standard deviation of $\{y_1^*, y_2^*,\ldots,y_n^*\}$
Form $t^*_{stat,b} = \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}}$. Store this value!
After this is done, find the quantiles $q_{.975}(t^*_{stat})$ and $q_{.025}(t^*_{stat})$ based on our $B$ simulated values, and use these to form confidence intervals.
We are hence trying to approximate the distribution of:
\begin{align*} t_{stat} &= \frac{\bar{y} - \mu}{sd(y)/\sqrt{n}} \end{align*}by that of:
\begin{align*} t^*_{stat} &= \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}}, \end{align*}instead of simply asserting that it is $t$-distributed, like we did before. Let's see how well it works in our example through simulation...
To pull oneself up by one's bootstraps means "to succeed only on one's own efforts and abilities"
Rather than employing quantiles based on the $t$-distribution, we use simulation to try to estimate these quantiles!
We proceed based only on our efforts and abilities - using evidence presented in the data at hand, rather than postulating that the distribution of $t_{stat}$ is "hopefully $t$"
Method developed by Prof. Brad Efron at Stanford. Published in 1979 in a paper entitled "The Bootstrap: Another Look at the Jackknife."
Can be shown to work under very mild conditions
Works with continuous or discrete variables
Can also be used to create confidence intervals for quantities other than the population mean (for example, the population median, or the population slope)