STATS 413: Lecture 16

Non-Gaussian/normal error terms
Asymptotically valid inference under non-normality
The bootstrap

Addressing Departures from Assumptions

In the first part of this course, we

Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models
Discussed diagnostics to assess whether these assumptions seem reasonable

Addressing Departures from Assumptions

Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?

What if the true trend doesn't seem linear? (Coming soon)
What happens if the data is actually heteroskedastic? (Lecture 14)
What if normality seems violated? (Today)

Reminder: Dealing with Heteroskedasticity

Consider our model for heteroskedasticity:

\begin{align*} y = X\beta + \varepsilon;\;\; E(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

where $\varepsilon_1,\ldots,\varepsilon_n$ are independent, and $W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals).

Reminder: Dealing with Heteroskedasticity

Under this model, is the OLS estimate $\hat{\beta}$ unbiased for $\beta$?
If we know $W$, should we still use ordinary least squares to estimate $\beta$? If not, what estimator should we use instead?
What was the justification for using this estimator instead of ordinary least squares?
If we don't know $W$, can we come up with new standard errors for $\hat{\beta}_{OLS}$ that allow us to conduct inference? What were these called?

Departures from Normality

We'll now assess the impact of departures from normality on inference

If $\varepsilon$ isn't actually normally distributed, how do my hypothesis tests and confidence intervals perform?
Do the methods for inference perform better as the sample size increases?
Is there an alternative approach to inference when normality is violated?

Inference under the Stronger Linear Model

Under the stronger linear model,

\begin{align*} y = X\beta + \varepsilon;\;\; \varepsilon \sim MVN(0, \sigma^2_\varepsilon I), \end{align*}

Inference under the Stronger Linear Model

we obtained the result that

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} &\sim t_{n-p-1}, \end{align*}

where $\hat{\beta}_j$ is the OLS slope coefficient, and $\text{se}(\hat{\beta}_j)$ is the conventional standard error estimate returned by R.

What happens when $\varepsilon$ is not multivariate normal?

Consequences of Non-Normality

When $\varepsilon$ isn't normally distributed, we can't guarantee that $\hat{\beta}_j$ is normally distributed for all sample sizes $n$

Hypothesis tests using the $t_{n-p-1}$ to compute $p$-values can have Type I error rates larger than or smaller than $\alpha$
Confidence intervals using quantiles from a $t_{n-p-1}$ might have actual coverage that is larger than or smaller than $100(1-\alpha)\%$.

Consequences of Non-Normality

As long as $n$ is large, $\widehat{\beta}_j$ turns out to be approximately normally distributed.

Recall the central limit theorem from your introductory statistics class: Suppose $y_1,\ldots,y_n$ are $iid$ with $E(y_i) = \mu$, $\text{Var}(y_i) = \sigma^2 < \infty$. Then, regardless of the distribution of $y_i$, as $n\rightarrow \infty$

\begin{align*} \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} &\overset{d}{\rightarrow} N(0, 1). \end{align*}

Large-Sample Distribution for Regression Coefficients

Consider our model for heteroskedasticity:

\begin{align*} y = X\beta + \varepsilon;\;\; E(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

where $\varepsilon_1,\ldots,\varepsilon_n$ are independent, and $W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals). Let $\hat{\beta}$ be the OLS slope coefficients, and recall that

\begin{align*} \text{Var}(\hat{\beta}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1} \end{align*}

Large-Sample Distribution for Regression Coefficients

We will now consider the distribution of

\begin{align*} (1)\; \frac{\hat{\beta}_j - \beta_j}{\text{SD}(\hat{\beta}_j)};\;\; (2)\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)};\;\; (3)\; \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}, \end{align*}

where $\text{SD}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}$, $\text{se}(\cdot)$ is the conventional standard error returned by R, and $\text{se}_{HC2}(\cdot)$ is the heteroskedasticity-consistent standard error.

Central Limit Theorems for Regression Coefficients

Asymptotic Normality

Under our model for heteroskedasticity and under suitable conditions on $\varepsilon$ and $X$, as $n\rightarrow \infty$

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{SD}(\hat{\beta}_j)} &\overset{d}{\rightarrow} N(0,1);\;\;\;\;\; \frac{\hat{\beta}_j - \beta_j}{\text{se}_{HC2}(\hat{\beta}_j)} \overset{d}{\rightarrow} N(0,1) \end{align*}

Central Limit Theorems for Regression Coefficients

Asymptotic Normality

If homoskedasticity holds $(W=I)$, then we further have that as $n\rightarrow \infty$

\begin{align*} \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} &\overset{d}{\rightarrow} N(0,1) \end{align*}

Central Limit Theorems for Regression Coefficients

Under heteroskedasticity, $\frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}$ will still be asymptotically normally distributed with mean zero, but with a variance that does not equal 1.

Practical Implications

Because $\hat{\beta}_j$ is asymptotically normal, confidence intervals and hypothesis tests using the $t$ distribution will also be asymptotically valid as $n\rightarrow \infty$

Under homoskedasticity: we can use the usual standard errors from R
Under heteroskedasticity: need to replace the usual standard errors with $HC$ standard errors.

Practical Implications

In practice: for $n$ reasonably large, methods for inference using the $t$ will be roughly correct so long as we use a valid standard error.

Confidence intervals will have coverage close to $100(1-\alpha)\%$ for $n$ "large enough"
Hypothesis tests will have Type I error rate close to $\alpha$ for $n$ "large enough"

How large is "large enough" depends on distribution of $\varepsilon$. More skewness, need larger $n$. Symmetric, smaller $n$ required.

Limitations of Central Limit Theorem

While asymptotic normality provides an assurance that things won't go too far off the rails when $n$ is large enough, we generally don't know how suitable the approximation will be for any particular data set.

Since we don't know the distribution of $\varepsilon$, we can't use simulation to assess how close to normal the distribution of $\hat{\beta}$ will be

Limitations of Central Limit Theorem

Is there a better way to proceed? Can we try to use the data itself to approximate the distribution of $t_{stat}$, rather than relying on the central limit theorem?

Yes, through a method called the bootstrap.

An Example without Covariates

We'll now build intuition for the bootstrap through a simple example without covariates (we will consider inference for the population mean). Suppose that $y_i$, $i=1,\ldots,50$, are $iid$ draws from the following discrete distribution

$c$	0	1	2	3	4	5
$P(y_i = c)$	0.75	0.15	0.05	0.03	0.01	0.01
$F_y(c)$	0.75	0.90	0.95	0.98	0.99	1

An Example without Covariates

$F_y(c)$ is the cumulative distribution function, $F_y(c) = P(y_i \leq c)$
$\text{E}[y_i] = \text{E}[\bar{y}] = \mu = 0.43$

Issue: right-skewness in $y_i$ affects the distribution of $\bar{y}$ for even moderate $n$

When Does the $t$ Interval Work?

For regression with an intercept only, we base our confidence interval on multipliers from the $t_{n-1}$ distribution. The validity of these intervals hinges upon the fact that, if $\bar{y}$ is "close" to normally distributed:

\begin{align*} P\left(t_{0.025, n-1} \leq \frac{\bar{y} - \mu}{sd/\sqrt{n}} \leq t_{0.975, n-1}\right) &\approx 0.95 \end{align*}

When Does the $t$ Interval Work?

(Of course, this generalizes from 0.05 to any $\alpha$). Re-arranging:

\begin{align*} P\left(\bar{y} - t_{0.975, n-1}\frac{sd(y)}{\sqrt{n}} \leq \mu \leq \bar{y} - t_{0.025, n-1}\frac{sd(y)}{\sqrt{n}}\right) &\approx 0.95 \end{align*}

This (and recognizing that $t_{0.025, n-1} = -t_{0.975, n-1}$ by symmetry of the $t$ distribution about zero) is where the form of our confidence interval comes from!

The Distribution of $t_{stat}$ in this Example

When we make a $t_{n-1}$-interval, we're saying that the $t_{n-1}$ can provide approximations to the 0.025 and 0.975 quantiles for $t_{stat} = (\bar{y} - \mu)/(sd/\sqrt{n})$. How well does this work here?

The Distribution of $t_{stat}$ in this Example

What Would be Ideal?

What if we could replace our intervals based on the $t$ distribution which were of the form

\begin{align*} \left[\bar{y} - t_{0.975, n-1}\frac{sd(y)}{\sqrt{n}}, \bar{y} - t_{0.025, n-1}\frac{sd(y)}{\sqrt{n}}\right] \end{align*}

What Would be Ideal?

with intervals based on the actual distribution of $t_{stat}$ in the example at hand:

\begin{align*} \left[\bar{y} - q_{0.975}(t_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}

where $q_{0.975}(t_{stat})$ and $q_{0.025}(t_{stat})$ are quantiles from the true distribution of $t_{stat}$ across data sets? What would the coverage be?

Ideal Confidence Intervals

If I had access to the true distribution for $y_i$, $F_y$, I could construct a confidence interval by doing the following $B$ times. For $b=1,\ldots,B$:

Take $n$ $iid$ samples from the distribution of $y_i$, $F_y$. Store the resulting sample as $\{y_1^*, y_2^*,\ldots,y_n^*\}$.
Find $\bar{y}^*$ and $sd(y^*)$, the mean and standard deviation of $\{y_1^*, y_2^*,\ldots,y_n^*\}$
Form $t^*_{stat,b} = \frac{\bar{y}^* - \mu}{sd(y^*)/\sqrt{n}}$. Store this value!

Ideal Confidence Intervals

After this is done, find the quantiles $q_{.975}(t^*_{stat})$ and $q_{.025}(t^*_{stat})$ based on our $B$ simulated values, and use these to form confidence intervals.

The coverage would be 0.95! Let's see this in R...

(Note: this is called a Monte-Carlo approximation to the quantiles of $t_{stat}$)

What's the Hold-Up?

While these intervals would be great, these aren't actionable since we don't know the true distribution of $t_{stat}$. Why? We don't know the population distribution for $y_i$!

If we knew the true distribution of $y_i$, we'd know $\mu$ as a consequence, so there'd be no point in constructing these intervals to begin with

What's the Hold-Up?

We don't know the true distribution for $y_i$. Can we estimate it?

In our example: what's our best guess for $P(Y\leq 0)$ based on our data set of size $n$?
$\text{Prop}(Y\leq 0)$! The proportion of zeroes in our data set
Same for $P(Y\leq 1), \ldots, P(Y\leq 5)$

The Plug-in Principle

In practice, $F_y(c)$, the true distribution function, is unknown to us. What if we replace it using the proportions in our data?

$c$	0	1	2	3	4	5
$\hat{P}(y_i=c)$	`prop.table(table(yobs))`

The Plug-in Principle

Red line: $P(y_i=c)$. Barplot: $\hat{P}(y_i=c)$

Empirical Distribution

The empirical distribution function $\hat{F}_y$ based upon a sample $y = \{y_1,\ldots,y_n\}$ evaluated at a point $c$ counts how many observations were at or below that value $c$:

\begin{align*} \hat{F}_y(c) &= \text{Prop}(y_i \leq c)\\ &= \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{y_i \leq c\} \end{align*}

Empirical Distribution

If $y_1,\ldots,y_n$ are $iid$ from a distribution with a cumulative distribution function $F_y$, then $\hat{F}_y$ estimates $F_y$.

\begin{align*} E(\hat{F}_y(c)) &= F_y(c) = P(y_i \leq c),\\ \hat{F}_y(c) &\overset{p}{\rightarrow} F_y(c) \text{ as } n\rightarrow \infty \end{align*}

Our approach: let's use $\hat{F}_y$ as our best guess for $F_y$.

Bootstrap Resampling

A bootstrap sample from a data set is an $iid$ sample of size $n$ from $\hat{F}_y$, the empirical distribution of $y$.

We treat the observed distribution in our data set as though it were the true distribution, and we take an $iid$ sample from this distribution.

Bootstrap Resampling

For any type of variable (continuous or discrete), I can draw an $iid$ sample of size $n$ from $\hat{F}_y$, the empirical distribution of a vector y, by running: yboot = sample(y, n, replace = T)

That is, sampling $n$ times with replacement from a vector $y$ gives an $iid$ sample from the empirical distribution of $y$

Let $\{y_1^*, y_2^*,\ldots,y_n^*\}$ denote the resulting $iid$ sample of size $n$ from $\hat{F}_y$.

Treating $\hat{F}_y$ Like ${F}_y$

Suppose it were the case that the true distribution function were $\hat{F}_y$, and suppose I took an $iid$ sample, $y_1^*, y_2^*,\ldots,y_n^*$, from a distribution with CDF $\hat{F}_y$. Then, given $y_1,\ldots,y_n$,

$\text{E}[y_i^*] = \text{E}[\bar{y}^*] = \bar{y}$ (the observed sample mean in my original sample $\{y_1,\ldots,y_n\}$)

Treating $\hat{F}_y$ Like ${F}_y$

Sample statistics from our observed sample become population parameters when bootstrapping.

Across bootstrap resamples...

$\hat{F}_y$ becomes the true population distribution for $y_i^*$
Any population parameter within the bootstrap framework is derived from $\hat{F}_y$.

Bootstrap Analogues

In real life, in my single sample $y_1,\ldots,y_n$

$\hat{F}_y$ estimates $F_y$
$\bar{y}$ estimates $\mu$
$sd(y)$ estimates $\sigma$

Bootstrap Analogues

If I then take a bootstrap sample of $y_1,\ldots,y_n$, for any single bootstrap sample $y_1^*,\ldots,y_n^*$

The empirical distribution of $y_i^*$, $\hat{F}_{y^*}$, estimates $\hat{F}_y$
$\bar{y}^*$ estimates $\bar{y}$
$sd(y^*)$ estimates $sd(y)$

Bootstrap Analogues

Importantly

Can I generate multiple data sets with distribution $F_y$? NO
Can I generate multiple data sets with distribution $\hat{F}_y$? YES

Estimating the Distribution of $t_{stat}$

For any bootstrap $y_1^*,\ldots,y_n^*$, I could form the value:

\begin{align*} t_{stat}^* &= \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}} \end{align*}

where $sd(y^*)$ is the sample standard deviation of $y_1^*,\ldots,y_n^*$.

Average of the observed bootstrap sample, minus the expectation across bootstrap samples from $\{y_1,\ldots,y_n\}$, over the standard error based on the bootstrap resamples.

Estimating the Distribution of $t_{stat}$

The benefit: unlike $F_y$, I have access to $\hat{F}_y$: it's defined based on my sample!

I can actually generate multiple data sets according to the distribution $\hat{F}_y$, just by running sample(x, n, replace = T) multiple times.
I can then actually calculate the quantiles for $t_{stat}^*$

Estimating the Quantiles of $t_{stat}$

I wanted to form intervals based on the actual distribution of $t_{stat}$ in the example at hand:

\begin{align*} \left[\bar{y} - q_{0.975}(t_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}

where $q_{0.975}(t_{stat})$ and $q_{0.025}(t_{stat})$ were quantiles from the true distribution of $t_{stat}$.

Estimating the Quantiles of $t_{stat}$

Well, we can't do that. But what if we formed intervals of the form:

\begin{align*} \left[\bar{y} - q_{0.975}(t^*_{stat})\frac{sd(y)}{\sqrt{n}}, \bar{y} - q_{0.025}(t^*_{stat})\frac{sd(y)}{\sqrt{n}}\right] \end{align*}

where $q_{0.025}(t^*_{stat})$ and $q_{0.975}(t^*_{stat})$ are the 0.025 and 0.975 quantiles based on $t^*_{stat}$.

These are our estimates for the true quantiles found by sampling from $\hat{F}_y$ instead of $F_y$.

The Bootstrap for $t$ Statistics

For a given data set, $\{y_1,\ldots,y_n\}$, with observed mean $\bar{y}$, I do the following $B$ times. For $b=1,\ldots,B$:

Sample $n$ times with replacement from the values $\{y_1,y_2,\ldots,y_n\}$. Store the resulting sample as $\{y_1^*, y_2^*,\ldots,y_n^*\}$.
Find $\bar{y}^*$ and $sd(y^*)$, the mean and standard deviation of $\{y_1^*, y_2^*,\ldots,y_n^*\}$
Form $t^*_{stat,b} = \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}}$. Store this value!

The Bootstrap for $t$ Statistics

After this is done, find the quantiles $q_{.975}(t^*_{stat})$ and $q_{.025}(t^*_{stat})$ based on our $B$ simulated values, and use these to form confidence intervals.

The Bootstrap for $t$ statistics

We are hence trying to approximate the distribution of:

\begin{align*} t_{stat} &= \frac{\bar{y} - \mu}{sd(y)/\sqrt{n}} \end{align*}

by that of:

\begin{align*} t^*_{stat} &= \frac{\bar{y}^* - \bar{y}}{sd(y^*)/\sqrt{n}}, \end{align*}

instead of simply asserting that it is $t$-distributed, like we did before. Let's see how well it works in our example through simulation...

The Bootstrap

To pull oneself up by one's bootstraps means "to succeed only on one's own efforts and abilities"

Rather than employing quantiles based on the $t$-distribution, we use simulation to try to estimate these quantiles!

We proceed based only on our efforts and abilities - using evidence presented in the data at hand, rather than postulating that the distribution of $t_{stat}$ is "hopefully $t$"

The Bootstrap

Method developed by Prof. Brad Efron at Stanford. Published in 1979 in a paper entitled "The Bootstrap: Another Look at the Jackknife."

Can be shown to work under very mild conditions
Works with continuous or discrete variables
Can also be used to create confidence intervals for quantities other than the population mean (for example, the population median, or the population slope)