Heteroskedasticity, estimation, and inference
Weighted least squares
Heteroskedasticity-consistent standard errors
In the first part of this course, we
Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models
Discussed diagnostics to assess whether these assumptions seem reasonable
Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?
What if the true trend doesn't seem linear?
What happens if the data is actually heteroskedastic?
What if normality seems violated?
We'll begin by assessing the implications of heteroskedasticity for estimation and inference (while deferring discussion of prediction intervals):
If the data are heteroskedastic, what happens to the expectation and variance of the ordinary least squares solution $\widehat{\beta}$? Are the usual standard errors still valid?
If the data are heteroskedastic and I know $\text{Var}(\varepsilon)$, should I still use $\widehat{\beta}$ as an estimator?
If the data are heteroskedastic but I don't know $\text{Var}(\varepsilon)$, can I find new standard errors for $\widehat{\beta}_j$ that work under heteroskedasticity?
Suppose the weaker linear model holds:
\begin{align*} y &= X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\; \text{Var}(\varepsilon) = \sigma^2_\varepsilon I_{n\times n} \end{align*}Let $\widehat{\beta}$ be the ordinary least squares solution,
\begin{align*} \widehat{\beta} = \arg\min\nolimits_\beta (y-X\beta)^\top(y-X\beta) &= (X^\top X)^{-1}X^\top y \end{align*}We have shown that under the weaker linear model,
$\text{E}(\widehat{\beta}) = \beta$ (unbiasedness)
$\text{Var}(\widehat{\beta}) = \sigma^2_\varepsilon (X^\top X)^{-1}$
We have data on the number of workers and the number of supervisors in 27 factories across the United States:
When discussing heteroskedasticity, we will consider the following generative model:
\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}where
$\varepsilon_1,...,\varepsilon_n$ are independent
$W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals).
Implies $W^{-1}$ is an $n\times n$ diagonal, with $1/w_{ii}$ on the diagonals and zeroes on the off-diagonals
Note that $W = W^{-1} = I_{n\times n}$ implies homoskedasticity.
Let $\widehat{\beta}$ be the ordinary least squares solution, and suppose the previous model for heteroskedasticity holds. Then:
\begin{align*} \text{E}(\widehat{\beta}) &= \beta\\ \text{Var}(\widehat{\beta}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1}\\ &\neq \sigma^2_\varepsilon (X^\top X)^{-1} \end{align*}Consequences
$\widehat{\beta}$ is still unbiased for $\beta$ under heteroskedasticity
The usual standard errors reported by R for $\widehat{\beta}_j$ are
invalid
under heteroskedasticity. They can be too large or too small.
Even if $\varepsilon$ is multivariate normal, the usual hypothesis tests and confidence intervals will also be invalid!
lm(supervisors~workers)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.44806 9.56201 1.511 0.143
workers 0.10536 0.01133 9.303 1.35e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.73 on 25 degrees of freedom
Multiple R-squared: 0.7759, Adjusted R-squared: 0.7669
F-statistic: 86.54 on 1 and 25 DF, p-value: 1.35e-09
Can trust the slope coefficient estimates themselves under heteroskedasticity, but not the standard errors, and not any of the $p$-values shown here!
We'll now consider two approaches to dealing with heteroskedasticity, depending upon whether we know the value of $W$.
If $W$ is known, we can construct an estimator that actually improves upon the ordinary least squares solution, through a technique called weighted least squares
If $W$ is unknown, we can construct standard errors for the OLS solution $\widehat{\beta}$ that are valid even under heteroskedasticity. These are called heteroskedasticity-consistent (HC) standard errors.
Let's assume that $W$ is known. Recall that $W^{-1}$ is a diagonal matrix with $w_{ii}^{-1}$ on its diagonals:
\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}, \end{align*}Let $W^{1/2}$ be a diagonal matrix with $w_{ii}^{1/2} = \sqrt{w_{ii}}$ on the diagonals. Consider the transformation:
\begin{align*} W^{1/2}y = W^{1/2}\left(X\beta +\varepsilon\right). \end{align*}$(W^{1/2}y)_i = \sqrt{w_{ii}}\times y_i$
$(W^{1/2}X)_{ij} = \sqrt{w_{ii}}\times x_{ij}$
Homoskedastic After Transformation
\begin{align*} \text{Var}(W^{1/2}y) = \text{Var}(W^{1/2}\varepsilon) = \sigma^2_\varepsilon I_{n\times n} \end{align*}Consider running an ordinary least squares regression of the transformed responses, $W^{1/2}y$, on the transformed predictors, $W^{1/2}X$.
\begin{align*} \widehat{\beta}_{WLS} &= \arg\min_\beta (W^{1/2}y-W^{1/2}X\beta)^\top(W^{1/2}y-W^{1/2}X\beta)\\ &= \arg\min_\beta (y-X\beta)^\top W(y-X\beta)\\ &=\textstyle \arg\min_\beta \sum_{i=1}^nw_{ii}(y_i-x_i^\top\beta)^2\\ &= (X^\top WX)^{-1}X^\top Wy \end{align*}This is called the weighted least squares solution. The terms $w_{ii}$ weight the individual observations
Larger $w_{ii}$: bigger penalty on $(y_i-x_i^\top\beta)^2$. Stronger emphasis on fit for individual $i$.
The weighted least square solution is $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$
Expectation and Variance of $\widehat{\beta}_{WLS}$
\begin{align*} \text{E}(\widehat{\beta}_{WLS}) &= \beta\\ \text{Var}(\widehat{\beta}_{WLS}) &= \sigma^2_\varepsilon (X^\top WX)^{-1} \end{align*}Comparing these results with what we obtained for the ordinary least squares solution, $\widehat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y$, under heteroskedasticity:
Same expectation (both are unbiased for $\beta$)
Different variance!
\begin{align*} \text{Var}(\widehat{\beta}_{OLS}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1}\\ &\neq \sigma^2_\varepsilon (X^\top WX)^{-1} \end{align*}$\widehat{\beta}_{OLS}$ and $\widehat{\beta}_{WLS}$ are both unbiased for $\beta$, but have different variances. Which should we prefer?
Under this generative model for $y$, it turns out that $\widehat{\beta}_{WLS}$ is the best linear unbiased estimator (BLUE) of $\beta$.
Linear: estimators $\tilde{\beta}$ that can be written in the form $\tilde{\beta} = C y$ for some matrix $C$
Unbiased: estimators $\tilde{\beta}$ such that $\text{E}(\tilde{\beta}) = \beta$
Best: For any vector of constants $a$, $\text{Var}(a^\top\tilde{\beta})$ has the smallest possible variance out of all unbiased linear estimators.
Gauss-Markov Theorem
Let $\widehat{\beta}_{WLS}$ be the weighted least squares solution. Suppose the above generative model and that $X$ is full rank. Let $\tilde{\beta}$ be another linear unbiased estimator of $\beta$. Then for any constant vector $a \in \mathbb{R}^{p+1}$
\begin{align*} \text{Var}(a^\top\widehat{\beta}_{WLS}) \leq \text{Var}(a^\top\tilde{\beta}) \end{align*}Note that if $W = W^{-1} = I$, $\widehat{\beta}_{WLS} = \widehat{\beta}_{OLS}$
Implication: Under homoskedasticity, $\widehat{\beta}_{OLS}$ is the best linear unbiased estimator for $\beta$
Provides additional justification for using OLS to estimate $\beta$ under the weaker and stronger linear model
Any other linear, unbiased estimator would be worse than $\widehat{\beta}_{OLS}$
In the general case where $W\neq I$, we see that $\widehat{\beta}_{WLS}$ is preferred to $\widehat{\beta}_{OLS}$
Take $\tilde{D} = (X^\top X)^{-1}X^\top$ in the proof, and you see that $\widehat{\beta}_{OLS}$ has larger variance than $\widehat{\beta}_{WLS}$.
Let $a = (a_0,...,a_p)^\top$ have a one in the $j$th entry, zero otherwise, you see that $\text{Var}(\widehat{\beta}_{OLS,j}) \geq \text{Var}(\widehat{\beta}_{WLS, j})$
See the end of the lecture deck for a proof of Gauss-Markov
Recall that our model for heteroskedasticity states
\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}so that the variances of $\varepsilon_i$ are proportional to the inverse of the weights.
Small $\text{var}(\varepsilon_i)$: larger weight for that observation when determining $\widehat{\beta}_{WLS}$.
Recall that our model for heteroskedasticity states
\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}so that the variances of $\varepsilon_i$ are proportional to the inverse of the weights.
Large $\text{var}(\varepsilon_i)$: smaller weight for that observation when determining $\widehat{\beta}_{WLS}$.
We have so far assumed that $W$ is known, but have made no assumption that $\sigma^2_\varepsilon$ is known
\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}If it is unknown, we can estimate it as follows
\begin{align*} \widehat{\sigma}^2_{\varepsilon, WLS} &= \frac{(y-X\widehat{\beta}_{WLS})^\top W(y-X\widehat{\beta}_{WLS})}{n-p-1} \end{align*}We can then estimate $\text{Var}(\widehat{\beta}_{WLS})$ by
\begin{align*} \widehat{V}(\widehat{\beta}_{WLS}) &= \widehat{\sigma}^2_{\varepsilon, WLS}(X^\top WX)^{-1} \end{align*}From here, we can derive standard errors for $\widehat{\beta}_{j, WLS}$ as the square roots of the diagonals of $\widehat{V}(\widehat{\beta}_{WLS})$.
In order to run weighted least squares, we need to know the value for $W$, as $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$. Is this reasonable? Sometimes...
If each observation $i$ represents the average response among $n_i$ equally weighted responses in a given group, then $w_{ii} = n_i$
In order to run weighted least squares, we need to know the value for $W$, as $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$. Is this reasonable? Sometimes...
If it's known that the standard deviation is proportional to a predictor variable $x_i$, then $w_{ii} = 1/x_i^2$
If we simply view $W$ as a weight matrix, without necessarily believing $\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}$, weighted least squares can account for different priorities on performance of predictions at different covariates $x_i$.
\begin{align*}\textstyle \widehat{\beta}_{WLS} = \arg\min_\beta \sum_{i=1}^nw_{ii}(y_i-x_i^\top\beta)^2 \end{align*}If I care more about accuracy of predictions at certain values of $x_i$, then I can place a larger weight $w_{ii}$ on the residuals for those observations
There could be predictor variables that encode sensitive attributes, or reflect observations of high value to stakeholders
Under this use case for weighted least squares, the Gauss-Markov justification for WLS goes out the window. You are instead arguing that the loss function underpinning WLS better aligns with your priorities.
In our example regressing supervisors on workers, it appears that the error standard deviation is proportional to the number of workers. Remembering that $\text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii}$
\begin{align*} w_{ii} &= 1/(\text{workers}_i)^2 \end{align*}lm(supervisors~workers, weights = 1/workers^2)
lm(formula = supervisors ~ workers, weights = 1/workers^2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.803296 4.569745 0.832 0.413
workers 0.120990 0.008999 13.445 6.04e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02266 on 25 degrees of freedom
Multiple R-squared: 0.8785, Adjusted R-squared: 0.8737
F-statistic: 180.8 on 1 and 25 DF, p-value: 6.044e-13
If in reality $\sigma^2_i \propto \text{workers}_i^2$, then we can trust these standard errors and $p$-values.
If we don't know $W$ (and hence, we don't know $\text{Var} (\varepsilon)$), there are two potential paths to explore
Simply use $\widehat{\beta}_{OLS}$, which doesn't require a weight matrix, and come up with standard error estimates that are robust to heteroskedasticity
If we don't know $W$ (and hence, we don't know $\text{Var}(\varepsilon)$), there are two potential paths to explore
Try to estimate $\text{Var}(\varepsilon)$ using the residuals from ordinary least squares. Then, use the inverse of the estimated variances as weights
Under homoskedasticity, we based standard errors off of
\begin{align*} \widehat{V}(\widehat{\beta}) &= \widehat{\sigma}^2_\varepsilon(X^\top X)^{-1}, \end{align*}which was based upon the fact that under homoskedasticity, $\text{Var}(\widehat{\beta}) = \sigma^2_\varepsilon(X^\top X)^{-1}$.
Under heteroskedasticity, this variance estimate is generally inconsistent (even under favorable conditions). As $n\rightarrow\infty$
\begin{align*} n\{\widehat{V}(\widehat{\beta}) - \text{Var}(\widehat{\beta})\} & \overset{p}{\rightarrow} C \neq 0 \end{align*}The notation $\overset{p}{\rightarrow}$ means "convergence in probability." It's a notion of convergence used for random variables.
For a "good" statistical estimator, you want the difference between the estimate and the truth to tend to zero. An estimator with this property is called consistent.
That the difference is not equal to zero above is troublesome.
Consider the heteroskedastic generative model:
\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}where $\varepsilon_1,...,\varepsilon_n$ are independent, and $\text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii}$
Suppose (unrealistically) that I knew the value of $\varepsilon_i$, and consider the random variable $\varepsilon_i^2$
\begin{align*} \text{E}(\varepsilon_i^2) &= \text{Var}(\varepsilon_i) + \{\text{E}(\varepsilon_i)\}^2\\ &= \text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii} \end{align*}Let $\text{diag}[\varepsilon_i^2]$ be an $n\times n$ diagonal matrix with $\varepsilon_i^2$ on the $i$th diagonal, and consider the variance estimator $(X^\top X)^{-1}X^\top\text{diag}[\varepsilon_i^2]X(X^\top X)^{-1}$
The proposed variance estimator is an unbiased estimator for $\text{Var}(\widehat{\beta})$, even under heteroskedasticity!
\begin{align*} &\text{E}((X^\top X)^{-1}X^\top\text{diag}[\varepsilon_i^2]X(X^\top X)^{-1})\\ &= (X^\top X)^{-1}X^\top(\sigma^2_\varepsilon W^{-1})X(X^\top X)^{-1}\\ &= \text{Var}(\widehat{\beta}) \end{align*}Regrettably we can't use this estimator because it depends on $\varepsilon$!
Instead, we'll use a similar estimator which replaces $\varepsilon$ with $e = (I-H)y$, the ordinary least squares residuals:
This is called a heteroskedasticity-consistent (HC) variance estimator.
Motivating the form:
$h_{ii}$ is the $ii$ diagonal of $H$ (the leverage)
Under homoskedasticity, $\text{E}(e_i^2/(1-h_{ii})) = \text{Var}(e_i/\sqrt{1-h_{ii}}) = \sigma^2_\varepsilon.$
Important robustness property: under suitable conditions, this variance estimator is consistent even if the truth is heteroskedastic:
\begin{align*} n\{\widehat{V}_{HC2}(\widehat{\beta}) - \text{Var}(\widehat{\beta})\} & \overset{p}{\rightarrow} 0 \end{align*}You may also see this estimator referred to as a sandwich estimator
$(X^\top X)^{-1}X^\top$ is the "bread"
$\text{diag}[e_i^2/(1-h_{ii})]$ is the "meat"
The square roots of the diagonal elements of $\widehat{V}_{HC2}(\widehat{\beta})$ provide alternative standard errors for $\widehat{\beta}$ that are consistent under heteroskedasticity:
Heteroskedasticity-Consistent Standard Errors
\begin{align*} \text{se}_{HC2}(\widehat{\beta}_j) = \sqrt{\widehat{V}_{HC2}(\widehat{\beta})_{(j+1), (j+1)}} \end{align*}Heteroskedasticity-Consistent Standard Errors
\begin{align*} \text{se}_{HC2}(\widehat{\beta}_j) = \sqrt{\widehat{V}_{HC2}(\widehat{\beta})_{(j+1), (j+1)}} \end{align*}These standard errors are valid regardless of whether or not the truth is homoskedastic.
If the truth actually is homoskedastic, the conventional standard errors perform better (more stable).
But rarely is one certain that the truth is homoskedastic...
(Why "HC2?" There are other standard errors that are consistent under heteroskedasticity. A famous paper comparing them labeled this one number "2.")
library(sandwich)
ols.supervisors = lm(supervisors~workers)
Vhat.HC2 = vcovHC(ols.supervisors, type = "HC2")
se.HC2 = sqrt(diag(Vhat.HC2))
se.HC2
> se.HC2
(Intercept) workers
11.48335462 0.01906731
Compare these to the standard errors derived under the assumption of homoskedasticity (highly questionable for this data).
summary(ols.supervisors)$coef[,2]
(Intercept) workers
9.56201165 0.01132565
Note that HC standard errors are much larger!
Let's compare 95% confidence intervals for $\widehat{\beta}_{workers}$ constructed using the $t$ distribution
Assuming Homoskedasticity (Seems Dubious)
\begin{align*} CI_{95\%, \text{homoskedastic}} &= [0.082, 0.129] \end{align*}Heteroskedasticity-Consistent
\begin{align*} CI_{95\%, \text{het. consistent}} &= [0.066, 0.145] \end{align*}Given clear indication of heteroskedasticity, we should only trust the confidence intervals using HC standard errors.
The usual standard errors assuming homoskedasticity would likely provide confidence intervals with coverage much lower than 95%
Let $\tilde{\beta} = \tilde{D}y$ (it's a linear estimator), and let $D = \tilde{D}-(X^\top WX)^{-1}X^\top W$, such that $\tilde{\beta}= \{(X^\top WX)^{-1}X^\top W + D\}y = \widehat{\beta}_{WLS} + Dy$.
If $\tilde{\beta}$ is unbiased, then $\text{E}(\tilde{\beta}) = \beta$. Note that
\begin{align*} \text{E}(\tilde{\beta}) &= \text{E}(\{(X^\top WX)^{-1}X^\top W + D\}y)\\ &= \text{E}((X^\top WX)^{-1}X^\top Wy) + \text{E}(Dy)\\ &= \text{E}(\widehat{\beta}_{WLS}) + \text{E}(Dy)\\ &= \beta + DX\beta, \end{align*}so that $DX$ must equal $0$ (otherwise, $\tilde{\beta}$ wouldn't generally be unbiased).
Now, compute $\text{Var}(\tilde{\beta})$ (recall $DX = 0$)
\begin{align*} \text{Var}(\tilde{\beta}) &= \sigma^2_\varepsilon \{(X^\top WX)^{-1}X^\top W + D\}W^{-1} \{WX(X^\top WX)^{-1}+ D^\top\}\\ &= \sigma^2_\varepsilon\left\{(X^\top WX)^{-1} + DX(X^\top WX)^{-1} + (X^\top WX)^{-1}X^\top D^\top\right.\\ &\left.+ DW^{-1}D^\top\right\}\\ &= \sigma^2_\varepsilon (X^\top WX)^{-1} + \sigma^2_\varepsilon DW^{-1}D^\top\\ &= \text{Var}(\widehat{\beta}_{WLS}) + \sigma^2_\varepsilon DW^{-1}D^\top \end{align*}Finally, note that for any vector $a$, $a^\top DW^{-1}D^\top a = (W^{-1/2}D^\top a)^\top W^{-1/2}D^\top a \geq 0$. Hence,
\begin{align*} \text{Var}(a^\top\tilde{\beta}) &= \sigma^2_\varepsilon a^\top(X^\top WX)^{-1}a + \sigma^2_\varepsilon a^\top DW^{-1}D^\top a\\ &= \text{Var}(a^\top\widehat{\beta}_{WLS}) + \sigma^2_\varepsilon a^\top DW^{-1}D^\top a\\ &\geq \text{Var}(a^\top\widehat{\beta}_{WLS}) \end{align*}