STATS 413: Lecture 15

  1. Heteroskedasticity, estimation, and inference

  2. Weighted least squares

  3. Heteroskedasticity-consistent standard errors

Addressing Departures from Assumptions

In the first part of this course, we

  • Derived certain consequences of the ordinary least squares slope coefficients under the assumptions of the weaker and stronger linear models

  • Discussed diagnostics to assess whether these assumptions seem reasonable

Now: What happens if certain assumptions seem unreasonable? Can we proceed? If so, how?

  • What if the true trend doesn't seem linear?

  • What happens if the data is actually heteroskedastic?

  • What if normality seems violated?

Heteroskedasticity, Estimation, and Inference

We'll begin by assessing the implications of heteroskedasticity for estimation and inference (while deferring discussion of prediction intervals):

  • If the data are heteroskedastic, what happens to the expectation and variance of the ordinary least squares solution $\widehat{\beta}$? Are the usual standard errors still valid?

  • If the data are heteroskedastic and I know $\text{Var}(\varepsilon)$, should I still use $\widehat{\beta}$ as an estimator?

  • If the data are heteroskedastic but I don't know $\text{Var}(\varepsilon)$, can I find new standard errors for $\widehat{\beta}_j$ that work under heteroskedasticity?

Properties of OLS Solution under Homoskedasticity

Suppose the weaker linear model holds:

\begin{align*} y &= X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\; \text{Var}(\varepsilon) = \sigma^2_\varepsilon I_{n\times n} \end{align*}

Let $\widehat{\beta}$ be the ordinary least squares solution,

\begin{align*} \widehat{\beta} = \arg\min\nolimits_\beta (y-X\beta)^\top(y-X\beta) &= (X^\top X)^{-1}X^\top y \end{align*}

We have shown that under the weaker linear model,

  • $\text{E}(\widehat{\beta}) = \beta$ (unbiasedness)

  • $\text{Var}(\widehat{\beta}) = \sigma^2_\varepsilon (X^\top X)^{-1}$

An Example: Workers and Supervisors

We have data on the number of workers and the number of supervisors in 27 factories across the United States:

Heteroskedasticity

When discussing heteroskedasticity, we will consider the following generative model:

\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

where

  • $\varepsilon_1,...,\varepsilon_n$ are independent

  • $W$ is a $n\times n$ diagonal matrix with $w_{ii} > 0$ on the diagonals (and zeroes on the off-diagonals).

  • Implies $W^{-1}$ is an $n\times n$ diagonal, with $1/w_{ii}$ on the diagonals and zeroes on the off-diagonals

Note that $W = W^{-1} = I_{n\times n}$ implies homoskedasticity.

Implications of Heteroskedasticity

Let $\widehat{\beta}$ be the ordinary least squares solution, and suppose the previous model for heteroskedasticity holds. Then:

\begin{align*} \text{E}(\widehat{\beta}) &= \beta\\ \text{Var}(\widehat{\beta}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1}\\ &\neq \sigma^2_\varepsilon (X^\top X)^{-1} \end{align*}

Implications of Heteroskedasticity

Consequences

  • $\widehat{\beta}$ is still unbiased for $\beta$ under heteroskedasticity

  • The usual standard errors reported by R for $\widehat{\beta}_j$ are invalid under heteroskedasticity. They can be too large or too small.

    • These standard errors are based upon $\widehat{V}(\widehat{\beta}) = \widehat{\sigma}^2_\varepsilon (X^\top X)^{-1}$, which was justified under homoskedasticity
  • Even if $\varepsilon$ is multivariate normal, the usual hypothesis tests and confidence intervals will also be invalid!

Summary Table in Our Example

lm(supervisors~workers)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.44806    9.56201   1.511    0.143
workers      0.10536    0.01133   9.303 1.35e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.73 on 25 degrees of freedom
Multiple R-squared:  0.7759,	Adjusted R-squared:  0.7669
F-statistic: 86.54 on 1 and 25 DF,  p-value: 1.35e-09

Can trust the slope coefficient estimates themselves under heteroskedasticity, but not the standard errors, and not any of the $p$-values shown here!

Accounting for Heteroskedasticity

\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

We'll now consider two approaches to dealing with heteroskedasticity, depending upon whether we know the value of $W$.

  • If $W$ is known, we can construct an estimator that actually improves upon the ordinary least squares solution, through a technique called weighted least squares

  • If $W$ is unknown, we can construct standard errors for the OLS solution $\widehat{\beta}$ that are valid even under heteroskedasticity. These are called heteroskedasticity-consistent (HC) standard errors.

Weighting the Responses and Predictors

Let's assume that $W$ is known. Recall that $W^{-1}$ is a diagonal matrix with $w_{ii}^{-1}$ on its diagonals:

\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}, \end{align*}

Let $W^{1/2}$ be a diagonal matrix with $w_{ii}^{1/2} = \sqrt{w_{ii}}$ on the diagonals. Consider the transformation:

\begin{align*} W^{1/2}y = W^{1/2}\left(X\beta +\varepsilon\right). \end{align*}

Weighting the Responses and Predictors

  • $(W^{1/2}y)_i = \sqrt{w_{ii}}\times y_i$

  • $(W^{1/2}X)_{ij} = \sqrt{w_{ii}}\times x_{ij}$

Homoskedastic After Transformation

\begin{align*} \text{Var}(W^{1/2}y) = \text{Var}(W^{1/2}\varepsilon) = \sigma^2_\varepsilon I_{n\times n} \end{align*}

Weighted Least Squares (WLS)

Consider running an ordinary least squares regression of the transformed responses, $W^{1/2}y$, on the transformed predictors, $W^{1/2}X$.

\begin{align*} \widehat{\beta}_{WLS} &= \arg\min_\beta (W^{1/2}y-W^{1/2}X\beta)^\top(W^{1/2}y-W^{1/2}X\beta)\\ &= \arg\min_\beta (y-X\beta)^\top W(y-X\beta)\\ &=\textstyle \arg\min_\beta \sum_{i=1}^nw_{ii}(y_i-x_i^\top\beta)^2\\ &= (X^\top WX)^{-1}X^\top Wy \end{align*}

This is called the weighted least squares solution. The terms $w_{ii}$ weight the individual observations

  • Larger $w_{ii}$: bigger penalty on $(y_i-x_i^\top\beta)^2$. Stronger emphasis on fit for individual $i$.

Properties of WLS Coefficients

The weighted least square solution is $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$

Expectation and Variance of $\widehat{\beta}_{WLS}$

\begin{align*} \text{E}(\widehat{\beta}_{WLS}) &= \beta\\ \text{Var}(\widehat{\beta}_{WLS}) &= \sigma^2_\varepsilon (X^\top WX)^{-1} \end{align*}

Properties of WLS Coefficients

Comparing these results with what we obtained for the ordinary least squares solution, $\widehat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y$, under heteroskedasticity:

  • Same expectation (both are unbiased for $\beta$)

  • Different variance!

    \begin{align*} \text{Var}(\widehat{\beta}_{OLS}) &= \sigma^2_\varepsilon (X^\top X)^{-1}X^\top W^{-1}X(X^\top X)^{-1}\\ &\neq \sigma^2_\varepsilon (X^\top WX)^{-1} \end{align*}

$\widehat{\beta}_{OLS}$ and $\widehat{\beta}_{WLS}$ are both unbiased for $\beta$, but have different variances. Which should we prefer?

Gauss-Markov Theorem

\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

Under this generative model for $y$, it turns out that $\widehat{\beta}_{WLS}$ is the best linear unbiased estimator (BLUE) of $\beta$.

  • Linear: estimators $\tilde{\beta}$ that can be written in the form $\tilde{\beta} = C y$ for some matrix $C$

  • Unbiased: estimators $\tilde{\beta}$ such that $\text{E}(\tilde{\beta}) = \beta$

  • Best: For any vector of constants $a$, $\text{Var}(a^\top\tilde{\beta})$ has the smallest possible variance out of all unbiased linear estimators.

Gauss-Markov Theorem

Gauss-Markov Theorem

Let $\widehat{\beta}_{WLS}$ be the weighted least squares solution. Suppose the above generative model and that $X$ is full rank. Let $\tilde{\beta}$ be another linear unbiased estimator of $\beta$. Then for any constant vector $a \in \mathbb{R}^{p+1}$

\begin{align*} \text{Var}(a^\top\widehat{\beta}_{WLS}) \leq \text{Var}(a^\top\tilde{\beta}) \end{align*}

Corollaries of Gauss-Markov

Note that if $W = W^{-1} = I$, $\widehat{\beta}_{WLS} = \widehat{\beta}_{OLS}$

  • Implication: Under homoskedasticity, $\widehat{\beta}_{OLS}$ is the best linear unbiased estimator for $\beta$

  • Provides additional justification for using OLS to estimate $\beta$ under the weaker and stronger linear model

  • Any other linear, unbiased estimator would be worse than $\widehat{\beta}_{OLS}$

Corollaries of Gauss-Markov

In the general case where $W\neq I$, we see that $\widehat{\beta}_{WLS}$ is preferred to $\widehat{\beta}_{OLS}$

  • Take $\tilde{D} = (X^\top X)^{-1}X^\top$ in the proof, and you see that $\widehat{\beta}_{OLS}$ has larger variance than $\widehat{\beta}_{WLS}$.

  • Let $a = (a_0,...,a_p)^\top$ have a one in the $j$th entry, zero otherwise, you see that $\text{Var}(\widehat{\beta}_{OLS,j}) \geq \text{Var}(\widehat{\beta}_{WLS, j})$

See the end of the lecture deck for a proof of Gauss-Markov

Intuition for the Weights

Recall that our model for heteroskedasticity states

\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}

so that the variances of $\varepsilon_i$ are proportional to the inverse of the weights.

  • Small $\text{var}(\varepsilon_i)$: larger weight for that observation when determining $\widehat{\beta}_{WLS}$.

    • Believe that $y_i$ is on average closer to $\text{E}(y_i) = x_i^\top\beta$, so prioritize the prediction equation fitting that point better

Intuition for the Weights

Recall that our model for heteroskedasticity states

\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}

so that the variances of $\varepsilon_i$ are proportional to the inverse of the weights.

  • Large $\text{var}(\varepsilon_i)$: smaller weight for that observation when determining $\widehat{\beta}_{WLS}$.

    • $y_i$ could be quite far from $\text{E}(y_i) = x_i^\top\beta$ due to variability, so should not prioritize the prediction equation fitting that particular point well.

Standard Errors for Weighted Least Squares

We have so far assumed that $W$ is known, but have made no assumption that $\sigma^2_\varepsilon$ is known

\begin{align*} \text{Var}(\varepsilon) &= \sigma^2_\varepsilon W^{-1}, \end{align*}

If it is unknown, we can estimate it as follows

\begin{align*} \widehat{\sigma}^2_{\varepsilon, WLS} &= \frac{(y-X\widehat{\beta}_{WLS})^\top W(y-X\widehat{\beta}_{WLS})}{n-p-1} \end{align*}

We can then estimate $\text{Var}(\widehat{\beta}_{WLS})$ by

\begin{align*} \widehat{V}(\widehat{\beta}_{WLS}) &= \widehat{\sigma}^2_{\varepsilon, WLS}(X^\top WX)^{-1} \end{align*}

From here, we can derive standard errors for $\widehat{\beta}_{j, WLS}$ as the square roots of the diagonals of $\widehat{V}(\widehat{\beta}_{WLS})$.

When Would We Know $W$?

In order to run weighted least squares, we need to know the value for $W$, as $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$. Is this reasonable? Sometimes...

  • If each observation $i$ represents the average response among $n_i$ equally weighted responses in a given group, then $w_{ii} = n_i$

    • Can occur when each observation $i$ represents the number of respondents/participants at a given location (in a city, state, country,...)
    • Can occur if running experiments of different sizes at each $x_i$.

When Would We Know $W$?

In order to run weighted least squares, we need to know the value for $W$, as $\widehat{\beta}_{WLS} = (X^\top WX)^{-1}X^\top Wy$. Is this reasonable? Sometimes...

  • If it's known that the standard deviation is proportional to a predictor variable $x_i$, then $w_{ii} = 1/x_i^2$

    • In many systems, larger values for the predictors introduce more variability in the responses.
    • For instance: more variation in net worth as age increases.
    • Would require subject matter knowledge.

Weighted Least Squares and Varying Priorities

If we simply view $W$ as a weight matrix, without necessarily believing $\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}$, weighted least squares can account for different priorities on performance of predictions at different covariates $x_i$.

\begin{align*}\textstyle \widehat{\beta}_{WLS} = \arg\min_\beta \sum_{i=1}^nw_{ii}(y_i-x_i^\top\beta)^2 \end{align*}
  • If I care more about accuracy of predictions at certain values of $x_i$, then I can place a larger weight $w_{ii}$ on the residuals for those observations

  • There could be predictor variables that encode sensitive attributes, or reflect observations of high value to stakeholders

Under this use case for weighted least squares, the Gauss-Markov justification for WLS goes out the window. You are instead arguing that the loss function underpinning WLS better aligns with your priorities.

Weighted Least Squares in Our Example

In our example regressing supervisors on workers, it appears that the error standard deviation is proportional to the number of workers. Remembering that $\text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii}$

\begin{align*} w_{ii} &= 1/(\text{workers}_i)^2 \end{align*}
lm(supervisors~workers, weights = 1/workers^2)

Summary Table for Weighted Least Squares

lm(formula = supervisors ~ workers, weights = 1/workers^2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.803296   4.569745   0.832    0.413
workers     0.120990   0.008999  13.445 6.04e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02266 on 25 degrees of freedom
Multiple R-squared:  0.8785,	Adjusted R-squared:  0.8737
F-statistic: 180.8 on 1 and 25 DF,  p-value: 6.044e-13

If in reality $\sigma^2_i \propto \text{workers}_i^2$, then we can trust these standard errors and $p$-values.

What if $\text{Var}(\varepsilon)$ is Unknown?

If we don't know $W$ (and hence, we don't know $\text{Var} (\varepsilon)$), there are two potential paths to explore

  1. Simply use $\widehat{\beta}_{OLS}$, which doesn't require a weight matrix, and come up with standard error estimates that are robust to heteroskedasticity

    • While coefficient estimates will no longer be optimal in Gauss-Markov sense, this will allow us to nonetheless conduct inference
    • We will develop this next.

What if $\text{Var}(\varepsilon)$ is Unknown?

If we don't know $W$ (and hence, we don't know $\text{Var}(\varepsilon)$), there are two potential paths to explore

  1. Try to estimate $\text{Var}(\varepsilon)$ using the residuals from ordinary least squares. Then, use the inverse of the estimated variances as weights

    • Called feasible weighted least squares (FWLS).
    • We won't consider this approach in this course (not particularly common in practice).

Shortcoming of Existing Standard Errors

Under homoskedasticity, we based standard errors off of

\begin{align*} \widehat{V}(\widehat{\beta}) &= \widehat{\sigma}^2_\varepsilon(X^\top X)^{-1}, \end{align*}

which was based upon the fact that under homoskedasticity, $\text{Var}(\widehat{\beta}) = \sigma^2_\varepsilon(X^\top X)^{-1}$.

Shortcoming of Existing Standard Errors

Under heteroskedasticity, this variance estimate is generally inconsistent (even under favorable conditions). As $n\rightarrow\infty$

\begin{align*} n\{\widehat{V}(\widehat{\beta}) - \text{Var}(\widehat{\beta})\} & \overset{p}{\rightarrow} C \neq 0 \end{align*}
  • The notation $\overset{p}{\rightarrow}$ means "convergence in probability." It's a notion of convergence used for random variables.

  • For a "good" statistical estimator, you want the difference between the estimate and the truth to tend to zero. An estimator with this property is called consistent.

  • That the difference is not equal to zero above is troublesome.

Towards Standard Errors under Heteroskedasticity

Consider the heteroskedastic generative model:

\begin{align*} y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon) = 0;\;\;\text{Var}(\varepsilon) = \sigma^2_\varepsilon W^{-1}. \end{align*}

where $\varepsilon_1,...,\varepsilon_n$ are independent, and $\text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii}$

Suppose (unrealistically) that I knew the value of $\varepsilon_i$, and consider the random variable $\varepsilon_i^2$

\begin{align*} \text{E}(\varepsilon_i^2) &= \text{Var}(\varepsilon_i) + \{\text{E}(\varepsilon_i)\}^2\\ &= \text{Var}(\varepsilon_i) = \sigma^2_\varepsilon/w_{ii} \end{align*}

Let $\text{diag}[\varepsilon_i^2]$ be an $n\times n$ diagonal matrix with $\varepsilon_i^2$ on the $i$th diagonal, and consider the variance estimator $(X^\top X)^{-1}X^\top\text{diag}[\varepsilon_i^2]X(X^\top X)^{-1}$

Towards Standard Errors under Heteroskedasticity

The proposed variance estimator is an unbiased estimator for $\text{Var}(\widehat{\beta})$, even under heteroskedasticity!

\begin{align*} &\text{E}((X^\top X)^{-1}X^\top\text{diag}[\varepsilon_i^2]X(X^\top X)^{-1})\\ &= (X^\top X)^{-1}X^\top(\sigma^2_\varepsilon W^{-1})X(X^\top X)^{-1}\\ &= \text{Var}(\widehat{\beta}) \end{align*}
  • Regrettably we can't use this estimator because it depends on $\varepsilon$!

  • Instead, we'll use a similar estimator which replaces $\varepsilon$ with $e = (I-H)y$, the ordinary least squares residuals:

\begin{align*} \widehat{V}_{HC2}(\widehat{\beta}) &= (X^\top X)^{-1}X^\top\text{diag}[e_i^2/(1-h_{ii})]X(X^\top X)^{-1} \end{align*}

This is called a heteroskedasticity-consistent (HC) variance estimator.

Heteroskedasticity-Consistent Variance Estimator

\begin{align*} \widehat{V}_{HC2}(\widehat{\beta}) &= (X^\top X)^{-1}X^\top\text{diag}[e_i^2/(1-h_{ii})]X(X^\top X)^{-1} \end{align*}

Motivating the form:

  • $h_{ii}$ is the $ii$ diagonal of $H$ (the leverage)

  • Under homoskedasticity, $\text{E}(e_i^2/(1-h_{ii})) = \text{Var}(e_i/\sqrt{1-h_{ii}}) = \sigma^2_\varepsilon.$

Heteroskedasticity-Consistent Variance Estimator

\begin{align*} \widehat{V}_{HC2}(\widehat{\beta}) &= (X^\top X)^{-1}X^\top\text{diag}[e_i^2/(1-h_{ii})]X(X^\top X)^{-1} \end{align*}

Important robustness property: under suitable conditions, this variance estimator is consistent even if the truth is heteroskedastic:

\begin{align*} n\{\widehat{V}_{HC2}(\widehat{\beta}) - \text{Var}(\widehat{\beta})\} & \overset{p}{\rightarrow} 0 \end{align*}

You may also see this estimator referred to as a sandwich estimator

  • $(X^\top X)^{-1}X^\top$ is the "bread"

  • $\text{diag}[e_i^2/(1-h_{ii})]$ is the "meat"

Heteroskedasticity-Consistent Standard Errors

The square roots of the diagonal elements of $\widehat{V}_{HC2}(\widehat{\beta})$ provide alternative standard errors for $\widehat{\beta}$ that are consistent under heteroskedasticity:

Heteroskedasticity-Consistent Standard Errors

\begin{align*} \text{se}_{HC2}(\widehat{\beta}_j) = \sqrt{\widehat{V}_{HC2}(\widehat{\beta})_{(j+1), (j+1)}} \end{align*}

Heteroskedasticity-Consistent Standard Errors

Heteroskedasticity-Consistent Standard Errors

\begin{align*} \text{se}_{HC2}(\widehat{\beta}_j) = \sqrt{\widehat{V}_{HC2}(\widehat{\beta})_{(j+1), (j+1)}} \end{align*}
  • These standard errors are valid regardless of whether or not the truth is homoskedastic.

  • If the truth actually is homoskedastic, the conventional standard errors perform better (more stable).

  • But rarely is one certain that the truth is homoskedastic...

(Why "HC2?" There are other standard errors that are consistent under heteroskedasticity. A famous paper comparing them labeled this one number "2.")

HC2 Standard Errors in Our Example

library(sandwich)
ols.supervisors = lm(supervisors~workers)
Vhat.HC2 = vcovHC(ols.supervisors, type = "HC2")
se.HC2 = sqrt(diag(Vhat.HC2))
se.HC2
> se.HC2
(Intercept)     workers
11.48335462  0.01906731

Compare these to the standard errors derived under the assumption of homoskedasticity (highly questionable for this data).

summary(ols.supervisors)$coef[,2]
(Intercept)     workers
 9.56201165  0.01132565

Note that HC standard errors are much larger!

Comparing Confidence Intervals in Our Example

Let's compare 95% confidence intervals for $\widehat{\beta}_{workers}$ constructed using the $t$ distribution

Assuming Homoskedasticity (Seems Dubious)

\begin{align*} CI_{95\%, \text{homoskedastic}} &= [0.082, 0.129] \end{align*}

Heteroskedasticity-Consistent

\begin{align*} CI_{95\%, \text{het. consistent}} &= [0.066, 0.145] \end{align*}
  • Given clear indication of heteroskedasticity, we should only trust the confidence intervals using HC standard errors.

  • The usual standard errors assuming homoskedasticity would likely provide confidence intervals with coverage much lower than 95%

(Optional) Proof of Gauss-Markov

Let $\tilde{\beta} = \tilde{D}y$ (it's a linear estimator), and let $D = \tilde{D}-(X^\top WX)^{-1}X^\top W$, such that $\tilde{\beta}= \{(X^\top WX)^{-1}X^\top W + D\}y = \widehat{\beta}_{WLS} + Dy$.

If $\tilde{\beta}$ is unbiased, then $\text{E}(\tilde{\beta}) = \beta$. Note that

\begin{align*} \text{E}(\tilde{\beta}) &= \text{E}(\{(X^\top WX)^{-1}X^\top W + D\}y)\\ &= \text{E}((X^\top WX)^{-1}X^\top Wy) + \text{E}(Dy)\\ &= \text{E}(\widehat{\beta}_{WLS}) + \text{E}(Dy)\\ &= \beta + DX\beta, \end{align*}

so that $DX$ must equal $0$ (otherwise, $\tilde{\beta}$ wouldn't generally be unbiased).

(Optional) Proof of Gauss-Markov

Now, compute $\text{Var}(\tilde{\beta})$ (recall $DX = 0$)

\begin{align*} \text{Var}(\tilde{\beta}) &= \sigma^2_\varepsilon \{(X^\top WX)^{-1}X^\top W + D\}W^{-1} \{WX(X^\top WX)^{-1}+ D^\top\}\\ &= \sigma^2_\varepsilon\left\{(X^\top WX)^{-1} + DX(X^\top WX)^{-1} + (X^\top WX)^{-1}X^\top D^\top\right.\\ &\left.+ DW^{-1}D^\top\right\}\\ &= \sigma^2_\varepsilon (X^\top WX)^{-1} + \sigma^2_\varepsilon DW^{-1}D^\top\\ &= \text{Var}(\widehat{\beta}_{WLS}) + \sigma^2_\varepsilon DW^{-1}D^\top \end{align*}

Finally, note that for any vector $a$, $a^\top DW^{-1}D^\top a = (W^{-1/2}D^\top a)^\top W^{-1/2}D^\top a \geq 0$. Hence,

\begin{align*} \text{Var}(a^\top\tilde{\beta}) &= \sigma^2_\varepsilon a^\top(X^\top WX)^{-1}a + \sigma^2_\varepsilon a^\top DW^{-1}D^\top a\\ &= \text{Var}(a^\top\widehat{\beta}_{WLS}) + \sigma^2_\varepsilon a^\top DW^{-1}D^\top a\\ &\geq \text{Var}(a^\top\widehat{\beta}_{WLS}) \end{align*}