STATS 413: Lecture 9

Inference for linear combinations of slope coefficients
Inference for conditional expectations

Recap: Interactions

Recall: Region takes on values SOUTH, WEST, MIDWEST.

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -555.5274   212.5735  -2.613  0.01817 *
RegionSOUTH         837.0791   292.4465   2.862  0.01079 *
RegionWEST          527.3892   393.8955   1.339  0.19823
Bonus                 1.8994     0.6424   2.957  0.00883 **
Advert                2.6585     0.2132  12.469 5.59e-10 ***
RegionSOUTH:Bonus    -0.8742     0.9571  -0.913  0.37382
RegionWEST:Bonus      0.7477     1.0145   0.737  0.47117
RegionSOUTH:Advert   -1.5546     0.4800  -3.239  0.00483 **
RegionWEST:Advert    -1.6964     0.5644  -3.006  0.00796 **

Recap: Interactions

Which region's slope on Bonus is estimated to be 1.8994?
What does the coefficient on RegionWEST:Advert represent?
How would I assess whether I need to include the interaction between Advert and Region in the above regression?

Linear Combinations of Slope Coefficients

Today we'll discuss the distribution of linear combinations of estimated slope coefficients:

\begin{align*} a^\top\hat{\beta} = \sum_{j=0}^p a_{j+1}\hat{\beta}_j \end{align*}

We've been studying a special case to date:

Setting $a_{j+1} = 1$ and the remaining entries zero returns the slope coefficient $\hat{\beta}_j$.

The more general form will be useful for, among other things:

Inference for slope coefficients for the non-reference category when doing regression with interactions
Inference for $E(Y\mid X = \tilde{x})$.

Reminder: Expectations of Random Vectors

We'll now explore the distribution of certain linear combinations of slopes. Let's first recall some basic properties about random vectors:

Let $Z\in \mathbb{R}^m$ be a random vector.

\begin{align*} Z &= (Z_1, \ldots, Z_m)^\top \\ E(Z) &= (E(Z_1), \ldots, E(Z_m))^\top \end{align*}

Properties of Expectation

For $A\in \mathbb{R}^{n\times m}$, $c\in \mathbb{R}^n$, $A,c$ constant:

\begin{align*} E(AZ + c) &= A\;E(Z) + c \end{align*}

Reminder: Variance/Covariance

For $Z\in \mathbb{R}^m$ a random vector:

\begin{align*} \text{Var}(Z) &= E[(Z-E(Z))(Z-E(Z))^\top]\\ &= E(ZZ^\top) - E(Z)E(Z)^\top\\ &=\left( \begin{array}{cccc} \text{var}(Z_1) & \cdots & \cdots & \text{cov}(Z_1, Z_m) \\ \text{cov}(Z_2, Z_1) & \text{var}(Z_2) & \cdots & \text{cov}(Z_2, Z_m) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(Z_m, Z_1) & \cdots & \cdots & \text{var}(Z_m) \end{array} \right) \end{align*}

Reminder: Variance/Covariance

Properties of Variance

For $A\in \mathbb{R}^{n\times m}$, $c \in \mathbb{R}^n$, $A,c$ constant:

\begin{align*} \text{Var}(AZ + c) &= A\;\text{Var}(Z)A^\top \end{align*}
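
As a small illustration of this property (the numbers are made up for the example), take $m = 2$, $A = (1, 1)$, and $c = 3$, so that $AZ + c = Z_1 + Z_2 + 3$. The constant $c$ shifts the mean but plays no role in the variance:

\begin{align*}
\text{Var}(Z) &= \left( \begin{array}{cc} 4 & 1 \\ 1 & 9 \end{array} \right)\\
\text{Var}(Z_1 + Z_2 + 3) &= \left( \begin{array}{cc} 1 & 1 \end{array} \right) \left( \begin{array}{cc} 4 & 1 \\ 1 & 9 \end{array} \right) \left( \begin{array}{c} 1 \\ 1 \end{array} \right)\\
&= 4 + 9 + 2(1) = 15
\end{align*}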

Reminder: $\hat{\beta}$

For $\hat{\beta}\in \mathbb{R}^{p+1}$ our vector of estimated coefficients:

\begin{align*} E(\hat{\beta}) &= (E(\hat{\beta}_0), \ldots,E(\hat{\beta}_p))^\top\\ \text{Var}(\hat{\beta}) &= E[(\hat{\beta}-E(\hat{\beta}))(\hat{\beta}-E(\hat{\beta}))^\top]\\ &=\left( \begin{array}{cccc} \text{var}(\hat{\beta}_0) & \cdots & \cdots & \text{cov}(\hat{\beta}_0, \hat{\beta}_p) \\ \text{cov}(\hat{\beta}_1, \hat{\beta}_0) & \text{var}(\hat{\beta}_1) & \cdots & \text{cov}(\hat{\beta}_1, \hat{\beta}_p) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(\hat{\beta}_p, \hat{\beta}_0) & \cdots & \cdots & \text{var}(\hat{\beta}_p) \end{array} \right) \end{align*}

Reminder: $\hat{\beta}$

Properties

For $A\in \mathbb{R}^{d\times (p+ 1)}$, $c \in \mathbb{R}^d$, $A,c$ constant:

\begin{align*} E(A\hat{\beta} + c) &= AE(\hat{\beta}) + c\\ \text{Var}(A\hat{\beta} + c) &= A\;\text{Var}(\hat{\beta})A^\top \end{align*}

Reminder: $\hat{\beta}$

Consider the stronger linear model

\begin{align*} y = X\beta + \varepsilon, \quad \varepsilon &\sim \text{MVN}(0, \sigma^2_\varepsilon I) \end{align*}

with $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = \sigma^2_\varepsilon I$.

Reminder: $\hat{\beta}$

Consequences

Suppose the stronger linear model holds. For $A\in \mathbb{R}^{d\times (p+1)}$, $c \in \mathbb{R}^d$, $A,c$ constant:

\begin{align*} E(A\hat{\beta} + c) &= A\beta + c\\ \text{Var}(A\hat{\beta} + c) &= \sigma^2_\varepsilon A(X^\top X)^{-1}A^\top\\ A\hat{\beta} + c &\sim \text{MVN}(A\beta + c, \sigma^2_\varepsilon A(X^\top X)^{-1}A^\top) \end{align*}

Multivariate normality is a consequence of the following fact: if $Z$ follows a multivariate normal distribution, then so does $AZ + c$.
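
For reference, here is a sketch of where the mean and variance come from. It assumes the usual OLS formula $\hat{\beta} = (X^\top X)^{-1}X^\top y$ from earlier lectures and treats $X$ as fixed. The results for $A\hat{\beta} + c$ then follow by applying the properties of expectation and variance above.

\begin{align*}
E(\hat{\beta}) &= (X^\top X)^{-1}X^\top E(y) = (X^\top X)^{-1}X^\top X\beta = \beta\\
\text{Var}(\hat{\beta}) &= (X^\top X)^{-1}X^\top\,\text{Var}(y)\,X(X^\top X)^{-1}\\
&= (X^\top X)^{-1}X^\top(\sigma^2_\varepsilon I)X(X^\top X)^{-1}\\
&= \sigma^2_\varepsilon(X^\top X)^{-1}
\end{align*}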

Estimating $\text{Var}(\hat{\beta})$

In practice, we don't have access to $\sigma^2_\varepsilon$.

We estimate it by $\hat{\sigma}^2_\varepsilon$, the mean squared error.

We can then estimate $\text{Var}(\hat{\beta})$ by $\hat{V}(\hat{\beta})$, which simply replaces $\sigma^2_\varepsilon$ with its estimate.

Estimating $\text{Var}(\hat{\beta})$

Estimated Variance Matrices

Suppose the weaker linear model holds. For $A\in \mathbb{R}^{d\times (p+1)}$, $c \in \mathbb{R}^d$, $A,c$ constant:

\begin{align*} \hat{V}(\hat{\beta}) &= \hat{\sigma}^2_\varepsilon(X^\top X)^{-1}\\ \hat{V}(\hat{\beta}_j) &= \hat{\sigma}^2_\varepsilon(X^\top X)^{-1}_{(j+1), (j+1)}\\ \hat{V}(A\hat{\beta}+c)&= \hat{\sigma}^2_\varepsilon A(X^\top X)^{-1}A^\top \end{align*}

Note that $\text{se}(\hat{\beta}_j) = \sqrt{\hat{V}(\hat{\beta}_j)}$.
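
The formulas above can be checked numerically. The following is a minimal sketch in R on simulated data (the variables `x1`, `x2`, and `y` are hypothetical, not the lecture's sales data). It assumes the mean squared error is $\text{RSS}/(n-p-1)$, which is the squared residual standard error that `lm` reports.

```r
# Simulated data for illustration only (not the lecture's dataset)
set.seed(413)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)        # n x (p+1) design matrix, intercept column included
p   <- ncol(X) - 1

# Assumed definition of the mean squared error: RSS / (n - p - 1)
sigma2_hat <- sum(resid(fit)^2) / (n - p - 1)

# V-hat(beta-hat) = sigma2_hat * (X'X)^{-1}
V_hat <- sigma2_hat * solve(t(X) %*% X)

# Standard errors are the square roots of the diagonal entries
se_manual <- sqrt(diag(V_hat))

# Compare with R's built-in results
all.equal(V_hat, vcov(fit), check.attributes = FALSE)
all.equal(se_manual, summary(fit)$coefficients[, "Std. Error"],
          check.attributes = FALSE)
```

In practice `vcov(fit)` returns $\hat{V}(\hat{\beta})$ directly, so the manual calculation is only there to connect the code to the formula.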

Application: Slopes with Interactions

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -555.5274   212.5735  -2.613  0.01817 *
RegionSOUTH         837.0791   292.4465   2.862  0.01079 *
RegionWEST          527.3892   393.8955   1.339  0.19823
Bonus                 1.8994     0.6424   2.957  0.00883 **
Advert                2.6585     0.2132  12.469 5.59e-10 ***
RegionSOUTH:Bonus    -0.8742     0.9571  -0.913  0.37382
RegionWEST:Bonus      0.7477     1.0145   0.737  0.47117
RegionSOUTH:Advert   -1.5546     0.4800  -3.239  0.00483 **
RegionWEST:Advert    -1.6964     0.5644  -3.006  0.00796 **

The output provides standard errors for the slopes on Bonus and Advert in the Midwest (the reference category). What about the South and West?

Application: Slopes with Interactions

Let $\hat{\beta}_\text{Advert, South}$ be the estimated slope on Advert for the southern region. Note that

\begin{align*} \hat{\beta}_\text{Advert, South} &= \hat{\beta}_\text{Advert} + \hat{\beta}_\text{RegionSOUTH:Advert} \end{align*}
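
Plugging in the estimates from the regression output, the point estimates of the region-specific slopes on Advert work out as follows (the western slope is formed the same way, using the RegionWEST:Advert coefficient):

\begin{align*}
\hat{\beta}_\text{Advert, South} &= 2.6585 + (-1.5546) = 1.1039\\
\hat{\beta}_\text{Advert, West} &= 2.6585 + (-1.6964) = 0.9621
\end{align*}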

Now, using the general formula for the variance of the sum of two random variables:

\begin{align*} \text{var}(\hat{\beta}_\text{Advert, South}) &= \text{var}(\hat{\beta}_\text{Advert}) + \text{var}(\hat{\beta}_\text{RegionSOUTH:Advert})\\ &+ 2\text{cov}(\hat{\beta}_\text{Advert}, \hat{\beta}_\text{RegionSOUTH:Advert}) \end{align*}

The standard summary output provides estimates of $\text{var}(\hat{\beta}_\text{Advert})$ and $\text{var}(\hat{\beta}_\text{RegionSOUTH:Advert})$ (the squared standard errors), but provides no information about $\text{cov}(\hat{\beta}_\text{Advert}, \hat{\beta}_\text{RegionSOUTH:Advert})$.

Instead, this covariance is an off-diagonal entry of $\text{Var}(\hat{\beta})$.

Application: Slopes with Interactions

Consider the slope on Advert for the southern region

\begin{align*} \hat{\beta}_\text{Advert, South} &= \hat{\beta}_\text{Advert} + \hat{\beta}_\text{RegionSOUTH:Advert} \end{align*}

Define $a = (a_1,\ldots,a_9)^\top$ as:

\begin{align*} a_i &= \begin{cases} 1 & i=5,8\\ 0 & \text{otherwise} \end{cases} \end{align*}

Then,

\begin{align*} \hat{\beta}_\text{Advert, South} &= a^\top\hat{\beta}, \end{align*}

where $\hat{\beta}$ contains all 9 estimated coefficients.

 [1,] "(Intercept)"
 [2,] "RegionSOUTH"
 [3,] "RegionWEST"
 [4,] "Bonus"
 [5,] "Advert"
 [6,] "RegionSOUTH:Bonus"
 [7,] "RegionWEST:Bonus"
 [8,] "RegionSOUTH:Advert"
 [9,] "RegionWEST:Advert"

Application: Slopes with Interactions

Let $X \in \mathbb{R}^{n\times 9}$ be the design matrix from lm(Sales~Region*Bonus + Region*Advert)

\begin{align*} \hat{\beta}_\text{Advert, South} &= a^\top\hat{\beta}\\ \text{var}(\hat{\beta}_\text{Advert,South}) &= \sigma^2_{\varepsilon}a^\top(X^\top X)^{-1}a\\ \text{se}(\hat{\beta}_\text{Advert,South}) &= \hat{\sigma}_{\varepsilon}\sqrt{a^\top(X^\top X)^{-1}a} \end{align*}

Can't calculate this standard error using summary output alone.
Can calculate it if we have $\hat{\sigma}_{\varepsilon}$ and the design matrix.
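
A minimal sketch of this calculation in R follows. The data frame `sales` is simulated purely for illustration (the lecture's data are not reproduced here), so the numbers will not match the output above. The confidence interval at the end assumes the usual $t$ reference distribution with residual degrees of freedom.

```r
# Simulated stand-in for the sales data (hypothetical values)
set.seed(413)
n <- 90
sales <- data.frame(
  Region = factor(rep(c("MIDWEST", "SOUTH", "WEST"), each = n / 3)),
  Bonus  = runif(n, 200, 400),
  Advert = runif(n, 300, 700)
)
sales$Sales <- 100 + 2 * sales$Bonus + 2.5 * sales$Advert -
  1.5 * sales$Advert * (sales$Region == "SOUTH") + rnorm(n, sd = 50)

# MIDWEST is the reference level (first alphabetically)
fit <- lm(Sales ~ Region * Bonus + Region * Advert, data = sales)

# Build a by coefficient name: 1 for Advert and RegionSOUTH:Advert, 0 elsewhere
a <- setNames(rep(0, length(coef(fit))), names(coef(fit)))
a[c("Advert", "RegionSOUTH:Advert")] <- 1

est <- sum(a * coef(fit))             # a' beta-hat: slope on Advert in the South
V   <- vcov(fit)                      # sigma-hat^2 (X'X)^{-1}
se  <- sqrt(drop(t(a) %*% V %*% a))   # sqrt(a' V a)

# Same quantity via var + var + 2 cov
se_alt <- sqrt(V["Advert", "Advert"] +
               V["RegionSOUTH:Advert", "RegionSOUTH:Advert"] +
               2 * V["Advert", "RegionSOUTH:Advert"])

# 95% confidence interval (assumes a t distribution with residual df)
ci <- est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se
```

Building `a` by name rather than by position avoids relying on the order in which R arranges the coefficients.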

Application: Slopes with Interactions

Note: the (5,8) entry of $\sigma^2_\varepsilon (X^\top X)^{-1}$ equals $\text{cov}(\hat{\beta}_\text{Advert}, \hat{\beta}_\text{RegionSOUTH:Advert})$

Show: our choice of $a$ returns
\[ \begin{aligned} &\text{var}(\hat{\beta}_\text{Advert}) + \text{var}(\hat{\beta}_\text{RegionSOUTH:Advert}) \\ &\quad+ 2\text{cov}(\hat{\beta}_\text{Advert}, \hat{\beta}_\text{RegionSOUTH:Advert}) \end{aligned} \]
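
One way to see this, sketched here with $V = \sigma^2_\varepsilon(X^\top X)^{-1}$ and $e_i$ denoting the $i$th standard basis vector in $\mathbb{R}^9$ (notation introduced only for this sketch): since $a = e_5 + e_8$ and $V$ is symmetric,

\begin{align*}
a^\top V a &= (e_5 + e_8)^\top V (e_5 + e_8)\\
&= V_{5,5} + V_{8,8} + V_{5,8} + V_{8,5}\\
&= V_{5,5} + V_{8,8} + 2V_{5,8},
\end{align*}

where $V_{5,5} = \text{var}(\hat{\beta}_\text{Advert})$, $V_{8,8} = \text{var}(\hat{\beta}_\text{RegionSOUTH:Advert})$, and $V_{5,8} = \text{cov}(\hat{\beta}_\text{Advert}, \hat{\beta}_\text{RegionSOUTH:Advert})$.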

Reminder: Inference for Expectations

Suppose $y_1,\ldots,y_n$ are iid and normally distributed with $E(y_i) = \mu_y$ and $\text{var}(y_i) = \sigma^2$.

Estimate $\mu_y$ by $\hat{\mu}_y = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$
$\text{SD}(\hat{\mu}_y) = \sigma/\sqrt{n}$
$\text{SE}(\hat{\mu}_y) = \hat{\sigma}/\sqrt{n}$, where $\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(y_i-\bar{y})^2}$
Form confidence intervals / perform hypothesis tests for $\mu_y$ using $\hat{\mu}_y$, $\text{SE}(\hat{\mu}_y)$, and the $t_{n-1}$ distribution.

For instance, a $100(1-\alpha)\%$ confidence interval for $\mu_y$ takes the form:

\begin{align*} \hat{\mu}_y \pm t_{1-\alpha/2, n-1}\,\text{SE}(\hat{\mu}_y) \end{align*}
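
The interval can be computed directly in R. This is a small sketch on simulated values (the vector `y` is hypothetical), with `t.test` used only as a cross-check:

```r
# Simulated sample for illustration only
set.seed(413)
y <- rnorm(25, mean = 70, sd = 3)

n      <- length(y)
mu_hat <- mean(y)
se     <- sd(y) / sqrt(n)        # sd() uses the 1/(n-1) divisor
alpha  <- 0.05

# mu_hat +/- t_{1 - alpha/2, n - 1} * SE(mu_hat)
ci <- mu_hat + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se

# Built-in equivalent
t.test(y, conf.level = 1 - alpha)$conf.int
```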

New: Inference for Conditional Expectations

Now, suppose $y_1,\ldots,y_n$ are generated from the (stronger) linear model:

\begin{align*} y_i &= \beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip} + \varepsilon_i,\\ \varepsilon_i &\overset{\text{iid}}{\sim}N(0, \sigma^2_\varepsilon). \end{align*}

For any particular value for the predictors $\tilde{x} = (1, \tilde{x}_1,\ldots,\tilde{x}_p)^\top$:

\begin{align*} \mu_{y\mid \tilde{x}} = E(y\mid x = \tilde{x}) &= \beta_0 + \beta_1\tilde{x}_{1} + \ldots + \beta_p\tilde{x}_{p}\\ &= \tilde{x}^\top\beta \end{align*}

New: Inference for Conditional Expectations

After running OLS regression and obtaining my estimate $\hat{\beta}$:

\begin{align*} \hat{\mu}_{y\mid \tilde{x}} = \hat{E}(y\mid x = \tilde{x}) &= \hat{\beta}_0 + \hat{\beta}_1\tilde{x}_{1} + \ldots + \hat{\beta}_p\tilde{x}_{p}\\ &= \tilde{x}^\top\hat{\beta} \end{align*}

How can I perform inference on $\mu_{y\mid \tilde{x}}$ (hypothesis tests, confidence intervals, etc.)?

Understanding the Distinction

Inference on $\mu_y$

Across the entire population of individuals, what's the expected height of a male?
Unconditional expectation of male height.

Inference on $\mu_{y\mid \tilde{x}}$

Across the entire population of individuals whose fathers were 76 inches tall, what's the expected height of a male?
Conditional expectation of male height: condition on the value of an explanatory variable (here, height of father).
$\tilde{x} = (1, 76)^\top$
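
Since $\hat{\mu}_{y\mid \tilde{x}} = \tilde{x}^\top\hat{\beta}$ is a linear combination of the estimated coefficients with $a = \tilde{x}$, the estimated variance formula from earlier gives $\text{se}(\hat{\mu}_{y\mid \tilde{x}}) = \hat{\sigma}_\varepsilon\sqrt{\tilde{x}^\top(X^\top X)^{-1}\tilde{x}}$. The sketch below applies this in R to simulated father and son heights (the variables `father` and `son` are hypothetical). It assumes a $t$ reference distribution with residual degrees of freedom, and compares the result with `predict`.

```r
# Simulated heights in inches, for illustration only
set.seed(413)
n      <- 100
father <- rnorm(n, mean = 69, sd = 2.5)
son    <- 35 + 0.5 * father + rnorm(n, sd = 2.3)

fit <- lm(son ~ father)

# x-tilde = (1, 76)': intercept term plus a father's height of 76 inches
x_tilde <- c(1, 76)

mu_hat <- sum(x_tilde * coef(fit))                          # x-tilde' beta-hat
se     <- sqrt(drop(t(x_tilde) %*% vcov(fit) %*% x_tilde))  # sqrt(x' V-hat x)

# 95% confidence interval for the conditional expectation
ci <- mu_hat + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se

# Built-in equivalent: columns fit, lwr, upr
pred <- predict(fit, newdata = data.frame(father = 76),
                interval = "confidence", level = 0.95)
```

Here `interval = "confidence"` targets the conditional mean $\mu_{y\mid \tilde{x}}$, not the height of an individual son.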