STATS 413: Lecture 4

Estimation for linear models
Interpreting linear models
Statistical Properties of $\hat{\beta}$

Finding the Line of Best Fit

The model is:
\[ y_i = \beta_0 + \beta_1 x_{i,1} + ... + \beta_px_{i,p} + \varepsilon_i,~~i=1, \dots, n. \]
Goal: given $({x}_i, y_i),~i=1,\ldots,n$, estimate $\beta_0, \beta_1,...,\beta_p$
We adopt the same least squares criterion
\[\textstyle \min_{\beta_0, \beta_1,..,\beta_p}\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i,1}+...+\beta_px_{i,p}))^2 \]
For any choice of $\{\beta_j\}$, what would the sum of squared errors from the response surface be? Choose estimators to minimize this error.

Least Squares Estimate

We estimate the parameters using least squares, i.e. calculate

\[\textstyle \mbox{arg min}_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}))^2 \]

Least Squares Estimate

Matrix Notation

Let $X$ be the design matrix and $y$ be the response vector:

\[ y = \left( \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right), ~~X = \left ( \begin{array}{cccc} 1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & x_{i,j} & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{array} \right). \]

Further let $\beta$ be the (vector of) regression coefficients/parameters and $\varepsilon$ be the error vector

\[ \beta = \left( \begin{array}{c} \beta_0 \\ \vdots \\ \beta_p \end{array} \right), ~~\varepsilon = \left( \begin{array}{c} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{array} \right) \]

Matrix Notation

Then we can write the model in matrix notation:

\[ {\color{blue}y}_{n \times 1} {\color{blue}=} {\color{blue}X}_{n \times (p+1)} {\color{blue}\beta}_{(p+1) \times 1} {\color{blue}+} {\color{blue}\varepsilon}_{n \times 1}. \]

where:

$\textbf{E}(y) = X\beta \Leftrightarrow \textbf{E}(\varepsilon) = 0$
- $\textbf{E}(y) = (\textbf{E}(y_1), \textbf{E}(y_2),...,\textbf{E}(y_n))^\top $; $\textbf{E}(\varepsilon) = (\textbf{E}(\varepsilon_1),....,\textbf{E}(\varepsilon_n))^\top $.
$\text{Var}(y) = \text{Var}(\varepsilon) = \sigma^2_{\varepsilon} {\color{blue}I}_{n\times n}$
- $I$ is the identity matrix (1s on the diagonal, zeroes on the off-diagonal)
- In general, if $Z \in \mathbf{R}^n$ is a random vector, then $\text{Var}(Z)$ is an $n\times n$ matrix with $\text{cov}(Z_i, Z_j)$ in the $ij$ entry.

Estimate $\beta$

Observe $y$ and $X$. How do we estimate $\beta$?

Answer: Choose the $\beta$ that minimize the sum of squared errors

Least squares criterion, translated to matrix form:

\[ \begin{aligned}\textstyle \min_{\beta} \sum_{i=1}^n \varepsilon_i^2 &= \textstyle \min_{\beta} \varepsilon^\top \varepsilon \\ &= \textstyle \min_{\beta} (y-X\beta)^\top (y-X\beta) \\ &= \textstyle \min_{\beta} y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta \end{aligned} \]

We 1st differentiate the criterion with respect to $\beta$. Doing so, we attain

\[\textstyle \frac{\partial}{\partial \beta}(y-X\beta)^\top (y-X\beta) = -2X^\top y + 2(X^\top X)\beta \]

Estimate $\beta$

To find the minimizer, we now set the derivative equal to zero (our objective function is convex). This yields the normal equation that our estimator must satisfy:

\[ X^\top X \hat{\beta} = X^\top y \]

"Hat" notation: used for estimates

We can isolate $\hat{\beta}$ on the left-hand side, providing a unique solution, so long as $X^\top X$ is invertible $\Leftrightarrow \text{rank}(X) = p+1$.

Estimate $\beta$

When can this fail?

$p+1 > n$ (more predictor variables than we have observations). Common in genome-wide association studies, for instance.
Collinearity: a column of $X$ can be expressed as a linear combination of the others

OLS Estimators

If $X^\top X$ is invertible, unique solution:

\[ \hat{\beta}_{(p+1)\times 1} = (X^\top X)^{-1}X^\top y \]

first entry of $\hat{\beta}$ is the estimated intercept
remaining $p$ entries are the estimated slopes for each predictor variable

These are the coefficients reported by R when using the lm command.

Additional Quantities (in vector notation)

Fitted values for the $n$ observations in our data set
\[ \begin{aligned} \hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} = x_i^\top \hat{\beta}\;\; (i=1,...,n)\\ \hat{y} &= X\hat{\beta} \end{aligned} \]
Fitted model (for other values of ${x})$
\[ \hat{f}({x}) = \hat {\beta}_0 + \hat{\beta}_1 x_{1} + \cdots + \hat{\beta}_p x_{p} = x^\top \hat{\beta} \]

Additional Quantities (in vector notation)

Residuals:
\[ \begin{aligned} e_i &= y_i - \hat{y}_i = y_i - x_i^\top \hat{\beta}\;\; (i=1,...,n)\\ e &= y -\hat{y} = y - X\hat{\beta} \end{aligned} \]

Note that the residuals $e_i$ are different from the error terms $\varepsilon_i$ in our statement of the regression model. Why?

Hat Matrix

We introduce an additional quantity called the "Hat" matrix, which will be useful in what follows.

$\hat{y} = X\hat{\beta} = X\left({X}^\top X\right)^{-1}{X}^\top y = Hy$, where $$ {\color{blue}H = X\left({X}^\top X\right)^{-1}{X}^\top } $$ (Why the "hat" matrix? It puts the "hats" on $y$!)
Fitted values: $\hat{y} = Hy$
Residuals: $e = y - \hat{y} = (I-H)y$, where $I$ is the $n\times n$ identity matrix

Projection

In general, a matrix $P$ is an orthogonal projection matrix if:

${P}^\top = P$ ($P$ is symmetric).
$PP = P$ ($P$ is idempotent).

Verify: $H = X\left({X}^\top X\right)^{-1}{X}^\top $ satisfies these two conditions

Projection

$H$ is an orthogonal projection matrix onto the columnspace of $X$. It projects the responses $y$ onto the column space of $X$, returning $\hat{y} = Hy = X\hat{\beta}$.

The column space of $X$ is the set of vectors $a$ which may be expressed as $a=Xb$ for some $b$.
$\hat{y} = Hy = X\hat{\beta}$ is the vector that is closest (in terms of Euclidean distance) to $y$, subject to residing in the column space of $X$

Vector Space Interpretation

$\min_{\beta} {(y-X\beta)}^\top (y-X\beta)=\|y-X\beta\|_2^2$ can be interpreted as minimizing the Euclidean distance between $y$ and the linear space spanned by the columns of $X$ (the column space of $X$).

Predictor Space and Residual Space

The following orthogonality holds between the residual vector $e$ and elements of the column space of $X$ (also referred to in regression as "predictor space"):

Orthogonality of Residuals and Predictors

For any vector $a \in \mathbf{R}^n$ in column space of $X$ (i.e. any vector $a\in \mathbf{R}^n$ which can be written as $a = Xb$ for some vector $b\in \mathbf{R}^{p+1}$)

\begin{align*} {a}^\top e = 0 \end{align*}

Predictor Space and Residual Space

Intuition: residuals are what's "left over" after running our regression, that which cannot be predicted or explained by a linear function of our predictor variables

Interpreting Parameters of Mean Function

Recall that under the weaker linear model, the mean function takes the form

\[ \textbf{E}(y_i \mid x_i) = \beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p} = x_i^\top \beta \]

Do the parameters $\beta_0,\beta_1,....,\beta_p$ have natural interpretations?
Ideally, we'd like to relate scientific hypotheses about relationships between variables to numerical values for parameters in the model

Interpretation of Intercept

Intercept ($\beta_0$)

$\beta_0$ is the true value for $\textbf{E}(y_i\mid {x}_i)$ when $x_{i,1} = x_{i,2}=...=x_{i,p} = 0$; $\hat{\beta}_0$ is its estimate.
We don't often concern ourselves with the intercept. Oftentimes the point $x_{i,1}=x_{i,2}=\dots =x_{i,p}=0$ is an extrapolation (for instance, if all the covariates take on positive values).
Generally, we simply think of the intercept as providing the proper vertical placement for the mean function $\textbf{E}(y_i \mid x_i)$.

Interpretation of Slopes

Slope coefficients $(\beta_1,...,\beta_p)$ provide comparisons between the expectations at different points $x$

Interpretatation of $\beta_j$: Ceteris Paribus

Holding all other variables equal (ceteris paribus), two individuals who differ in variable $x_j$ by 1 unit are expected to differ in $Y$ by $\beta_j$ units.

Interpretation of Slopes

That is, $\beta_j$ is the difference in expected responses for two individuals who differ in covariate $x_j$ by 1 unit, but have the same values for all of the other predictor variables $\{1,...,p\}\setminus \{j\}$ (i.e., all else equal).

$\hat{\beta}_j$ is our estimate of this difference in expectations.

What the Slope is Not

Notice the very specific verbiage: All else equal, two individuals who differ in $x_j$ by one unit are expected to differ in $y$ by $\beta_j$ units...

In algebra, we said that all else equal, a change in $x_j$ by 1 unit results in a change in $y$ by $\beta_j$.
That is ascribing a causal interpretation to the slope
We usually don't get to change the values of $x_j$ for a particular individual (unless we're running an experiment); rather, we observe different values for $x_j$ for different individuals, along with the differences in their responses.
Two individuals who differ in $x$ by one unit might differ based on other omitted variables.

Correlation is not causation.

Back to the Mall

> \text{cor}(sales, income)
[1] 0.6484524

(Intercept)       income
    120.458        6.081

> \text{cor}(sales, competitors)
[1] -0.05364514

(Intercept)  competitors
    517.434       -3.595

Observational Comparisons and Slope Coefficients

Danger in drawing causal conclusions from slope coefficients: Just because the slope is near zero doesn't mean there isn't an effect.

Our interpretation of the slope says nothing about what will happen when I change $x$. Rather, it's an observational comparison between two predetermined values for $x$.
We have to worry about omitted variables

Using a hypothesis test we'll derive in a few lectures, the slope coefficient on competitors in the simple regression of sales on competitors is statistically insignificant.

Can we think of an omitted variable that also is predictive of sales?

Understanding the Difference in Interpretation

Let's compare the slopes on competitors in the simple and multiple regressions

Simple Regression

(Intercept)  competitors
    517.434       -3.595

For two stores where one location has one more competitor than the other, our estimate of the expected difference in sales per square foot is -$3.60.

Understanding the Difference in Interpretation

Multiple Regression

(Intercept)  competitors       income
     91.422      -25.135        7.495

For two stores with the same median income of consumers but where one location has one more competitor than the other, our estimate of the expected difference in sales per square foot is -$25.14.

Properties of $\hat{\beta}$

With the interpretation of $\hat{\beta}$ established, we'll next investigate statistical properties of $\hat{\beta}$ when viewed as an estimator of ${\beta}$

Is it unbiased?
Can we characterize the variability of $\hat{\beta}$?
Why should we prefer $\hat{\beta}$, found by minimizing the sum of square error, over some other estimator of $\beta$ we might concoct?

Properties of $\hat{\beta}$

First, let's remind ourselves of the source of statistical uncertainty in the first place:

I only have one data set in front of me. Using that one data set, I run OLS using the lm command in R. This spits out an intercept, along with slope coefficients.
Why is this viewed as random? Where does the randomness come from?

Dataset-to-Dataset Variability

The source is no different from what you learned in your first statistics class when investigating the distribution of the sample mean, $\bar{y}$

If I had observed a different dataset, I would have observed different responses.
These different responses would yield a different sample mean, sample slope, ....
Parallel universe interpretation - things could have gone differently in the dataset at my disposal.
In terms of our model: had I observed a different vector $\varepsilon$, I would have observed different responses. These would have generated different slope and intercept coefficients when running our regression

Expectations of Random Vectors

As we'll be dealing with random vectors, let's recall a few definitions and properties. Let $Y, Z\in \mathbf{R}^m$ be a random vector.

\begin{eqnarray*} Z &=& {(Z_1, \ldots Z_m)}^\top \\ \textbf{E}(Z) &=& {(\textbf{E}(Z_1), \ldots \textbf{E}(Z_m))}^\top \end{eqnarray*}

Properties of Expectation

For $A\in \mathbf{R}^{n\times m}$, $b\in \mathbf{R}^n$, $A$ and $b$ constant,

\begin{align*} \textbf{E}(AY + b) &= A\;\textbf{E}(Y) + b\\ \textbf{E}(Y+Z) &= \textbf{E}(Y) + \textbf{E}(Z) \end{align*}

Covariance

For $Z\in \mathbf{R}^m$

\begin{align*} \text{Var}(Z) &= E[(Z-\textbf{E}(Z)){(Z-\textbf{E}(Z))}^\top ]\\ &= \textbf{E}(Z{Z}^\top ) - \textbf{E}(Z){\textbf{E}(Z)}^\top \\ &=\left( \begin{array}{cccc} \text{var}(Z_1) & \cdots & \cdots & \text{cov}(Z_1, Z_m) \\ \text{cov}(Z_2, Z_1) & \text{cov}(Z_2,Z_2) & \cdots & \text{cov}(Z_2, Z_m) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(Z_m, Z_1) & \cdots & \cdots & \text{var}(Z_m) \end{array} \right) \end{align*}

Covariance

If $\{Z_j\}$ are independent, $\text{Var}(Z) = Diag(\text{var}(Z_j))$

Properties of Variance

For $A\in \mathbf{R}^{n\times m}$, $b \in \mathbf{R}^n$, $A$ and $b$ constant,

\begin{align*} \text{Var}(AZ + b) &= A\;\text{Var}(Z){A}^\top \end{align*}