Estimation for linear models
Interpreting linear models
Statistical Properties of $\hat{\beta}$
The model is:
\[ y_i = \beta_0 + \beta_1 x_{i,1} + ... + \beta_px_{i,p} + \varepsilon_i,~~i=1, \dots, n. \]Goal: given $({x}_i, y_i),~i=1,\ldots,n$, estimate $\beta_0, \beta_1,...,\beta_p$
We adopt the same least squares criterion
\[\textstyle \min_{\beta_0, \beta_1,..,\beta_p}\sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i,1}+...+\beta_px_{i,p}))^2 \]For any choice of $\{\beta_j\}$, what would the sum of squared errors from the response surface be? Choose estimators to minimize this error.
We estimate the parameters using least squares, i.e. calculate
\[\textstyle \mbox{arg min}_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}))^2 \]Let $X$ be the design matrix and $y$ be the response vector:
\[ y = \left( \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right), ~~X = \left ( \begin{array}{cccc} 1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & x_{i,j} & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{array} \right). \]Further let $\beta$ be the (vector of) regression coefficients/parameters and $\varepsilon$ be the error vector
\[ \beta = \left( \begin{array}{c} \beta_0 \\ \vdots \\ \beta_p \end{array} \right), ~~\varepsilon = \left( \begin{array}{c} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{array} \right) \]Then we can write the model in matrix notation:
\[ {\color{blue}y}_{n \times 1} {\color{blue}=} {\color{blue}X}_{n \times (p+1)} {\color{blue}\beta}_{(p+1) \times 1} {\color{blue}+} {\color{blue}\varepsilon}_{n \times 1}. \]where:
$\textbf{E}(y) = X\beta \Leftrightarrow \textbf{E}(\varepsilon) = 0$
$\textbf{E}(y) = (\textbf{E}(y_1), \textbf{E}(y_2),...,\textbf{E}(y_n))^\top $; $\textbf{E}(\varepsilon) = (\textbf{E}(\varepsilon_1),....,\textbf{E}(\varepsilon_n))^\top $.
$\text{Var}(y) = \text{Var}(\varepsilon) = \sigma^2_{\varepsilon} {\color{blue}I}_{n\times n}$
$I$ is the identity matrix (1s on the diagonal, zeroes on the off-diagonal)
In general, if $Z \in \mathbf{R}^n$ is a random vector, then $\text{Var}(Z)$ is an $n\times n$ matrix with $\text{cov}(Z_i, Z_j)$ in the $ij$ entry.
Observe $y$ and $X$. How do we estimate $\beta$?
Answer: Choose the $\beta$ that minimize the sum of squared errors
Least squares criterion, translated to matrix form:
\[ \begin{aligned}\textstyle \min_{\beta} \sum_{i=1}^n \varepsilon_i^2 &= \textstyle \min_{\beta} \varepsilon^\top \varepsilon \\ &= \textstyle \min_{\beta} (y-X\beta)^\top (y-X\beta) \\ &= \textstyle \min_{\beta} y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta \end{aligned} \]We 1st differentiate the criterion with respect to $\beta$. Doing so, we attain
\[\textstyle \frac{\partial}{\partial \beta}(y-X\beta)^\top (y-X\beta) = -2X^\top y + 2(X^\top X)\beta \]To find the minimizer, we now set the derivative equal to zero (our objective function is convex). This yields the normal equation that our estimator must satisfy:
\[ X^\top X \hat{\beta} = X^\top y \]"Hat" notation: used for estimates
We can isolate $\hat{\beta}$ on the left-hand side, providing a unique solution, so long as $X^\top X$ is invertible $\Leftrightarrow \text{rank}(X) = p+1$.
When can this fail?
$p+1 > n$ (more predictor variables than we have observations). Common in genome-wide association studies, for instance.
Collinearity: a column of $X$ can be expressed as a linear combination of the others
If $X^\top X$ is invertible, unique solution:
\[ \hat{\beta}_{(p+1)\times 1} = (X^\top X)^{-1}X^\top y \]first entry of $\hat{\beta}$ is the estimated intercept
remaining $p$ entries are the estimated slopes for each predictor variable
These are the coefficients reported by R when using the lm command.
Fitted values for the $n$ observations in our data set
\[ \begin{aligned} \hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} = x_i^\top \hat{\beta}\;\; (i=1,...,n)\\ \hat{y} &= X\hat{\beta} \end{aligned} \]Fitted model (for other values of ${x})$
\[ \hat{f}({x}) = \hat {\beta}_0 + \hat{\beta}_1 x_{1} + \cdots + \hat{\beta}_p x_{p} = x^\top \hat{\beta} \]Residuals:
\[ \begin{aligned} e_i &= y_i - \hat{y}_i = y_i - x_i^\top \hat{\beta}\;\; (i=1,...,n)\\ e &= y -\hat{y} = y - X\hat{\beta} \end{aligned} \]Note that the residuals $e_i$ are different from the error terms $\varepsilon_i$ in our statement of the regression model. Why?
We introduce an additional quantity called the "Hat" matrix, which will be useful in what follows.
$\hat{y} = X\hat{\beta} = X\left({X}^\top X\right)^{-1}{X}^\top y = Hy$, where $$ {\color{blue}H = X\left({X}^\top X\right)^{-1}{X}^\top } $$ (Why the "hat" matrix? It puts the "hats" on $y$!)
Fitted values: $\hat{y} = Hy$
Residuals: $e = y - \hat{y} = (I-H)y$, where $I$ is the $n\times n$ identity matrix
In general, a matrix $P$ is an orthogonal projection matrix if:
${P}^\top = P$ ($P$ is symmetric).
$PP = P$ ($P$ is idempotent).
Verify: $H = X\left({X}^\top X\right)^{-1}{X}^\top $ satisfies these two conditions
$H$ is an orthogonal projection matrix onto the columnspace of $X$. It projects the responses $y$ onto the column space of $X$, returning $\hat{y} = Hy = X\hat{\beta}$.
The column space of $X$ is the set of vectors $a$ which may be expressed as $a=Xb$ for some $b$.
$\hat{y} = Hy = X\hat{\beta}$ is the vector that is closest (in terms of Euclidean distance) to $y$, subject to residing in the column space of $X$
$\min_{\beta} {(y-X\beta)}^\top (y-X\beta)=\|y-X\beta\|_2^2$ can be interpreted as minimizing the Euclidean distance between $y$ and the linear space spanned by the columns of $X$ (the column space of $X$).
The following orthogonality holds between the residual vector $e$ and elements of the column space of $X$ (also referred to in regression as "predictor space"):
Orthogonality of Residuals and Predictors
For any vector $a \in \mathbf{R}^n$ in column space of $X$ (i.e. any vector $a\in \mathbf{R}^n$ which can be written as $a = Xb$ for some vector $b\in \mathbf{R}^{p+1}$)
\begin{align*} {a}^\top e = 0 \end{align*}Intuition: residuals are what's "left over" after running our regression, that which cannot be predicted or explained by a linear function of our predictor variables
Recall that under the weaker linear model, the mean function takes the form
\[ \textbf{E}(y_i \mid x_i) = \beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p} = x_i^\top \beta \]Do the parameters $\beta_0,\beta_1,....,\beta_p$ have natural interpretations?
Ideally, we'd like to relate scientific hypotheses about relationships between variables to numerical values for parameters in the model
Intercept ($\beta_0$)
$\beta_0$ is the true value for $\textbf{E}(y_i\mid {x}_i)$ when $x_{i,1} = x_{i,2}=...=x_{i,p} = 0$; $\hat{\beta}_0$ is its estimate.
We don't often concern ourselves with the intercept. Oftentimes the point $x_{i,1}=x_{i,2}=\dots =x_{i,p}=0$ is an extrapolation (for instance, if all the covariates take on positive values).
Generally, we simply think of the intercept as providing the proper vertical placement for the mean function $\textbf{E}(y_i \mid x_i)$.
Slope coefficients $(\beta_1,...,\beta_p)$ provide comparisons between the expectations at different points $x$
Interpretatation of $\beta_j$: Ceteris Paribus
Holding all other variables equal (ceteris paribus), two individuals who differ in variable $x_j$ by 1 unit are expected to differ in $Y$ by $\beta_j$ units.
That is, $\beta_j$ is the difference in expected responses for two individuals who differ in covariate $x_j$ by 1 unit, but have the same values for all of the other predictor variables $\{1,...,p\}\setminus \{j\}$ (i.e., all else equal).
$\hat{\beta}_j$ is our estimate of this difference in expectations.
Notice the very specific verbiage: All else equal, two individuals who differ in $x_j$ by one unit are expected to differ in $y$ by $\beta_j$ units...
In algebra, we said that all else equal, a change in $x_j$ by 1 unit results in a change in $y$ by $\beta_j$.
That is ascribing a causal interpretation to the slope
We usually don't get to change the values of $x_j$ for a particular individual (unless we're running an experiment); rather, we observe different values for $x_j$ for different individuals, along with the differences in their responses.
Two individuals who differ in $x$ by one unit might differ based on other omitted variables.
Correlation is not causation.
> \text{cor}(sales, income)
[1] 0.6484524
(Intercept) income
120.458 6.081
> \text{cor}(sales, competitors)
[1] -0.05364514
(Intercept) competitors
517.434 -3.595
Danger in drawing causal conclusions from slope coefficients: Just because the slope is near zero doesn't mean there isn't an effect.
Our interpretation of the slope says nothing about what will happen when I change $x$. Rather, it's an observational comparison between two predetermined values for $x$.
We have to worry about omitted variables
Using a hypothesis test we'll derive in a few lectures, the slope coefficient on competitors in
the simple regression of sales on competitors is statistically insignificant.
Can we think of an omitted variable that also is predictive of sales?
Let's compare the slopes on competitors in the simple and multiple regressions
Simple Regression
(Intercept) competitors
517.434 -3.595
For two stores where one location has one more competitor than the other, our estimate of the expected difference in sales per square foot is -$3.60.
Multiple Regression
(Intercept) competitors income
91.422 -25.135 7.495
For two stores with the same median income of consumers but where one location has one more competitor than the other, our estimate of the expected difference in sales per square foot is -$25.14.
With the interpretation of $\hat{\beta}$ established, we'll next investigate statistical properties of $\hat{\beta}$ when viewed as an estimator of ${\beta}$
Is it unbiased?
Can we characterize the variability of $\hat{\beta}$?
Why should we prefer $\hat{\beta}$, found by minimizing the sum of square error, over some other estimator of $\beta$ we might concoct?
First, let's remind ourselves of the source of statistical uncertainty in the first place:
I only have one data set in front of me. Using that one data set, I run OLS using the lm
command in R. This spits out an intercept, along with slope coefficients.
Why is this viewed as random? Where does the randomness come from?
The source is no different from what you learned in your first statistics class when investigating the distribution of the sample mean, $\bar{y}$
If I had observed a different dataset, I would have observed different responses.
These different responses would yield a different sample mean, sample slope, ....
Parallel universe interpretation - things could have gone differently in the dataset at my disposal.
In terms of our model: had I observed a different vector $\varepsilon$, I would have observed different responses. These would have generated different slope and intercept coefficients when running our regression
As we'll be dealing with random vectors, let's recall a few definitions and properties. Let $Y, Z\in \mathbf{R}^m$ be a random vector.
\begin{eqnarray*} Z &=& {(Z_1, \ldots Z_m)}^\top \\ \textbf{E}(Z) &=& {(\textbf{E}(Z_1), \ldots \textbf{E}(Z_m))}^\top \end{eqnarray*}Properties of Expectation
For $A\in \mathbf{R}^{n\times m}$, $b\in \mathbf{R}^n$, $A$ and $b$ constant,
\begin{align*} \textbf{E}(AY + b) &= A\;\textbf{E}(Y) + b\\ \textbf{E}(Y+Z) &= \textbf{E}(Y) + \textbf{E}(Z) \end{align*}For $Z\in \mathbf{R}^m$
\begin{align*} \text{Var}(Z) &= E[(Z-\textbf{E}(Z)){(Z-\textbf{E}(Z))}^\top ]\\ &= \textbf{E}(Z{Z}^\top ) - \textbf{E}(Z){\textbf{E}(Z)}^\top \\ &=\left( \begin{array}{cccc} \text{var}(Z_1) & \cdots & \cdots & \text{cov}(Z_1, Z_m) \\ \text{cov}(Z_2, Z_1) & \text{cov}(Z_2,Z_2) & \cdots & \text{cov}(Z_2, Z_m) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(Z_m, Z_1) & \cdots & \cdots & \text{var}(Z_m) \end{array} \right) \end{align*}If $\{Z_j\}$ are independent, $\text{Var}(Z) = Diag(\text{var}(Z_j))$
Properties of Variance
For $A\in \mathbf{R}^{n\times m}$, $b \in \mathbf{R}^n$, $A$ and $b$ constant,
\begin{align*} \text{Var}(AZ + b) &= A\;\text{Var}(Z){A}^\top \end{align*}