STATS 413: Lecture 11

  1. Regression diagnostics

  2. Assessing linearity

  3. Assessing homoskedasticity

  4. Assessing normality

If the Linear Model Holds...

To date we have derived properties of the ordinary least squares coefficients $\widehat{\boldsymbol\beta}$ under the assumption that a linear model holds. The weaker linear model states:

\[ y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon)=0;\;\; \text{Var}(\varepsilon) = \sigma^2_\varepsilon I_{n\times n}. \]

If this holds, we showed among many things:

  • $\text{E}(\widehat{\boldsymbol\beta}) = \beta$; $\text{Var}(\widehat{\boldsymbol\beta}) = \sigma^2_\varepsilon (X^\top X)^{-1}$; $\text{E}(y - X\widehat{\boldsymbol\beta}) = 0$.

If the Linear Model Holds...

If we further assume that $\varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I_{n\times n})$, we were able to construct:

  • Hypothesis tests/confidence intervals for slope coefficients

  • Hypothesis tests to compare models of differing complexity

  • Confidence intervals for the conditional expectation

  • Prediction intervals for future observations

The Core Assumptions

The stronger linear model has the following core components:

  1. Linearity: At any $x_i$, $\text{E}(y \mid x_i) = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} = x_i^\top\beta$

  2. Homoskedasticity: For any $x_i$, $\text{Var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon$ for all $i$.

  3. Normality: $\varepsilon_i$ are $iid$ normally distributed

The Core Assumptions

Violations of these assumptions can invalidate the methods we've proposed. For instance...

  • $\text{E}(y-X\widehat{\boldsymbol\beta}) \neq 0$

  • Our hypothesis tests might commit a Type I error at a rate different from $\alpha$

  • The coverage of our confidence intervals might differ from $100(1-\alpha)\%$.

  • Our prediction intervals for future observations $y^*$ might not capture future observations at rate $100(1-\alpha)\%$.

Regression Diagnostics

Regression Diagnostics are tools/methods for determining whether a regression model adequately represents the data at hand

  • Do the assumptions of the stronger linear model seem reasonable?

  • Or is there a notable violation of one or more underlying assumptions?

Regression Diagnostics

Diagnostics take two main forms:

  1. Diagnostic Plots – create a plot which can visually reveal a violation of a modeling assumption.

  2. Diagnostic Tests – test the null hypothesis that a given modeling assumption holds. Rejection of the null suggests a violation of the assumption under consideration.

We'll mainly discuss graphical diagnostics, but note that there are often well-established numeric diagnostic tests for these tools.

Devising a Diagnostic Plot

When devising a diagnostic plot, we need to consider

  1. What the plot should display when the assumptions of the linear model hold

  2. What might appear within the plot should an assumption be violated

Issue: can't visualize relationship between $y$ and $x_1,\dots,x_p$ unless $p=1$ (or $p=2$ with appropriate plotting software).

  • We need to consider low-dimensional consequences of the linear model

Properties of Residuals and Fitted Values

Many diagnostics will use the residuals $e$, and/or the fitted values $\hat{y}$. Let $H = X(X^\top X)^{-1}X^\top$ be the hat matrix, and let $h_{ij}$ denote its $i,j$ element.

\begin{align*} e &= (I-H)y\\ \hat{y} &= Hy \end{align*}

Properties of Residuals and Fitted Values

We've previously discussed algebraic properties of $e$ and $\hat{y}$ when running a regression with an intercept. Letting $\widehat{\text{cov}}(\cdot, \cdot)$ be the sample covariance operator:

\begin{align*} \widehat{\text{cov}}(e, x_j) &= 0\;\; (j=1,\dots,p)\\ \widehat{\text{cov}}(e, \hat{y}) &= 0\\\textstyle \bar{e} = \frac{1}{n}\sum_{i=1}^n e_i &= 0 \end{align*}

These properties are purely algebraic, and hold regardless of whether the weaker linear model holds.

Expectation of Residuals and Fitted Values

We'll now derive additional properties of $e$ and $\hat{y}$ under the assumption that the weaker or stronger linear model holds.

  • Properties represent testable implications of these models.

Expectation of Residuals and Fitted Values

Under the weaker linear model

\begin{align*} \text{E}(e) &= \text{E}((I-H)y) = 0\\ \text{E}(\hat{y}) &= \text{E}(Hy) = X\beta \end{align*}

Expectation of Residuals and Fitted Values

Important to keep in mind the distinction between

  • $\text{E}(e_i) = 0$

  • $\bar{e} = 0$

The former is a statement about the expectation for any particular residual. The latter merely says that the sample average of the residuals equals zero in any particular data set, but doesn't describe components $e_i$.

Variance of Residuals

We can also derive the variance of the residual vector.

  • This variance is across data sets generated by the linear model

Variance of Residuals and Fitted Values

Under the weaker linear model

\begin{align*} \text{Var}(e) &= \sigma^2_\varepsilon (I-H)\\ \text{Var}(\hat{y}) &= \sigma^2_\varepsilon H \end{align*}

Variance of Residuals

Keep in mind the distinction between

  • $\text{var}(e_i) = \sigma^2_\varepsilon (1-h_{ii})$

  • $\hat{\sigma}^2_\varepsilon = \sum_{i=1}^ne_i^2/(n-p-1)$

The former describes variability for any component $e_i$ across data sets. The latter is the sample variance for the observed residuals.

Diagonals of Hat Matrix

When performing regression with an intercept and $p$ predictors, the diagonal entries of the hat matrix $H$, $h_{ii}$, satisfy the following:

  • $\frac{1}{n}\leq h_{ii}\leq 1 \;\; (i=1,\dots,n)$

  • $\sum_{i=1}^n h_{ii} = p+1$

Implication:

\[ \text{var}(e_i) = \sigma^2_\varepsilon (1-h_{ii}) \leq \sigma^2_\varepsilon = \text{var}(\varepsilon_i). \]

In words:

  • Residuals don't generally have the same variance, $\text{var}(e_i)\neq \text{var}(e_{j})$ for $i\neq j$

  • Residuals $e_i$ have smaller variance than $\varepsilon_i$

Off-Diagonals of Hat Matrix

For any $i,j = 1,\dots,n$

\begin{align*} \text{cov}(e_i, e_{j}) &= -\sigma^2_\varepsilon h_{ij}\\ \text{cov}(\hat{y}_i, \hat{y}_j) &= \sigma^2_\varepsilon h_{ij} \end{align*}

For regression with an intercept, we also have the bound

\[ 1/n - 1/2 \leq h_{ij} \leq 1/2 \]

Implication: despite the fact that $\text{cov}(\varepsilon_i, \varepsilon_j)=0$ under the weaker linear model, $\text{cov}(e_i, e_{j}) \neq 0$.

Covariance Between Vectors

For $Z\in \mathbb{R}^m$, $Y\in \mathbb{R}^n$ random vectors:

\begin{align*} \text{Cov}(Z,Y) &= \text{E}[(Z-\text{E}(Z))(Y-\text{E}(Y))^\top]\\ &= \text{E}(ZY^\top) - \text{E}(Z)\text{E}(Y)^\top\\ &=\left( \begin{array}{cccc} \text{cov}(Z_1, Y_1) & \cdots & \cdots & \text{cov}(Z_1, Y_n) \\ \text{cov}(Z_2, Y_1) & \text{cov}(Z_2,Y_2) & \cdots & \text{cov}(Z_2, Y_n) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(Z_m, Y_1) & \cdots & \cdots & \text{cov}(Z_m, Y_n) \end{array} \right) \end{align*}

Covariance Between Vectors

$\text{Cov}(Z,Y) \in \mathbb{R}^{m\times n}$; $\text{Cov}(Z,Y) = \text{Cov}(Y,Z)^\top$. $\text{Cov}(Z,Z) = \text{Var}(Z)$.

Properties of Covariance

For $A\in \mathbb{R}^{d\times m}$, $B\in \mathbb{R}^{k\times n}$:

\[ \text{Cov}(AZ, BY) = A\text{Cov}(Z,Y)B^\top \]

Covariance Between Residuals and Fitted Values

Using this definition, we can characterize the covariance between residuals and fitted values under the weaker linear model:

Covariance of Residuals and Fitted Values

Under the weaker linear model

\[ \text{Cov}(e,\hat{y}) = \text{Cov}((I-H)y, Hy) = 0. \]

For any $i,j=1,\dots,n$, $\text{cov}(e_i, \hat{y}_j) = 0$.

Covariance Between Residuals and Fitted Values

This is distinct from the result that the sample covariance between $e$ and $\hat{y}$ equals zero when running regression with an intercept, $\widehat{\text{cov}}(e, \hat{y}) = 0$.

  • $\text{cov}(e_i, \hat{y}_i) = 0$: Property of random variables across data sets under the linear model.

  • $\widehat{\text{cov}}(e, \hat{y}) = 0$: Algebraic property.

The Distribution of Residuals and Fitted Values

Let's now consider the distribution of the residuals and fitted values:

Distribution of Residuals and Fitted Values

Under the stronger linear model,

\[ \left( \begin{array}{c} e \\ \hat{y} \end{array} \right) \sim \text{MVN}\left( \left( \begin{array}{c} 0 \\ X\beta \end{array} \right), \left( \begin{array}{c c} (I-H)\sigma^2_\varepsilon & 0 \\ 0 & H\sigma^2_\varepsilon \end{array} \right) \right) \]

Implications under the stronger linear model:

  • $e_i \sim N(0, (1-h_{ii})\sigma^2_\varepsilon)$.

  • $e$ and $\hat{y}$ are independent of one another (relies crucially on multivariate normality).

Linearity

Linearity, $y = X\beta +\varepsilon$ with $\text{E}(\varepsilon) = 0$, tells us that for each $i$,

\[ \text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = 0, \]

i.e. that the expected value of our sample residual $e_i$ at any point $x_i$ equals zero.

Nonlinearity, on the other hand, could be written as:

\[ y_i = g(x_i) + \varepsilon_i;\;\; \text{E}(\varepsilon_i) = 0, \]

for some nonlinear function $g(\cdot)$. In this case:

\[ \text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = \eta(x_i) \]

for some non-zero function $\eta(\cdot)$.

Residual Plots with Predictors

Expectation of sample residuals:

\[ \text{E}(e_i\mid x_i)=\text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = \eta(x_i) \]

Linearity: $\text{E}(e_i\mid x_i) = 0$ for all $x_i$

  • No pattern in residuals as a function of any of the predictor variables $x_{i1}, x_{i2},\dots,x_{ip}$.

Nonlinearity: $\text{E}(e_i\mid x_i) = \eta(x_i)$ for some nonlinear function $\eta(x_i)$

  • Can't generally visualize $e_i$ as joint function of $x_{i1},\dots,x_{ip}$

  • Can separately visualize relationship between $e_i$ and $x_{i1}$, $e_i$ and $x_{i2}$,... $e_i$ and $x_{ip}$ through $p$ scatterplots (2 dimensional projections).

Residual Plots with Predictors

Hope: high-dimensional nonlinearity reveals itself in lower dimensions

These plots with the residuals on the $y$ axis and predictors on the $x$ axis are examples of residual plots.

Residual Plots

Do you see clear evidence of curvature?

  • Ideal residual plot should look like an amorphous blob. No systematic variation between residuals and $x_i$

  • Are there values of $x_i$ where the residuals are predictably positive or negative?

  • Remember: residuals will always have mean zero (the red line on these plots)

Residual Plots with Predictors

In our mall sales example:

Residuals versus Fitted Values

In addition, we also plot the relationship between residuals and fitted values: $e$ versus $\hat{y}$

  • If I had access to $\beta$, we know that $\text{E}(e_i\mid x_i^\top\beta) = 0$ for all $x_i$.

  • A plot of $e$ against $\text{E}(y) = X\beta$ would reveal no discernable pattern under linearity

Replace $x_i^\top \beta$ by our estimate, $x_i^\top\widehat{\boldsymbol\beta} = \hat{y}_i$

  • Hope: nonlinearity might reveal itself with systematic variation in residuals as a function of predicted values.

Note: of all residual plots targeting nonlinearity, this plot is the most commonly used in practice

Residuals versus Fitted Values

In our mall sales example:

Assessing Homoskedasticity

Homoskedasticity tells us that, for each $x_i$:

\[ \text{var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon \]

We don't get to observe $\varepsilon_i = y_i - x_i^\top\beta$, and instead observe its sample analogue $e_i = y_i - x_i^\top\widehat{\boldsymbol\beta}$. Can we use $e_i$ to assess homoskedasticity?

\[ \text{var}(e_i\mid x_i) = (1-h_{ii})\sigma^2_\varepsilon \]
  • Generally $h_{ii}\neq h_{jj}$ for $i\neq j$.

  • Even under homoskedasticity, our sample residuals would not have constant variance. $\text{var}(e_i\mid x_i)\neq \text{var}(e_j\mid x_j)$.

Standardized Residuals

Fortunately, there's a straightforward fix under the weaker linear model:

\[ \begin{aligned} &\text{var}\left(\frac{{e}_i}{\sqrt{1-h_{ii}}}\mid x_i\right) \\ &\quad=\text{Var}(e_i\mid x_i)/(1-h_{ii}) \\ &\quad=\sigma^2_\varepsilon(1-h_{ii})/(1-h_{ii}) \\ &\quad= \sigma^2_\varepsilon \end{aligned} \]

This motivates the use of an alternative residual, called the standardized residual, for assessing homoskedasticity

Standardized Residuals

Standardized / Internally Studentized Residuals

The standardized residual (also called the internally studentized residual) takes the following form for each observation $i$:

\[ r_i = \frac{e_i}{\hat{\sigma}_\varepsilon \sqrt{1-h_{ii}}} \]

(Division by $\hat{\sigma}_\varepsilon$ removes dependence on scale).

Detecting Heteroskedasticity

Common approaches using standardized residuals include:

  1. Plot $r_i$ against $x_{ij}$ for $j=1,\dots,p$

  2. Plot $r_i$ against $\hat{y}_i$

  3. Plot $\sqrt{|r_i|}$ against $x_{ij}$ for $j=1,\dots,p$

  4. Plot $\sqrt{|r_i|}$ against $\hat{y}_i$

Detecting Heteroskedasticity

With any of these approaches, want to see if the magnitude of the residuals is varying as a function of $x_{ij}$ or $\hat{y}_i$.

  • Ideal: no systematic variation (homoskedasticity)

  • Concern: magnitude of residuals varies systematically (heteroskedasticity)

  • Plotting $|r_i|$ focuses attention on magnitude.

  • Taking $\sqrt{|r_i|}$ removes skewness in plot (distribution of $|r_i|$ is right-skewed, square root makes it less so).

Standardized Residuals

Do you see a "fanning out pattern"? Any change in variability?

$\sqrt{|\text{Standardized Residual}|}$

Do you see a pattern (linear or nonlinear)? Or is it an amorphous blob?

An Example: Heteroskedasticity

Assessing Normality

Our methods for hypothesis tests / confidence intervals / prediction intervals have so far relied crucially on the normality assumption in the stronger linear model:

\[ \varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I) \Leftrightarrow \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2_\varepsilon) \]

Sample residuals $e_i$ are our best guesses for $\varepsilon_i$, so we will use these to assess normality.

$\varepsilon_i$ are identically distributed under the assumptions of the stronger linear model. What about $e_i$?

Limitations of Histograms

A seemingly natural approach might be to use a histogram to assess normality, and look for a "bell-shaped" / normally distributed.

  • Easy to distinguish a normal distribution from a distribution which is left or right skewed using a histogram

  • Much harder to distinguish a normal distribution from a bell-shaped distribution with heavier tails

Normal or $t$?

From these histograms, can you tell which is a Normal distribution, and which is a $t_{40}$?

Normal Quantile Plot

Also called a Q-Q Plot

qqnorm(scores)
qqline(scores)

How to Read a Normal Q-Q Plot

  • Vertical axis: sorted variable values

  • Horizontal axis: "theoretical normal quantiles"

  • "Under the hood":

    • The idea is that for example the $10^{th}$ smallest value of a variable with 252 cases is roughly an estimate of the $(10/252)\times 100\%$ quantile; hence plot the sorted values against the corresponding quantiles from a normal distribution.

How to Read a Normal Q-Q Plot

  • The normal curve can be used to calculate 'theoretical quantiles', which are plotted on the $x$ axis

    • If the data were normally distributed, then all I need to know are the mean and the standard deviation, and I can calculate all of the quantiles

    • So, take the variable in question, compute its mean and standard deviation, and then compute what the quantiles should be if the variable was normally distributed with that mean/sd.

The Normal $QQ$-Plot

To create a Normal $QQ$-Plot:

  1. Sort the standardized residuals $r_{[1]} \le r_{[2]} \cdots r_{[n]}$

  2. Compute $u_i = \Phi^{-1} \left( \frac{i}{n+1} \right)$. [Division by $n+1$ accounts for discreteness in data.]

  3. Plot $r_{[i]}$ against $u_i$.

Note: $\Phi^{-1}(\cdot) = \mathtt{qnorm(\cdot)}$. Takes as input a probability, gives as output the corresponding quantile.

The Normal $QQ$-Plot

We again use standardized residuals $r_i = e_i/(\hat{\sigma}_\varepsilon\sqrt{1-h_{ii}})$ instead of $e_i$

  • $e_i$ aren't identically distributed (different variances), but $r_i$ are.

  • $QQ$-plots are made for identically distributed variables.

How to Use a Normal Q-Q Plot

Interpreting the output:

  • Note: x-axis is in units of standard deviations above/below the mean, rather than units of the original variable.

  • Use: If the points don't deviate from the diagonal line much, then the variable is "approximately normally distributed."

  • Idea: The straight line tells where the sorted values of the variable should fall approximately IF they are normally distributed.

Normality in our Example

Make a normal quantile plot of the standardized residuals:

qqnorm(sresid.both)
qqline(sresid.both)

Do you see non-normality?

  • If yes, normality assumption of multiple regression model doesn't hold.

  • If no, nothing to refute normality

Reminder: Normality reflected by adherence to the diagonal line in the quantile plot, non normality shows a pattern.

Detecting Non-Normalities

How non-normalities show up in normal quantile plots:

  • Right-Skewness: frequent in finance/econ. This causes convex (cup-shaped) curvature in normal quantile plots

  • Left-skewed – concave curvature.

  • Outlying observations cause points to be too high on the right or too low on the left

  • Multi-modality (rare) causes snaking of the normal quantile plot

Let's show a few examples

  • What is the nature of the non-normality?

Right-Skewed

Multi-Modal

Heavy Tails