Regression diagnostics
Assessing linearity
Assessing homoskedasticity
Assessing normality
To date we have derived properties of the ordinary least squares coefficients $\widehat{\boldsymbol\beta}$ under the assumption that a linear model holds. The weaker linear model states:
\[ y = X\beta + \varepsilon;\;\; \text{E}(\varepsilon)=0;\;\; \text{Var}(\varepsilon) = \sigma^2_\varepsilon I_{n\times n}. \]If this holds, we showed among many things:
$\text{E}(\widehat{\boldsymbol\beta}) = \beta$; $\text{Var}(\widehat{\boldsymbol\beta}) = \sigma^2_\varepsilon (X^\top X)^{-1}$; $\text{E}(y - X\widehat{\boldsymbol\beta}) = 0$.
If we further assume that $\varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I_{n\times n})$, we were able to construct:
Hypothesis tests/confidence intervals for slope coefficients
Hypothesis tests to compare models of differing complexity
Confidence intervals for the conditional expectation
Prediction intervals for future observations
The stronger linear model has the following core components:
Linearity: At any $x_i$, $\text{E}(y \mid x_i) = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} = x_i^\top\beta$
Homoskedasticity: For any $x_i$, $\text{Var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon$ for all $i$.
Normality: $\varepsilon_i$ are $iid$ normally distributed
Violations of these assumptions can invalidate the methods we've proposed. For instance...
$\text{E}(y-X\widehat{\boldsymbol\beta}) \neq 0$
Our hypothesis tests might commit a Type I error at a rate different from $\alpha$
The coverage of our confidence intervals might differ from $100(1-\alpha)\%$.
Our prediction intervals for future observations $y^*$ might not capture future observations at rate $100(1-\alpha)\%$.
Regression Diagnostics are tools/methods for determining whether a regression model adequately represents the data at hand
Do the assumptions of the stronger linear model seem reasonable?
Or is there a notable violation of one or more underlying assumptions?
Diagnostics take two main forms:
Diagnostic Plots – create a plot which can visually reveal a violation of a modeling assumption.
Diagnostic Tests – test the null hypothesis that a given modeling assumption holds. Rejection of the null suggests a violation of the assumption under consideration.
We'll mainly discuss graphical diagnostics, but note that there are often well-established numeric diagnostic tests for these tools.
When devising a diagnostic plot, we need to consider
What the plot should display when the assumptions of the linear model hold
What might appear within the plot should an assumption be violated
Issue: can't visualize relationship between $y$ and $x_1,\dots,x_p$ unless $p=1$ (or $p=2$ with appropriate plotting software).
We need to consider low-dimensional consequences of the linear model
Many diagnostics will use the residuals $e$, and/or the fitted values $\hat{y}$. Let $H = X(X^\top X)^{-1}X^\top$ be the hat matrix, and let $h_{ij}$ denote its $i,j$ element.
\begin{align*} e &= (I-H)y\\ \hat{y} &= Hy \end{align*}We've previously discussed algebraic properties of $e$ and $\hat{y}$ when running a regression with an intercept. Letting $\widehat{\text{cov}}(\cdot, \cdot)$ be the sample covariance operator:
\begin{align*} \widehat{\text{cov}}(e, x_j) &= 0\;\; (j=1,\dots,p)\\ \widehat{\text{cov}}(e, \hat{y}) &= 0\\\textstyle \bar{e} = \frac{1}{n}\sum_{i=1}^n e_i &= 0 \end{align*}These properties are purely algebraic, and hold regardless of whether the weaker linear model holds.
We'll now derive additional properties of $e$ and $\hat{y}$ under the assumption that the weaker or stronger linear model holds.
Properties represent testable implications of these models.
Expectation of Residuals and Fitted Values
Under the weaker linear model
\begin{align*} \text{E}(e) &= \text{E}((I-H)y) = 0\\ \text{E}(\hat{y}) &= \text{E}(Hy) = X\beta \end{align*}Important to keep in mind the distinction between
$\text{E}(e_i) = 0$
$\bar{e} = 0$
The former is a statement about the expectation for any particular residual. The latter merely says that the sample average of the residuals equals zero in any particular data set, but doesn't describe components $e_i$.
We can also derive the variance of the residual vector.
This variance is across data sets generated by the linear model
Variance of Residuals and Fitted Values
Under the weaker linear model
\begin{align*} \text{Var}(e) &= \sigma^2_\varepsilon (I-H)\\ \text{Var}(\hat{y}) &= \sigma^2_\varepsilon H \end{align*}Keep in mind the distinction between
$\text{var}(e_i) = \sigma^2_\varepsilon (1-h_{ii})$
$\hat{\sigma}^2_\varepsilon = \sum_{i=1}^ne_i^2/(n-p-1)$
The former describes variability for any component $e_i$ across data sets. The latter is the sample variance for the observed residuals.
When performing regression with an intercept and $p$ predictors, the diagonal entries of the hat matrix $H$, $h_{ii}$, satisfy the following:
$\frac{1}{n}\leq h_{ii}\leq 1 \;\; (i=1,\dots,n)$
$\sum_{i=1}^n h_{ii} = p+1$
Implication:
\[ \text{var}(e_i) = \sigma^2_\varepsilon (1-h_{ii}) \leq \sigma^2_\varepsilon = \text{var}(\varepsilon_i). \]In words:
Residuals don't generally have the same variance, $\text{var}(e_i)\neq \text{var}(e_{j})$ for $i\neq j$
Residuals $e_i$ have smaller variance than $\varepsilon_i$
For any $i,j = 1,\dots,n$
\begin{align*} \text{cov}(e_i, e_{j}) &= -\sigma^2_\varepsilon h_{ij}\\ \text{cov}(\hat{y}_i, \hat{y}_j) &= \sigma^2_\varepsilon h_{ij} \end{align*}For regression with an intercept, we also have the bound
\[ 1/n - 1/2 \leq h_{ij} \leq 1/2 \]Implication: despite the fact that $\text{cov}(\varepsilon_i, \varepsilon_j)=0$ under the weaker linear model, $\text{cov}(e_i, e_{j}) \neq 0$.
For $Z\in \mathbb{R}^m$, $Y\in \mathbb{R}^n$ random vectors:
\begin{align*} \text{Cov}(Z,Y) &= \text{E}[(Z-\text{E}(Z))(Y-\text{E}(Y))^\top]\\ &= \text{E}(ZY^\top) - \text{E}(Z)\text{E}(Y)^\top\\ &=\left( \begin{array}{cccc} \text{cov}(Z_1, Y_1) & \cdots & \cdots & \text{cov}(Z_1, Y_n) \\ \text{cov}(Z_2, Y_1) & \text{cov}(Z_2,Y_2) & \cdots & \text{cov}(Z_2, Y_n) \\ \vdots & \ddots & \ddots & \vdots \\ \text{cov}(Z_m, Y_1) & \cdots & \cdots & \text{cov}(Z_m, Y_n) \end{array} \right) \end{align*}$\text{Cov}(Z,Y) \in \mathbb{R}^{m\times n}$; $\text{Cov}(Z,Y) = \text{Cov}(Y,Z)^\top$. $\text{Cov}(Z,Z) = \text{Var}(Z)$.
Properties of Covariance
For $A\in \mathbb{R}^{d\times m}$, $B\in \mathbb{R}^{k\times n}$:
\[ \text{Cov}(AZ, BY) = A\text{Cov}(Z,Y)B^\top \]Using this definition, we can characterize the covariance between residuals and fitted values under the weaker linear model:
Covariance of Residuals and Fitted Values
Under the weaker linear model
\[ \text{Cov}(e,\hat{y}) = \text{Cov}((I-H)y, Hy) = 0. \]For any $i,j=1,\dots,n$, $\text{cov}(e_i, \hat{y}_j) = 0$.
This is distinct from the result that the sample covariance between $e$ and $\hat{y}$ equals zero when running regression with an intercept, $\widehat{\text{cov}}(e, \hat{y}) = 0$.
$\text{cov}(e_i, \hat{y}_i) = 0$: Property of random variables across data sets under the linear model.
$\widehat{\text{cov}}(e, \hat{y}) = 0$: Algebraic property.
Let's now consider the distribution of the residuals and fitted values:
Distribution of Residuals and Fitted Values
Under the stronger linear model,
\[ \left( \begin{array}{c} e \\ \hat{y} \end{array} \right) \sim \text{MVN}\left( \left( \begin{array}{c} 0 \\ X\beta \end{array} \right), \left( \begin{array}{c c} (I-H)\sigma^2_\varepsilon & 0 \\ 0 & H\sigma^2_\varepsilon \end{array} \right) \right) \]Implications under the stronger linear model:
$e_i \sim N(0, (1-h_{ii})\sigma^2_\varepsilon)$.
$e$ and $\hat{y}$ are independent of one another (relies crucially on multivariate normality).
Linearity, $y = X\beta +\varepsilon$ with $\text{E}(\varepsilon) = 0$, tells us that for each $i$,
\[ \text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = 0, \]i.e. that the expected value of our sample residual $e_i$ at any point $x_i$ equals zero.
Nonlinearity, on the other hand, could be written as:
\[ y_i = g(x_i) + \varepsilon_i;\;\; \text{E}(\varepsilon_i) = 0, \]for some nonlinear function $g(\cdot)$. In this case:
\[ \text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = \eta(x_i) \]for some non-zero function $\eta(\cdot)$.
Expectation of sample residuals:
\[ \text{E}(e_i\mid x_i)=\text{E}(y_i - x_i^\top\widehat{\boldsymbol\beta}\mid x_i) = \eta(x_i) \]Linearity: $\text{E}(e_i\mid x_i) = 0$ for all $x_i$
No pattern in residuals as a function of any of the predictor variables $x_{i1}, x_{i2},\dots,x_{ip}$.
Nonlinearity: $\text{E}(e_i\mid x_i) = \eta(x_i)$ for some nonlinear function $\eta(x_i)$
Can't generally visualize $e_i$ as joint function of $x_{i1},\dots,x_{ip}$
Can separately visualize relationship between $e_i$ and $x_{i1}$, $e_i$ and $x_{i2}$,... $e_i$ and $x_{ip}$ through $p$ scatterplots (2 dimensional projections).
Hope: high-dimensional nonlinearity reveals itself in lower dimensions
These plots with the residuals on the $y$ axis and predictors on the $x$ axis are examples of residual plots.
Do you see clear evidence of curvature?
Ideal residual plot should look like an amorphous blob. No systematic variation between residuals and $x_i$
Are there values of $x_i$ where the residuals are predictably positive or negative?
Remember: residuals will always have mean zero (the red line on these plots)
In our mall sales example:
In addition, we also plot the relationship between residuals and fitted values: $e$ versus $\hat{y}$
If I had access to $\beta$, we know that $\text{E}(e_i\mid x_i^\top\beta) = 0$ for all $x_i$.
A plot of $e$ against $\text{E}(y) = X\beta$ would reveal no discernable pattern under linearity
Replace $x_i^\top \beta$ by our estimate, $x_i^\top\widehat{\boldsymbol\beta} = \hat{y}_i$
Hope: nonlinearity might reveal itself with systematic variation in residuals as a function of predicted values.
Note: of all residual plots targeting nonlinearity, this plot is the most commonly used in practice
In our mall sales example:
Homoskedasticity tells us that, for each $x_i$:
\[ \text{var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon \]We don't get to observe $\varepsilon_i = y_i - x_i^\top\beta$, and instead observe its sample analogue $e_i = y_i - x_i^\top\widehat{\boldsymbol\beta}$. Can we use $e_i$ to assess homoskedasticity?
\[ \text{var}(e_i\mid x_i) = (1-h_{ii})\sigma^2_\varepsilon \]Generally $h_{ii}\neq h_{jj}$ for $i\neq j$.
Even under homoskedasticity, our sample residuals would not have constant variance. $\text{var}(e_i\mid x_i)\neq \text{var}(e_j\mid x_j)$.
Fortunately, there's a straightforward fix under the weaker linear model:
\[ \begin{aligned} &\text{var}\left(\frac{{e}_i}{\sqrt{1-h_{ii}}}\mid x_i\right) \\ &\quad=\text{Var}(e_i\mid x_i)/(1-h_{ii}) \\ &\quad=\sigma^2_\varepsilon(1-h_{ii})/(1-h_{ii}) \\ &\quad= \sigma^2_\varepsilon \end{aligned} \]This motivates the use of an alternative residual, called the standardized residual, for assessing homoskedasticity
Standardized / Internally Studentized Residuals
The standardized residual (also called the internally studentized residual) takes the following form for each observation $i$:
\[ r_i = \frac{e_i}{\hat{\sigma}_\varepsilon \sqrt{1-h_{ii}}} \](Division by $\hat{\sigma}_\varepsilon$ removes dependence on scale).
Common approaches using standardized residuals include:
Plot $r_i$ against $x_{ij}$ for $j=1,\dots,p$
Plot $r_i$ against $\hat{y}_i$
Plot $\sqrt{|r_i|}$ against $x_{ij}$ for $j=1,\dots,p$
Plot $\sqrt{|r_i|}$ against $\hat{y}_i$
With any of these approaches, want to see if the magnitude of the residuals is varying as a function of $x_{ij}$ or $\hat{y}_i$.
Ideal: no systematic variation (homoskedasticity)
Concern: magnitude of residuals varies systematically (heteroskedasticity)
Plotting $|r_i|$ focuses attention on magnitude.
Taking $\sqrt{|r_i|}$ removes skewness in plot (distribution of $|r_i|$ is right-skewed, square root makes it less so).
Do you see a "fanning out pattern"? Any change in variability?
Do you see a pattern (linear or nonlinear)? Or is it an amorphous blob?
Our methods for hypothesis tests / confidence intervals / prediction intervals have so far relied crucially on the normality assumption in the stronger linear model:
\[ \varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I) \Leftrightarrow \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2_\varepsilon) \]Sample residuals $e_i$ are our best guesses for $\varepsilon_i$, so we will use these to assess normality.
$\varepsilon_i$ are identically distributed under the assumptions of the stronger linear model. What about $e_i$?
A seemingly natural approach might be to use a histogram to assess normality, and look for a "bell-shaped" / normally distributed.
Easy to distinguish a normal distribution from a distribution which is left or right skewed using a histogram
Much harder to distinguish a normal distribution from a bell-shaped distribution with heavier tails
From these histograms, can you tell which is a Normal distribution, and which is a $t_{40}$?
Also called a Q-Q Plot
qqnorm(scores)
qqline(scores)
Vertical axis: sorted variable values
Horizontal axis: "theoretical normal quantiles"
"Under the hood":
The idea is that for example the $10^{th}$ smallest value of a variable with 252 cases is roughly an estimate of the $(10/252)\times 100\%$ quantile; hence plot the sorted values against the corresponding quantiles from a normal distribution.
The normal curve can be used to calculate 'theoretical quantiles', which are plotted on the $x$ axis
If the data were normally distributed, then all I need to know are the mean and the standard deviation, and I can calculate all of the quantiles
So, take the variable in question, compute its mean and standard deviation, and then compute what the quantiles should be if the variable was normally distributed with that mean/sd.
To create a Normal $QQ$-Plot:
Sort the standardized residuals $r_{[1]} \le r_{[2]} \cdots r_{[n]}$
Compute $u_i = \Phi^{-1} \left( \frac{i}{n+1} \right)$. [Division by $n+1$ accounts for discreteness in data.]
Plot $r_{[i]}$ against $u_i$.
Note: $\Phi^{-1}(\cdot) = \mathtt{qnorm(\cdot)}$. Takes as input a probability, gives as output the corresponding quantile.
We again use standardized residuals $r_i = e_i/(\hat{\sigma}_\varepsilon\sqrt{1-h_{ii}})$ instead of $e_i$
$e_i$ aren't identically distributed (different variances), but $r_i$ are.
$QQ$-plots are made for identically distributed variables.
Interpreting the output:
Note: x-axis is in units of standard deviations above/below the mean, rather than units of the original variable.
Use: If the points don't deviate from the diagonal line much, then the variable is "approximately normally distributed."
Idea: The straight line tells where the sorted values of the variable should fall approximately IF they are normally distributed.
Make a normal quantile plot of the standardized residuals:
qqnorm(sresid.both)
qqline(sresid.both)
Do you see non-normality?
If yes, normality assumption of multiple regression model doesn't hold.
If no, nothing to refute normality
Reminder: Normality reflected by adherence to the diagonal line in the quantile plot, non normality shows a pattern.
How non-normalities show up in normal quantile plots:
Right-Skewness: frequent in finance/econ. This causes convex (cup-shaped) curvature in normal quantile plots
Left-skewed – concave curvature.
Outlying observations cause points to be too high on the right or too low on the left
Multi-modality (rare) causes snaking of the normal quantile plot
Let's show a few examples
What is the nature of the non-normality?