STATS 413: Lecture 12

Influential observations
Leverage scores
Testing for outliers
Cook's distance

Recap: Residuals and the Stronger Linear Model

Suppose the stronger linear model holds,

\begin{align*} y = X\beta + \varepsilon;\;\; \varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I_{n\times n}), \end{align*}

What's $\text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\hat{\beta} \mid x_i)$?
What's $\text{var}(e_i\mid x_i)$?
What was the distribution of $e_i$?
Is $e_i$ independent of $e_j$? Are $e_1,...,e_n$ identically distributed?
What were standardized residuals, $r_i$?

Recap: Diagnostics for the Stronger Linear Model

The stronger linear model has the following core components:

Linearity: At any $x_i$, $\text{E}(y \mid x_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip} = x_i^\top\beta$
Homoskedasticity: For any $x_i$, $\text{var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon$ for all $i$.
Normality: $\varepsilon_i$ are IID normally distributed

Recap: Diagnostics for the Stronger Linear Model

We discussed the use of diagnostic plots to assess each of these assumptions.

What plot(s) do we use to assess linearity? What are we hoping to see in these plots?
What plot(s) do we use to assess homoskedasticity? What are we hoping to see in these plots?
What plot(s) do we use to assess normality? What are we hoping to see in these plots?

Recap: Importance of the Assumptions

Consider our test statistic for testing the null $H_0: \beta_j = c$ with a two-sided alternative

\begin{align*} t_{stat} &= \frac{\hat{\beta}_j - c}{\text{se}(\hat{\beta}_j)}\\ p_{val} &= P(|T|\geq |t_{stat}|), \end{align*}

where $T$ follows a $t$ distribution with $n-p-1$ degrees of freedom, and $\text{se}(\hat{\beta}_j) = \hat{\sigma}_\varepsilon\big\{(X^\top X)^{-1}_{(j+1),(j+1)}\big\}^{1/2}$

Where does linearity play a role?
Where does homoskedasticity play a role?
Where does normality play a role?

Today: Finding Unusual Data Points

In addition to our suite of regression diagnostics assessing linearity, homoskedasticity, and normality, we'll introduce additional diagnostics to help us flag potentially unusual / important observations in our data set

Outliers - observations do not behave like the bulk of our data points
- Seem to represent a departure from the trends observed in the majority of our observations.

Today: Finding Unusual Data Points

Influential points - observations which unduly/substantially affect the prediction equation.
- Not all observations have the same impact on the OLS solution $\hat{\beta}$.
- What makes a point matter more or less?

A data point could be none, one, or both of these.

Outliers

Informally, an outlier is an observation that differs substantially from the other points in the data set.

Might be due to misrecording or typo when inputing the data, or due to a failed experiment
Might also represent a rare, but possible, outcome in our population of interest (might be an exciting discovery!)
Could represent that something has gone awry (for instance, credit card fraud in the distribution of one's monthly expenditures).

Outliers

Informally, an outlier is an observation that differs substantially from the other points in the data set.

Might be due to lurking variables (for instance, if a professional athlete is included in an investigation of exercise habits).
Including an additional predictor variable can also reveal outliers that weren't apparent before.

Potential Outliers and Lurking Variables

Flagging Outliers Numerically

When conducting a linear regression analysis, can we think of a numerical way to flag potential outliers?

First thought: look for observations with extremely large or small residuals, $e_i = y_i - x_i^\top\hat{\beta}$

Hope: individuals who deviate substantially from the majority of the data should have residuals that "stand out"
We could simply assess the distribution of the residuals, look for observations which are extreme.
Does this strategy work?

Residuals and Outliers

The blue lines are the least squares estimates of the line of best fit for these two data sets...

Self-Influence / Leverage

Individual observations vary in the influence they have over the fitted values, and over the estimated regression line.

From the previous slide, we see evidence that whether a not a data point's predictor value is "extreme" plays a key role here

We define the leverage, or self-influence, of the $i$th observation (with covariate $x_i$) as

\begin{align*} \text{Leverage}_i &= h_{i,i}, \end{align*}

where $h_{i,i}$ is the $i$th diagonal entry of the hat matrix,

$$H = X(X^\top X)^{-1}X^\top.$$

Self-Influence / Leverage

\begin{align*} h_{i,i} &= x_i^\top(X^\top X)^{-1}x_i \end{align*}

Leverage thus depends not only on $x_i$, but on the predictor variables for the other $n-1$ individuals in the data set.

Leverage and Fitted Values

$h_{i,i}$ serves as a measure of how much the $i$th individual's observed value, $y_i$, influences the $i$th observation's fitted value, $\hat{y}_i$.

\begin{align*} \hat{y} &= Hy\\ \hat{y}_i &\textstyle= \sum_{j=1}^nh_{i,j}y_j\\ &\textstyle= h_{i,i}y_i + \sum_{j\neq i} h_{i,j}y_j \end{align*}

When conducting a regression with an intercept, we have the following properties

$1/n\leq h_{i,i}\leq 1$
$\sum_{i=1}^n h_{i,i} = (p+1)$
$h_{i,i}(1-h_{i,i}) = \sum_{i\neq j} h_{i,j}^2$ (comes from idempotence, $HH=H$).

Intuition for Leverage

Fitted value:

\[\textstyle \hat{y}_i = h_{i,i}y_i + \sum_{j\neq i} h_{i,j}y_j \]

$h_{i,i}$ can be thought of as the weight placed on $y_i$ when forming $\hat{y}_i$. $\sum_{j\neq i} h_{i,j}y_j$ is the contribution from the other $n-1$ observations.

Consider the extreme case $h_{i,i}=1$:

What happens to $\hat{y}_i$?
What happens to $e_i$?
Under the weaker linear model, what happens to $\text{var}(\hat{y}_i\mid x_i)$?
What happens to $\text{var}(e_i\mid x_i)$?

Intuition for Leverage

Under the weaker linear model...

\begin{align*} \text{var}(e_i \mid x_i) &= (1-h_{i,i})\sigma^2_\varepsilon\\ \text{var}(\hat{y}_i \mid x_i) &= h_{i,i}\sigma^2_\varepsilon\\ \text{cor}(\hat{y}_i, y_i \mid x_i) &= (h_{i,i})^{\frac12} \end{align*}

The larger $h_{i,i}$...

The smaller the variability in the $i$th residual
Less variability $\Rightarrow$ $y_i$ tends to be closer to $\hat{y}_i$.
The more $\hat{y}_i$ behaves like $y_i$.
The closer $\hat{y}_i$'s variability is to that of $y_i$ itself.

Leverage and "Outliers in X"

Through a bit of linear algebra (no need to derive), we can derive an alternative form for $h_{i,i}$ that helps understand what values of $x_i$ have high leverage.

Let $H_0$ be the "hat" matrix from a regression including only the intercept column.
Let $X_{\backslash 0}$ be our design matrix without the intercept column (so it has $p$ columns, one for each predictor, and $n$ rows)
Let $x_{i \backslash 0} = (x_{i1},...,{x}_{ip})$ be the predictor variables for the $i$th individual, with the intercept entry excluded
Let $\bar{x}_{\backslash 0} = (\bar{x}_1,...,\bar{x}_p)$ be the means of the predictor variables in our data

Leverage and "Outliers in X"

\[\textstyle h_{i,i} = \frac{1}{n} + (x_{i\backslash 0} - \bar{x}_{\backslash 0})^\top(X_{\backslash 0}^\top(I-H_0)X_{\backslash 0})^{-1}({x}_{i \backslash 0} - \bar{x}_{\backslash 0}) \]

Second term is always non-negative

If $x_i = \bar{x}$, second term cancels out, left with $\frac1n$.

We see that $x_i = \bar{x}$ results in the smallest possible self-influence / leverage, $\frac1n$.
As $x_i$ moves farther away from $\bar{x}$, the leverage increases
Points with high leverage are outliers with respect to the distribution of the predictor variables

Leverage and "Outliers in X"

Uses for Leverage

Leverages $h_{i,i}$ provide a scalar/numeric representation of the "self-influence" of a point

How much does the observation $y_i$ determine the fitted value $\hat{y}_i$?

We see it is a function of the predictor variables alone, as it is based on the hat matrix.

Uses for Leverage

Leverages help us answer the following

"At what points $x_i$ would an outlier in my responses $y_i$ have the potential to dramatically alter my prediction equation?"
Low leverage - outlying $y_i$ unlikely to impact our prediction equation
High leverage - outlying $y_i$ can dramatically impact the prediction equation

Rule of thumb: Leverages greater than $\frac{2(p+1)}{n}$ are considered high.

Which Point is an Outlier?

Can we come up with a numerical summary to help differentiate between these filled in points?

Detecting Outliers

How do we distinguish between truly unusual outcomes and large, but not unexpected, values for residuals?

Issue: Outlying observations in the response may have small residuals if they also have high leverage.
Idea: Exclude point $i$, and recompute $\hat{\beta}_{(i)}$ and $\hat{y}_{(i)} = x_i^\top \hat{\beta}_{(i)}$ with the $i$th observation excluded.
Consequence: $y_i$ is independent of $\hat{y}_{(i)}$ under the stronger linear model. ($i$ wasn't used to form $\hat{y}_{(i)}$).
If $|y_i - \hat{y}_{(i)}|$ is large, it suggests that the outcome $y_i$ is an outlier...

Removing an Observation

A Test for Outliers

Formally, we are testing

Null hypothesis: For all observations $a=1,...,n$, $y_a = x_a^\top\beta + \varepsilon_a$, $\varepsilon_a\overset{\text{iid}}{\sim} N(0, \sigma^2_\varepsilon)$
Alternative hypothesis: This model holds for all observations except for the $i$th individual, who comes from a different, unspecified, model.

A Test for Outliers

Under the null hypothesis, we have that

\begin{align*} \text{E}(y_i - \hat{y}_{(i)}) &= 0\\ \text{var}(y_i - \hat{y}_{(i)}) &= \sigma^2_\varepsilon(1+x_i^\top(X_{(i)}^\top X_{(i)})^{-1}x_i), \end{align*}

and that $y_i - \hat{y}_{(i)}$ is normally distributed.

[Does this remind you of the results underpinning our prediction intervals? It should...]

Externally Studentized Residuals

It turns out that, with $\hat{\sigma}_{\varepsilon(i)}$ the RMSE from a regression excluding the $i$th individual,

\begin{align*} t_i &= \frac{y_i - \hat{y}_{(i)}}{\hat{\sigma}_{\varepsilon (i)} \big\{1+x_i^\top(X_{(i)}^\top X_{(i)})^{-1}x_i\big\}^{\frac12}} \\ &\sim t_{n-(p+1)-1} \end{align*}

Externally Studentized Residuals

These $t_i$ are called externally studentized residuals. Interestingly, $t_i$ can also be expressed as

\begin{align*} t_i &= r_i \left( \frac{n-(p+1)-1}{n-(p+1)-r_i^2} \right)^{1/2}, \end{align*}

where $r_i$ are the standardized residuals we introduced previously.

Implication: we can actually compute $t_i$ without computing a regression that excludes the $i$th individual.

Multiple Hypothesis Tests

For any particular $i$,

We could then compute $|t_i|$ and compare it to $t_{1-\alpha/2, (n-p-2)}$.
Reject if $|t_i|$ exceeds this value
For any particular $i$, this controls the Type I error rate (the probability of falsely rejecting a true null hypothesis) at $\alpha$.

Multiple Hypothesis Tests

Suppose I repeated this process for all $n$ observations in my study. For each observation, I try to determine whether that individual observation is an outlier.

What do you think the probability of committing at least one Type I error will be?
$\alpha$? Larger than $\alpha$?

Bonferroni Correction

Suppose that the stronger linear model actually holds for all $n$ observations, meaning that none of the points are actually generated from a different model.

\begin{align*} \text{Type I error rate} &= \Pr(\text{reject at least one test} \mid H_0 \text{ True}) \\ &\le\textstyle \sum_i \Pr(\text{reject test}\;i \mid H_0 \text{ True}) \\ &= n\alpha \end{align*}

Bonferroni correction: test each hypothesis at level ${\color{blue}\alpha/n}$

Applying the above result: if each test is conducted at $\alpha/n$, the probability of committing at least one Type I error is bounded above by $\alpha$.

What to Do with Outliers?

In general, it is not good practice to remove outliers unless we have a compelling reason that the point isn't representative of the population of interest

Good reason: data entry error, individual was included in data set who truly should have been excluded

What to Do with Outliers?

In general, it is not good practice to remove outliers unless we have a compelling reason that the point isn't representative of the population of interest

Bad reasons
- Slope coefficients become more significant when excluded
- Data looks more linear when observation excluded
- $R^2$ increases when observation excluded
- ...

Outliers can indicate that your linear model doesn't apply to the entirety of the population under study, that there might be a lurking variable, that you might need a more complex model...

Combining Leverage and Outliers

For an observation to be truly influential in changing altering the fitted regression equation, it needs to be at least one of the following

Outlier
Large leverage point

We've also seen that

Some outliers don't affect the regression equation particularly much (if they are low leverage points)
Some large leverage points don't end up impacting the fitted model too much (if their outcome follows the trend in the rest of the data)

Can we devise a measure to detect such influential points?

Cook's Distance

The Cook's distance $D_i$ of an observation $i$ is the sum of the squared differences between predicted values in regressions fit including $i$ and excluding $i$, normalized by $(p+1)\hat{\sigma}_\varepsilon^2$

\begin{align*} D_i &= \frac{(X\hat{\beta} - X\hat{\beta}_{(i)})^\top(X\hat{\beta} - X\hat{\beta}_{(i)})}{(p+1)\hat{\sigma}^2_\varepsilon}, \end{align*}

where $\hat{\beta}_{(i)}$ are the regression coefficients from a regression excluding observation $i$.

Cook's Distance

Equivalently, Cook's distance can be expressed as

\begin{align*} D_i &= \frac{1}{p+1}r_i^2\frac{h_{i,i}}{1-h_{i,i}}, \end{align*}

from whence we see that $D_i$ combines the impact of

Large residual (in terms of standardized residual $r_i$)
Large leverage ($h_{i,i}$)

Rule of thumb: $D_i > 1$ indicates an influential point

A Diagnostic Plot for Influential Points

A Residuals vs Leverage plot visualizes

Standardized residuals $r_i$ ($y$ axis)
Value for Leverage ($x$ axis)
Cook's Distance (contours for $D_i = 0.5$ and $D_i=1$).

Residuals vs Leverage

Here's a normal looking residuals vs leverage plot. Nothing concerning here...

Residuals vs Leverage

In our simulated data sets: