Influential observations
Leverage scores
Testing for outliers
Cook's distance
Suppose the stronger linear model holds,
\begin{align*} y = X\beta + \varepsilon;\;\; \varepsilon \sim \text{MVN}(0, \sigma^2_\varepsilon I_{n\times n}), \end{align*}What's $\text{E}(e_i\mid x_i) = \text{E}(y_i - x_i^\top\hat{\beta} \mid x_i)$?
What's $\text{var}(e_i\mid x_i)$?
What was the distribution of $e_i$?
Is $e_i$ independent of $e_j$? Are $e_1,...,e_n$ identically distributed?
What were standardized residuals, $r_i$?
The stronger linear model has the following core components:
Linearity: At any $x_i$, $\text{E}(y \mid x_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip} = x_i^\top\beta$
Homoskedasticity: For any $x_i$, $\text{var}(\varepsilon_i\mid x_i) = \sigma^2_\varepsilon$ for all $i$.
Normality: $\varepsilon_i$ are IID normally distributed
We discussed the use of diagnostic plots to assess each of these assumptions.
What plot(s) do we use to assess linearity? What are we hoping to see in these plots?
What plot(s) do we use to assess homoskedasticity? What are we hoping to see in these plots?
What plot(s) do we use to assess normality? What are we hoping to see in these plots?
Consider our test statistic for testing the null $H_0: \beta_j = c$ with a two-sided alternative
\begin{align*} t_{stat} &= \frac{\hat{\beta}_j - c}{\text{se}(\hat{\beta}_j)}\\ p_{val} &= P(|T|\geq |t_{stat}|), \end{align*}where $T$ follows a $t$ distribution with $n-p-1$ degrees of freedom, and $\text{se}(\hat{\beta}_j) = \hat{\sigma}_\varepsilon\big\{(X^\top X)^{-1}_{(j+1),(j+1)}\big\}^{1/2}$
Where does linearity play a role?
Where does homoskedasticity play a role?
Where does normality play a role?
In addition to our suite of regression diagnostics assessing linearity, homoskedasticity, and normality, we'll introduce additional diagnostics to help us flag potentially unusual / important observations in our data set
Outliers - observations do not behave like the bulk of our data points
Influential points - observations which unduly/substantially affect the prediction equation.
A data point could be none, one, or both of these.
Informally, an outlier is an observation that differs substantially from the other points in the data set.
Might be due to misrecording or typo when inputing the data, or due to a failed experiment
Might also represent a rare, but possible, outcome in our population of interest (might be an exciting discovery!)
Could represent that something has gone awry (for instance, credit card fraud in the distribution of one's monthly expenditures).
Informally, an outlier is an observation that differs substantially from the other points in the data set.
Might be due to lurking variables (for instance, if a professional athlete is included in an investigation of exercise habits).
Including an additional predictor variable can also reveal outliers that weren't apparent before.
When conducting a linear regression analysis, can we think of a numerical way to flag potential outliers?
First thought: look for observations with extremely large or small residuals, $e_i = y_i - x_i^\top\hat{\beta}$
Hope: individuals who deviate substantially from the majority of the data should have residuals that "stand out"
We could simply assess the distribution of the residuals, look for observations which are extreme.
Does this strategy work?
The blue lines are the least squares estimates of the line of best fit for these two data sets...
Individual observations vary in the influence they have over the fitted values, and over the estimated regression line.
From the previous slide, we see evidence that whether a not a data point's predictor value is "extreme" plays a key role here
We define the leverage, or self-influence, of the $i$th observation (with covariate $x_i$) as
\begin{align*} \text{Leverage}_i &= h_{i,i}, \end{align*}where $h_{i,i}$ is the $i$th diagonal entry of the hat matrix,
$$H = X(X^\top X)^{-1}X^\top.$$$h_{i,i}$ serves as a measure of how much the $i$th individual's observed value, $y_i$, influences the $i$th observation's fitted value, $\hat{y}_i$.
\begin{align*} \hat{y} &= Hy\\ \hat{y}_i &\textstyle= \sum_{j=1}^nh_{i,j}y_j\\ &\textstyle= h_{i,i}y_i + \sum_{j\neq i} h_{i,j}y_j \end{align*}When conducting a regression with an intercept, we have the following properties
Fitted value:
\[\textstyle \hat{y}_i = h_{i,i}y_i + \sum_{j\neq i} h_{i,j}y_j \]$h_{i,i}$ can be thought of as the weight placed on $y_i$ when forming $\hat{y}_i$. $\sum_{j\neq i} h_{i,j}y_j$ is the contribution from the other $n-1$ observations.
Consider the extreme case $h_{i,i}=1$:
What happens to $\hat{y}_i$?
What happens to $e_i$?
Under the weaker linear model, what happens to $\text{var}(\hat{y}_i\mid x_i)$?
What happens to $\text{var}(e_i\mid x_i)$?
Under the weaker linear model...
\begin{align*} \text{var}(e_i \mid x_i) &= (1-h_{i,i})\sigma^2_\varepsilon\\ \text{var}(\hat{y}_i \mid x_i) &= h_{i,i}\sigma^2_\varepsilon\\ \text{cor}(\hat{y}_i, y_i \mid x_i) &= (h_{i,i})^{\frac12} \end{align*}The larger $h_{i,i}$...
The smaller the variability in the $i$th residual
Less variability $\Rightarrow$ $y_i$ tends to be closer to $\hat{y}_i$.
The more $\hat{y}_i$ behaves like $y_i$.
The closer $\hat{y}_i$'s variability is to that of $y_i$ itself.
Through a bit of linear algebra (no need to derive), we can derive an alternative form for $h_{i,i}$ that helps understand what values of $x_i$ have high leverage.
Let $H_0$ be the "hat" matrix from a regression including only the intercept column.
Let $X_{\backslash 0}$ be our design matrix without the intercept column (so it has $p$ columns, one for each predictor, and $n$ rows)
Let $x_{i \backslash 0} = (x_{i1},...,{x}_{ip})$ be the predictor variables for the $i$th individual, with the intercept entry excluded
Let $\bar{x}_{\backslash 0} = (\bar{x}_1,...,\bar{x}_p)$ be the means of the predictor variables in our data
Second term is always non-negative
We see that $x_i = \bar{x}$ results in the smallest possible self-influence / leverage, $\frac1n$.
As $x_i$ moves farther away from $\bar{x}$, the leverage increases
Points with high leverage are outliers with respect to the distribution of the predictor variables
Leverages $h_{i,i}$ provide a scalar/numeric representation of the "self-influence" of a point
How much does the observation $y_i$ determine the fitted value $\hat{y}_i$?
We see it is a function of the predictor variables alone, as it is based on the hat matrix.
Leverages help us answer the following
"At what points $x_i$ would an outlier in my responses $y_i$ have the potential to dramatically alter my prediction equation?"
Rule of thumb: Leverages greater than $\frac{2(p+1)}{n}$ are considered high.
Can we come up with a numerical summary to help differentiate between these filled in points?
How do we distinguish between truly unusual outcomes and large, but not unexpected, values for residuals?
Issue: Outlying observations in the response may have small residuals if they also have high leverage.
Idea: Exclude point $i$, and recompute $\hat{\beta}_{(i)}$ and $\hat{y}_{(i)} = x_i^\top \hat{\beta}_{(i)}$ with the $i$th observation excluded.
Consequence: $y_i$ is independent of $\hat{y}_{(i)}$ under the stronger linear model. ($i$ wasn't used to form $\hat{y}_{(i)}$).
If $|y_i - \hat{y}_{(i)}|$ is large, it suggests that the outcome $y_i$ is an outlier...
Formally, we are testing
Null hypothesis: For all observations $a=1,...,n$, $y_a = x_a^\top\beta + \varepsilon_a$, $\varepsilon_a\overset{\text{iid}}{\sim} N(0, \sigma^2_\varepsilon)$
Alternative hypothesis: This model holds for all observations except for the $i$th individual, who comes from a different, unspecified, model.
Under the null hypothesis, we have that
\begin{align*} \text{E}(y_i - \hat{y}_{(i)}) &= 0\\ \text{var}(y_i - \hat{y}_{(i)}) &= \sigma^2_\varepsilon(1+x_i^\top(X_{(i)}^\top X_{(i)})^{-1}x_i), \end{align*}and that $y_i - \hat{y}_{(i)}$ is normally distributed.
[Does this remind you of the results underpinning our prediction intervals? It should...]
It turns out that, with $\hat{\sigma}_{\varepsilon(i)}$ the RMSE from a regression excluding the $i$th individual,
\begin{align*} t_i &= \frac{y_i - \hat{y}_{(i)}}{\hat{\sigma}_{\varepsilon (i)} \big\{1+x_i^\top(X_{(i)}^\top X_{(i)})^{-1}x_i\big\}^{\frac12}} \\ &\sim t_{n-(p+1)-1} \end{align*}These $t_i$ are called externally studentized residuals. Interestingly, $t_i$ can also be expressed as
\begin{align*} t_i &= r_i \left( \frac{n-(p+1)-1}{n-(p+1)-r_i^2} \right)^{1/2}, \end{align*}where $r_i$ are the standardized residuals we introduced previously.
For any particular $i$,
We could then compute $|t_i|$ and compare it to $t_{1-\alpha/2, (n-p-2)}$.
Reject if $|t_i|$ exceeds this value
For any particular $i$, this controls the Type I error rate (the probability of falsely rejecting a true null hypothesis) at $\alpha$.
Suppose I repeated this process for all $n$ observations in my study. For each observation, I try to determine whether that individual observation is an outlier.
What do you think the probability of committing at least one Type I error will be?
$\alpha$? Larger than $\alpha$?
Suppose that the stronger linear model actually holds for all $n$ observations, meaning that none of the points are actually generated from a different model.
\begin{align*} \text{Type I error rate} &= \Pr(\text{reject at least one test} \mid H_0 \text{ True}) \\ &\le\textstyle \sum_i \Pr(\text{reject test}\;i \mid H_0 \text{ True}) \\ &= n\alpha \end{align*}Bonferroni correction: test each hypothesis at level ${\color{blue}\alpha/n}$
Applying the above result: if each test is conducted at $\alpha/n$, the probability of committing at least one Type I error is bounded above by $\alpha$.
In general, it is not good practice to remove outliers unless we have a compelling reason that the point isn't representative of the population of interest
Good reason: data entry error, individual was included in data set who truly should have been excluded
In general, it is not good practice to remove outliers unless we have a compelling reason that the point isn't representative of the population of interest
Bad reasons
Outliers can indicate that your linear model doesn't apply to the entirety of the population under study, that there might be a lurking variable, that you might need a more complex model...
For an observation to be truly influential in changing altering the fitted regression equation, it needs to be at least one of the following
We've also seen that
Some outliers don't affect the regression equation particularly much (if they are low leverage points)
Some large leverage points don't end up impacting the fitted model too much (if their outcome follows the trend in the rest of the data)
Can we devise a measure to detect such influential points?
The Cook's distance $D_i$ of an observation $i$ is the sum of the squared differences between predicted values in regressions fit including $i$ and excluding $i$, normalized by $(p+1)\hat{\sigma}_\varepsilon^2$
\begin{align*} D_i &= \frac{(X\hat{\beta} - X\hat{\beta}_{(i)})^\top(X\hat{\beta} - X\hat{\beta}_{(i)})}{(p+1)\hat{\sigma}^2_\varepsilon}, \end{align*}where $\hat{\beta}_{(i)}$ are the regression coefficients from a regression excluding observation $i$.
Equivalently, Cook's distance can be expressed as
\begin{align*} D_i &= \frac{1}{p+1}r_i^2\frac{h_{i,i}}{1-h_{i,i}}, \end{align*}from whence we see that $D_i$ combines the impact of
Rule of thumb: $D_i > 1$ indicates an influential point
A Residuals vs Leverage plot visualizes
Standardized residuals $r_i$ ($y$ axis)
Value for Leverage ($x$ axis)
Cook's Distance (contours for $D_i = 0.5$ and $D_i=1$).
Here's a normal looking residuals vs leverage plot. Nothing concerning here...
In our simulated data sets: