STATS 413: Lecture 18

  1. Polynomial Regression

  2. Overfitting

  3. Bias-variance trade-off

Nonlinear Relationships

As we know by now, a key assumption underpinning multiple regression is the assumption of linearity.

\begin{align*} \text{E}(y \mid x) &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \end{align*}

In practice, we might instead believe

\begin{align*} \text{E}(y \mid x) &= f(x_1,\dots,x_p). \end{align*}

Nonlinear Relationships

Our real data may be noisy observations from some complicated, nonlinear, multivariate function $f(\cdot)$.

Can we use least squares regression to accommodate nonlinearity?

  • If there's a particular transformation of $y$ or $x$ that is known or suspected to make the relationship linear, apply it. For example, a $\log$ transformation.

  • What if we don't know $f(\cdot)$?

An Exercise in One Dimension

Suppose we are presented with the following data set, and are asked to come up with a predictive model for $y$ as a function of $x$.

An Exercise in One Dimension

It's clearly not linear. Instead, it seems that $y_i = f(x_i) + \varepsilon_i$ for some nonlinear function $f(\cdot)$.

  • Could we still use multiple regression to help us here?

A Reminder: Taylor Expansions

Taylor Series

A function $f$ infinitely differentiable at a point $a$ can be expressed as

\begin{align*} f(x) &= \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n, \end{align*}

A Reminder: Taylor Expansions

Here $f^{(n)}(\cdot)$ is the $n^{\text{th}}$ derivative of $f(\cdot)$.

Setting $a=0$ above and letting $\beta_j = f^{(j)}(0)/j!$:

\begin{align*} f(x) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots\\ &= \sum_{j=0}^{\infty}\beta_j x^j \end{align*}

Polynomial Approximation

Truncating the infinite sum after $p$ polynomials, we may instead consider

\begin{align*} f(x) &\approx \sum_{j=0}^p \beta_j x^j \end{align*}

That is, we can approximate $f(x)$ through a sum of polynomials of increasing degrees.

Polynomial Approximation

Suppose we observe noisy data $y_i = f(x_i) + \varepsilon_i$.

  • Do we know the true coefficients $\beta_j$?

  • If not, how might we estimate them?

Least Squares to the Rescue

A reasonable approach for finding the best approximating polynomial of degree $p$ would be to run ordinary least squares.

\begin{align*} \min_{\beta_0,\dots,\beta_p}\sum_{i=1}^n\{y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p)\}^2 \end{align*}

Least Squares to the Rescue

Consider creating new covariates in R of the form $x_j = x^j$.

lm(y ~ x1 + ... + xp)

lm(y ~ poly(x, degree = p, raw = TRUE))

Here we aren't assuming anything about a "correct model," so statistical inference is out the window. We are simply trying to best approximate the functional relationship we are observing.

Approximations of Increasing Polynomial Degree

Interpretations for Slope Coefficients?

Consider the second degree polynomial:

\begin{align*} \hat{f}(x) &= \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 \end{align*}

The usual way we interpret slope coefficients falls short here.

Interpretations for Slope Coefficients?

  • $\hat{\beta}_2$: "Holding all other variables fixed, two units that differ in $x^2$ by one unit are estimated to differ in $y$ by $\hat{\beta}_2$ units."

  • This would be nonsense. If the units differ in $x^2$, they also differ in $x$. Can't hold all else equal.

We don't ascribe interpretations to these coefficients, but rather focus mainly on the quality of the resulting fit.

Improvements (??) In Performance

Let's look at the $R^2$ values as we increase the degree.

Degree $R^2$
Linear 0.21
Quadratic 0.52
Cubic 0.55
Quartic 0.72
$\vdots$ $\vdots$
Fifteenth 0.78

Improvements (??) In Performance

  • Why is $R^2$ increasing in the degree?

  • Does this mean we should keep adding polynomial terms?

Danger: Overfitting

Overfitting

Overfitting is the production of an analysis that corresponds too closely, or even exactly, to a particular set of data.

Danger: Overfitting

As we add more polynomial terms, we are increasing $R^2$ by reducing the sum of squared residuals.

  • We make the function fit the observed responses better and better.

  • Does this necessarily mean that we're better approximating the unknown function $f(x)$? No.

Comparing Our Fits to the Truth

The true conditional expectation function giving rise to our observed data is shown in red on the plot:

\begin{align*} \text{E}[y \mid x] &= x + 2x^2 + x^3 + 10x\sin(1.25x) \end{align*}

Comparing Our Fits to the Truth

What's So Bad About Overfitting?

The Danger of Overfitting

By corresponding too closely, or even exactly, to a particular set of data, the analysis may fail to generalize to additional data, and may fail to predict future observations reliably.

If our approximation $\hat{f}(x)$ fits the data too closely, we'll suffer when asked to use our approximation to predict future values.

What's So Bad About Overfitting?

  • For example, suppose using historical data, we are asked to predict the revenue of a new location for a store.

  • If we cater our predictions too much to the data set at hand, we might do worse when using our model for recommendations.

There's a tradeoff between accounting for true nonlinearities and fitting too closely to the data at hand.

An Illustration

In reality, I generated this data set according to

\begin{align*} y_i &= x_i + 2x_i^2 + x_i^3 + 10x_i\sin(1.25x_i) + \varepsilon_i, \end{align*}

where $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, 6)$.

We will consider the following:

  1. Start with a data set of size $50$ generated from this function.

  2. Fit polynomials of varying degrees to our observed data set using least squares. Call the estimated polynomial of degree $j$ $\hat{f}_j(\cdot)$.

An Illustration

  1. For each point $x_i$, simulate a new value $y_i^*$ from the true function at that point $x_i$. This emulates observing a future data point.

  2. Use our approximation to predict this value, and see how well we do.

Each time we will calculate the following:

\begin{align*} RMSE_{predict}(j) &= \sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{f}_j(x_i) - y_i^*)^2} \end{align*}

Visualizing the Tradeoff

Average values for $RMSE_{predict}(j)$ across simulations as a function of the polynomial degree $j$. Smaller values for $RMSE_{predict}(j)$ mean that we did better at predicting future observations.

Visualizing the Tradeoff

The fourth degree polynomial does best. The fifteenth degree polynomial does appreciably worse, despite always resulting in the best $R^2$.

Towards Avoiding Overfitting

The risk of overfitting is present with many statistical methods, and might occur if we

  • Include irrelevant predictor variables.

  • Include unnecessary interaction terms.

  • Allow our functional approximation to be too flexible.

Towards Avoiding Overfitting

This highlights the deficiency of measures such as $R^2$ for assessing model performance.

  • $R^2$ always improves with model complexity in-sample, but that doesn't reflect predictive accuracy for future observations, often called out-of-sample performance.

The previous illustration of overfitting simulated multiple data sets, but of course we're stuck with a single data set.

  • With our single data set, what actions can we take to avoid overfitting?

Breaking Down Prediction Error

A bit of math will provide us with very useful insights for how to form a model based on our data set that best predicts future responses.

In general, suppose our data arise through

\begin{align*} y_i &= f(x_i) + \varepsilon_i \end{align*}
  • $\varepsilon_i$ are $iid$ with $\text{E}[\varepsilon_i] = 0$ and $\text{var}(\varepsilon_i) = \sigma^2_\varepsilon$.

Based on our data we come up with a predictive model $\hat{f}(x)$, taking as input a vector of covariates $x$ and returning a prediction $\hat{y} = \hat{f}(x)$.

  • Example: linear regression yields $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p$.

Decomposing Prediction Error

Now, we consider using our prediction function from our data set $\hat{f}(x)$ to predict a new or future observation at a point $\tilde{x}$. Call the unknown future observation $y^*$.

  • A new store opens: what's our best guess for their sales per square foot based upon the model we came up with?

Our prediction, naturally, is what our model tells us given their $x$ value: $\hat{f}(\tilde{x})$.

  • How far off do we expect $y^*$ to be from our prediction $\hat{f}(\tilde{x})$?

Decomposing Prediction Error

\begin{align*} \underbrace{\text{E}[(y^* - \hat{f}(\tilde{x}))^2]}_{\text{Prediction Error}} &= \underbrace{\text{E}[(y^* - f(\tilde{x}))^2]}_{\text{Irreducible Error}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]}_{\text{Reducible Error}} \end{align*}

The irreducible error is simply the variance of $y^*$ given $x$, which equals the variance of $\varepsilon$, namely $\sigma^2_\varepsilon$.

Reducible versus Irreducible Error

Irreducible Error, $\text{E}[(y^* - f(\tilde{x}))^2]$:

  • Even if I knew the true function $f(\tilde{x})$, my responses have intrinsic randomness associated with them.

  • There's nothing we can do about this for a fixed set of covariates. It is inherent noise in the system, and cannot be reduced unless we go out and collect more features.

Reducible Error, $\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]$:

  • How far off is my estimated function $\hat{f}$ expected to be from the true function $f$?

  • This error we can reduce, based on how we choose $\hat{f}$ using our data.

Application: Prediction Intervals in Linear Regression

You've actually seen both of these errors before when forming prediction intervals, even if we didn't call them as such.

Under the stronger linear model, our prediction intervals were based on the variance of $y^* - \hat{\mu}_{y \mid \tilde{x}} = \tilde{x}^\top\hat{\beta}$.

\begin{align*} \text{Var}(y^* - \tilde{x}^\top\hat{\beta}) &= \text{Var}(y^*) + \text{Var}(\tilde{x}^\top\hat{\beta})\\ &= \text{E}[(y^* - \tilde{x}^\top\beta)^2] + \text{E}[(\tilde{x}^\top\hat{\beta} - \tilde{x}^\top\beta)^2] \end{align*}

Note under the stronger linear model, $f(\tilde{x}) = \tilde{x}^\top\beta$.

  • Irreducible error: $\sigma^2_\varepsilon = \text{Var}(y^*)$.

  • Reducible error: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x} = \text{Var}(\tilde{x}^\top\hat{\beta})$.

Decomposing Reducible Error

In thinking about how to reduce the reducible error, consider a further decomposition.

\begin{align*} \underbrace{\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]}_{\text{Reducible Error}} &= \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}

Estimator Bias

  • Is there a systematic difference between $\text{E}[\hat{f}(\tilde{x})]$ and $f(\tilde{x})$?

  • If I fit a line but the relationship is quadratic, then yes. I might systematically under- or over-estimate across data sets.

  • If the relationship was linear and I fit a line, then no. I'd be correct in expectation.

Decomposing Reducible Error

In thinking about how to reduce the reducible error, consider a further decomposition.

\begin{align*} \underbrace{\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]}_{\text{Reducible Error}} &= \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}

Estimator Variance

  • How much does my predicted value $\hat{f}(\tilde{x})$ vary from one data set to the next?

  • How stable is my prediction function?

Bias of a Model

  • Bias refers to the error that is introduced by modeling a real-life problem, which is usually extremely complicated, by a much simpler problem.

  • For example, linear regression assumes that there is a linear relationship between $y$ and $x$. If the truth is nonlinear, bias will be present.

  • Bias often stems from a misspecification of the functional form for $f(x)$.

  • The more flexible or complex a model is, the less bias it will generally have.

Variance of a Model

  • Variance refers to how much your estimate for $f(\cdot)$, $\hat{f}(\cdot)$, would change if you had a different data set.

  • How unstable is your estimated function $\hat{f}(x)$?

  • Generally, the more flexible a model is, the more variance it has.

Back to Prediction Intervals in Linear Regression

Note under the stronger linear model, $f(\tilde{x}) = \tilde{x}^\top\beta$ and $y^* = \tilde{x}^\top\beta + \varepsilon$.

  • Irreducible error: $\sigma^2_\varepsilon = \text{Var}(y^*)$.

  • Reducible error: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x} = \text{Var}(\tilde{x}^\top\hat{\beta})$.

Back to Prediction Intervals in Linear Regression

Decomposing reducible error:

  • Squared Bias: $0$. Under the stronger linear model, $\text{E}(\tilde{x}^\top\hat{\beta}) - \tilde{x}^\top\beta = 0$ for all $\tilde{x}$, since $\text{E}(\hat{\beta}) = \beta$.

  • Variance: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x}$.

Under the stronger linear model, if we fit a line, the only source of reducible error is the variance of our prediction function.

An Illustration of Prediction Error:
Fitting a Line When Truth is Quadratic

Suppose we fit a line, $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ when, in reality, $f(x) = \gamma_0 + \gamma_1 x + \gamma_2 x^2$.

On Reducing Reducible Error

Prediction error can thus be decomposed into three terms.

\begin{align*} \underbrace{\text{E}[(y^* - \hat{f}(\tilde{x}))^2]}_{\text{Prediction Error}} &= \underbrace{\text{E}[(y^* - f(\tilde{x}))^2]}_{\text{Irreducible Error}} + \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}

The last two terms are the ones that we can hope to control based on how we use our data to estimate $\hat{f}(\tilde{x})$.

  • Ideally, we'd like to minimize both terms.

  • Unfortunately, strategies that minimize one term need not minimize the other.

Targeting One, Not the Other

Stupid Example 1: Suppose I decide "My favorite number is 5, so I'm going to predict $\hat{f}(x) = 5$ regardless of $x$ and no matter how the data looks."

  • What would the estimator variance be?

  • What about estimator bias?

Targeting One, Not the Other

Stupid Example 2: Suppose my data $y_1,\dots,y_n$ are $iid$ with common mean $\mu$ and variance $\sigma^2 = 1$, so there are no covariates. As an estimator for a future observation $y_i^*$, I consider choosing $\hat{f} = y_1$.

That is, I ignore the other $n-1$ observations in my data set and use the first as my estimator.

  • What would the estimator's bias be?

  • What about the estimator's variance?

  • Compare the reducible error of this estimator to that of $\hat{f} = \bar{y} + 1/\sqrt{n}$ for $n \geq 2$.

The Bias-Variance Tradeoff

Again, here's what we saw with prediction $RMSE$ in our polynomial simulation.

The Bias-Variance Tradeoff

What's going on?

  • Increasing the polynomial degree decreases bias. We are getting better approximations of the true function.

  • But increasing the polynomial degree increases variance, as we begin to fit too much to noise through overfitting.

This is a general phenomenon known as the bias-variance tradeoff.

The Bias-Variance Tradeoff

The Bias-Variance Tradeoff

As a general rule, as we increase model complexity:

  • Estimator bias decreases.

  • Estimator variance increases.

The takeaway: we need to think of ways to find an optimal tradeoff between the two.

A Fundamental Picture

  • In general, as model complexity increases, errors in forming predictions for the observed data will decrease.

  • However, errors for future observations will decline at first, as reductions in bias dominate, but will then start to increase again, as increases in variance dominate.

A Fundamental Picture

We must always keep this picture in mind when choosing a model. More flexible or complicated is not always better.

On Accounting for the Tradeoff

The tradeoff shows us that we must avoid overfitting to data.

  • We cannot naively find the best-fitting function to the data set at hand, because we might be fitting to apparent "patterns" that are due to random chance alone.

On Accounting for the Tradeoff

This highlights the deficiencies in $R^2$ as a metric for model performance.

  • It does not account for model complexity at all, and becomes larger and larger with more complex models.

  • It only compares reduction in error within our particular sample.

  • As we consider comparing models of different sizes, we need to consider performance on future data, rather than on the observed data.