Polynomial Regression
Overfitting
Bias-variance trade-off
As we know by now, a key assumption underpinning multiple regression is the assumption of linearity.
\begin{align*} \text{E}(y \mid x) &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \end{align*}In practice, we might instead believe
\begin{align*} \text{E}(y \mid x) &= f(x_1,\dots,x_p). \end{align*}Our real data may be noisy observations from some complicated, nonlinear, multivariate function $f(\cdot)$.
Can we use least squares regression to accommodate nonlinearity?
If there's a particular transformation of $y$ or $x$ that is known or suspected to make the relationship linear, apply it. For example, a $\log$ transformation.
What if we don't know $f(\cdot)$?
Suppose we are presented with the following data set, and are asked to come up with a predictive model for $y$ as a function of $x$.
It's clearly not linear. Instead, it seems that $y_i = f(x_i) + \varepsilon_i$ for some nonlinear function $f(\cdot)$.
Could we still use multiple regression to help us here?
Taylor Series
A function $f$ infinitely differentiable at a point $a$ can be expressed as
\begin{align*} f(x) &= \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n, \end{align*}Here $f^{(n)}(\cdot)$ is the $n^{\text{th}}$ derivative of $f(\cdot)$.
Setting $a=0$ above and letting $\beta_j = f^{(j)}(0)/j!$:
\begin{align*} f(x) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots\\ &= \sum_{j=0}^{\infty}\beta_j x^j \end{align*}Truncating the infinite sum after $p$ polynomials, we may instead consider
\begin{align*} f(x) &\approx \sum_{j=0}^p \beta_j x^j \end{align*}That is, we can approximate $f(x)$ through a sum of polynomials of increasing degrees.
Suppose we observe noisy data $y_i = f(x_i) + \varepsilon_i$.
Do we know the true coefficients $\beta_j$?
If not, how might we estimate them?
A reasonable approach for finding the best approximating polynomial of degree $p$ would be to run ordinary least squares.
\begin{align*} \min_{\beta_0,\dots,\beta_p}\sum_{i=1}^n\{y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p)\}^2 \end{align*}Consider creating new covariates in R of the form $x_j = x^j$.
lm(y ~ x1 + ... + xp)
lm(y ~ poly(x, degree = p, raw = TRUE))
Here we aren't assuming anything about a "correct model," so statistical inference is out the window. We are simply trying to best approximate the functional relationship we are observing.
Consider the second degree polynomial:
\begin{align*} \hat{f}(x) &= \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 \end{align*}The usual way we interpret slope coefficients falls short here.
$\hat{\beta}_2$: "Holding all other variables fixed, two units that differ in $x^2$ by one unit are estimated to differ in $y$ by $\hat{\beta}_2$ units."
This would be nonsense. If the units differ in $x^2$, they also differ in $x$. Can't hold all else equal.
We don't ascribe interpretations to these coefficients, but rather focus mainly on the quality of the resulting fit.
Let's look at the $R^2$ values as we increase the degree.
| Degree | $R^2$ |
|---|---|
| Linear | 0.21 |
| Quadratic | 0.52 |
| Cubic | 0.55 |
| Quartic | 0.72 |
| $\vdots$ | $\vdots$ |
| Fifteenth | 0.78 |
Why is $R^2$ increasing in the degree?
Does this mean we should keep adding polynomial terms?
Overfitting
Overfitting is the production of an analysis that corresponds too closely, or even exactly, to a particular set of data.
As we add more polynomial terms, we are increasing $R^2$ by reducing the sum of squared residuals.
We make the function fit the observed responses better and better.
Does this necessarily mean that we're better approximating the unknown function $f(x)$? No.
The true conditional expectation function giving rise to our observed data is shown in red on the plot:
\begin{align*} \text{E}[y \mid x] &= x + 2x^2 + x^3 + 10x\sin(1.25x) \end{align*}The Danger of Overfitting
By corresponding too closely, or even exactly, to a particular set of data, the analysis may fail to generalize to additional data, and may fail to predict future observations reliably.
If our approximation $\hat{f}(x)$ fits the data too closely, we'll suffer when asked to use our approximation to predict future values.
For example, suppose using historical data, we are asked to predict the revenue of a new location for a store.
If we cater our predictions too much to the data set at hand, we might do worse when using our model for recommendations.
There's a tradeoff between accounting for true nonlinearities and fitting too closely to the data at hand.
In reality, I generated this data set according to
\begin{align*} y_i &= x_i + 2x_i^2 + x_i^3 + 10x_i\sin(1.25x_i) + \varepsilon_i, \end{align*}where $\varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, 6)$.
We will consider the following:
Start with a data set of size $50$ generated from this function.
Fit polynomials of varying degrees to our observed data set using least squares. Call the estimated polynomial of degree $j$ $\hat{f}_j(\cdot)$.
For each point $x_i$, simulate a new value $y_i^*$ from the true function at that point $x_i$. This emulates observing a future data point.
Use our approximation to predict this value, and see how well we do.
Each time we will calculate the following:
\begin{align*} RMSE_{predict}(j) &= \sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{f}_j(x_i) - y_i^*)^2} \end{align*}Average values for $RMSE_{predict}(j)$ across simulations as a function of the polynomial degree $j$. Smaller values for $RMSE_{predict}(j)$ mean that we did better at predicting future observations.
The fourth degree polynomial does best. The fifteenth degree polynomial does appreciably worse, despite always resulting in the best $R^2$.
The risk of overfitting is present with many statistical methods, and might occur if we
Include irrelevant predictor variables.
Include unnecessary interaction terms.
Allow our functional approximation to be too flexible.
This highlights the deficiency of measures such as $R^2$ for assessing model performance.
$R^2$ always improves with model complexity in-sample, but that doesn't reflect predictive accuracy for future observations, often called out-of-sample performance.
The previous illustration of overfitting simulated multiple data sets, but of course we're stuck with a single data set.
With our single data set, what actions can we take to avoid overfitting?
A bit of math will provide us with very useful insights for how to form a model based on our data set that best predicts future responses.
In general, suppose our data arise through
\begin{align*} y_i &= f(x_i) + \varepsilon_i \end{align*}$\varepsilon_i$ are $iid$ with $\text{E}[\varepsilon_i] = 0$ and $\text{var}(\varepsilon_i) = \sigma^2_\varepsilon$.
Based on our data we come up with a predictive model $\hat{f}(x)$, taking as input a vector of covariates $x$ and returning a prediction $\hat{y} = \hat{f}(x)$.
Example: linear regression yields $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p$.
Now, we consider using our prediction function from our data set $\hat{f}(x)$ to predict a new or future observation at a point $\tilde{x}$. Call the unknown future observation $y^*$.
A new store opens: what's our best guess for their sales per square foot based upon the model we came up with?
Our prediction, naturally, is what our model tells us given their $x$ value: $\hat{f}(\tilde{x})$.
How far off do we expect $y^*$ to be from our prediction $\hat{f}(\tilde{x})$?
The irreducible error is simply the variance of $y^*$ given $x$, which equals the variance of $\varepsilon$, namely $\sigma^2_\varepsilon$.
Irreducible Error, $\text{E}[(y^* - f(\tilde{x}))^2]$:
Even if I knew the true function $f(\tilde{x})$, my responses have intrinsic randomness associated with them.
There's nothing we can do about this for a fixed set of covariates. It is inherent noise in the system, and cannot be reduced unless we go out and collect more features.
Reducible Error, $\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]$:
How far off is my estimated function $\hat{f}$ expected to be from the true function $f$?
This error we can reduce, based on how we choose $\hat{f}$ using our data.
You've actually seen both of these errors before when forming prediction intervals, even if we didn't call them as such.
Under the stronger linear model, our prediction intervals were based on the variance of $y^* - \hat{\mu}_{y \mid \tilde{x}} = \tilde{x}^\top\hat{\beta}$.
\begin{align*} \text{Var}(y^* - \tilde{x}^\top\hat{\beta}) &= \text{Var}(y^*) + \text{Var}(\tilde{x}^\top\hat{\beta})\\ &= \text{E}[(y^* - \tilde{x}^\top\beta)^2] + \text{E}[(\tilde{x}^\top\hat{\beta} - \tilde{x}^\top\beta)^2] \end{align*}Note under the stronger linear model, $f(\tilde{x}) = \tilde{x}^\top\beta$.
Irreducible error: $\sigma^2_\varepsilon = \text{Var}(y^*)$.
Reducible error: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x} = \text{Var}(\tilde{x}^\top\hat{\beta})$.
In thinking about how to reduce the reducible error, consider a further decomposition.
\begin{align*} \underbrace{\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]}_{\text{Reducible Error}} &= \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}Estimator Bias
Is there a systematic difference between $\text{E}[\hat{f}(\tilde{x})]$ and $f(\tilde{x})$?
If I fit a line but the relationship is quadratic, then yes. I might systematically under- or over-estimate across data sets.
If the relationship was linear and I fit a line, then no. I'd be correct in expectation.
In thinking about how to reduce the reducible error, consider a further decomposition.
\begin{align*} \underbrace{\text{E}[(\hat{f}(\tilde{x}) - f(\tilde{x}))^2]}_{\text{Reducible Error}} &= \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}Estimator Variance
How much does my predicted value $\hat{f}(\tilde{x})$ vary from one data set to the next?
How stable is my prediction function?
Bias refers to the error that is introduced by modeling a real-life problem, which is usually extremely complicated, by a much simpler problem.
For example, linear regression assumes that there is a linear relationship between $y$ and $x$. If the truth is nonlinear, bias will be present.
Bias often stems from a misspecification of the functional form for $f(x)$.
The more flexible or complex a model is, the less bias it will generally have.
Variance refers to how much your estimate for $f(\cdot)$, $\hat{f}(\cdot)$, would change if you had a different data set.
How unstable is your estimated function $\hat{f}(x)$?
Generally, the more flexible a model is, the more variance it has.
Note under the stronger linear model, $f(\tilde{x}) = \tilde{x}^\top\beta$ and $y^* = \tilde{x}^\top\beta + \varepsilon$.
Irreducible error: $\sigma^2_\varepsilon = \text{Var}(y^*)$.
Reducible error: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x} = \text{Var}(\tilde{x}^\top\hat{\beta})$.
Decomposing reducible error:
Squared Bias: $0$. Under the stronger linear model, $\text{E}(\tilde{x}^\top\hat{\beta}) - \tilde{x}^\top\beta = 0$ for all $\tilde{x}$, since $\text{E}(\hat{\beta}) = \beta$.
Variance: $\sigma^2_\varepsilon \tilde{x}^\top (X^\top X)^{-1}\tilde{x}$.
Under the stronger linear model, if we fit a line, the only source of reducible error is the variance of our prediction function.
Suppose we fit a line, $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ when, in reality, $f(x) = \gamma_0 + \gamma_1 x + \gamma_2 x^2$.
Prediction error can thus be decomposed into three terms.
\begin{align*} \underbrace{\text{E}[(y^* - \hat{f}(\tilde{x}))^2]}_{\text{Prediction Error}} &= \underbrace{\text{E}[(y^* - f(\tilde{x}))^2]}_{\text{Irreducible Error}} + \underbrace{(\text{E}[\hat{f}(\tilde{x})] - f(\tilde{x}))^2}_{\text{Squared Estimator Bias}} + \underbrace{\text{E}[(\hat{f}(\tilde{x}) - \text{E}[\hat{f}(\tilde{x})])^2]}_{\text{Estimator Variance}} \end{align*}The last two terms are the ones that we can hope to control based on how we use our data to estimate $\hat{f}(\tilde{x})$.
Ideally, we'd like to minimize both terms.
Unfortunately, strategies that minimize one term need not minimize the other.
Stupid Example 1: Suppose I decide "My favorite number is 5, so I'm going to predict $\hat{f}(x) = 5$ regardless of $x$ and no matter how the data looks."
What would the estimator variance be?
What about estimator bias?
Stupid Example 2: Suppose my data $y_1,\dots,y_n$ are $iid$ with common mean $\mu$ and variance $\sigma^2 = 1$, so there are no covariates. As an estimator for a future observation $y_i^*$, I consider choosing $\hat{f} = y_1$.
That is, I ignore the other $n-1$ observations in my data set and use the first as my estimator.
What would the estimator's bias be?
What about the estimator's variance?
Compare the reducible error of this estimator to that of $\hat{f} = \bar{y} + 1/\sqrt{n}$ for $n \geq 2$.
Again, here's what we saw with prediction $RMSE$ in our polynomial simulation.
What's going on?
Increasing the polynomial degree decreases bias. We are getting better approximations of the true function.
But increasing the polynomial degree increases variance, as we begin to fit too much to noise through overfitting.
This is a general phenomenon known as the bias-variance tradeoff.
The Bias-Variance Tradeoff
As a general rule, as we increase model complexity:
Estimator bias decreases.
Estimator variance increases.
The takeaway: we need to think of ways to find an optimal tradeoff between the two.
In general, as model complexity increases, errors in forming predictions for the observed data will decrease.
However, errors for future observations will decline at first, as reductions in bias dominate, but will then start to increase again, as increases in variance dominate.
We must always keep this picture in mind when choosing a model. More flexible or complicated is not always better.
The tradeoff shows us that we must avoid overfitting to data.
We cannot naively find the best-fitting function to the data set at hand, because we might be fitting to apparent "patterns" that are due to random chance alone.
This highlights the deficiencies in $R^2$ as a metric for model performance.
It does not account for model complexity at all, and becomes larger and larger with more complex models.
It only compares reduction in error within our particular sample.
As we consider comparing models of different sizes, we need to consider performance on future data, rather than on the observed data.