Test-based methods
Criterion-Based methods
Through our discussion of nonlinear regression, we understand the importance of choosing the right level of "complexity."
Too complex: the resulting fit might have high variance and can start fitting to noise.
Too simple: it cannot fully account for true trends in the data.
In linear regression, complexity arises as the number of predictor variables that we choose to include in our model.
We want to include any predictor variable that is truly meaningful for predicting the response; otherwise there would be bias.
We want to ignore irrelevant predictors; they just add noise to our predictions and increase variance.
Q: How can we use our observed data to help select the right number of predictor variables?
Q: How can we choose the right size for our linear model?
Imagine that we have $p$ covariates, and that linearity holds:
\begin{align*} \mathbf{E}(y \mid x) &= x^\top \beta, \end{align*}where $\beta = (\beta_0,\beta_1,\dots,\beta_p)^\top$ as before.
Of the covariates $x_1,\dots,x_p$, we imagine that only some of them have nonzero slope coefficients.
If $\beta_j \neq 0$, we'd like to use that variable for prediction.
If $\beta_j = 0$, we'd like to eliminate this variable from our model. It is irrelevant.
Keeping it in our regression would add noise to predictions without helping us uncover true trends.
Objective: find the subset $\mathcal{M}_m \subseteq \{1,\dots,p\}$ of $m \leq p$ indices for covariates with nonzero slope coefficients.
We proceed imagining we'll always include an intercept coefficient. How many possible models are there?
How many distinct subsets of $\{1,\dots,p\}$ are there?
We proceed imagining we'll always include an intercept coefficient. How many possible models are there?
How many distinct subsets of $\{1,\dots,p\}$ are there?
Answer: $2^p$.
Explodes quickly as $p$ grows.
$p=79$: $2^{79} \approx 6\times 10^{23}$ models (Avogadro's number).
Even for moderate $p$, there is a massive space of possible models to sift through.
For a given model size $0 \leq m \leq p$, we could imagine looking at all $\binom{p}{m}$ models with those $m$ predictor variables. For each possible subset:
Run a regression of $y$ on those $m$ predictor variables.
Calculate the $SSE$ for that model, or equivalently the $R^2$.
Of all models of size $m$, we could then choose the model that minimizes the $SSE$, or equivalently maximizes the $R^2$. Call the minimal value $SSE_m^*$.
For a particular model size $m$, using $R^2$ or $SSE$ is not so bad.
The deficiencies in $R^2$ have to do with comparing models of different complexities, with differing numbers of covariates.
If we did the previous step for each model size $m$, we'd have a collection of best sums of squared error $SSE_0^*,\dots,SSE_p^*$.
We also know that $SSE_0^* \geq SSE_1^* \geq \dots \geq SSE_p^*$. Why?
When building models, we need to account for the fact that in-sample fit will improve as the number of predictors increases.
We'll now describe some common approaches to variable selection.
Hypothesis Test-Based Methods.
Criterion-Based Methods.
The dependent variable is a quantitative measure of disease progression one year after baseline, and we have data on $n=442$ patients.
For now we'll consider 10 baseline variables thought to be predictive of disease progression.
Age, Sex, BMI, and so on.
All covariates have been standardized, so each column has mean zero and standard deviation one.
In this problem, there are $2^{10} = 1024$ possible models.
Here, $p$ is small enough that we can actually enumerate all possible models.
Here, $p$ is also small relative to the number of features in many modern regression applications ("big data").
When running linear regression, recall that a test of the null that $\beta_j=0$ had the following interpretation:
After controlling for the other predictor variables besides $x_j$, $\{x_\ell : \ell \neq j\}$, is there evidence that $x_j$ substantially improves the predictive power of our model?
In other words, do we need to include $x_j$ once we've already included the other $p-1$ variables?
If the null, $\beta_j = 0$, is true, the answer is no.
Under the stronger linear model, we can conduct this hypothesis test using a $t$-statistic.
Remember that testing $\beta_j=0$ is a particular instance of a partial $\mathcal{F}$-Test.
Compare predictive performance of a full model to that of a reduced model where some of the variables have been removed.
The $F$-Statistic
\begin{align*} F_{stat} &= \frac{(SSE_{red}-SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/df_{full}}, \end{align*}where $df$ denotes the residual degrees of freedom in each model.
Fail to reject: no evidence that the more complicated full model provides an improvement in predictive power relative to the reduced model.
If I fail to reject the null, it seems natural to proceed using the reduced model instead of the full model.
Common in practice. Called stepwise regression using $p$-values.
Backwards elimination starts with a linear model including all predictors, and then tries to reduce the model's size.
Start by fitting a regression using all $p$ predictor variables.
Check to see if any $p$-values fall above your significance level $\alpha$.
If some variables have $p$-values larger than $\alpha$, remove the single predictor variable with the largest $p$-value.
Re-fit your regression using the remaining $p-1$ predictors and return to step 2.
Stop when you have a model where all variables have $p$-values below $\alpha$.
Forward selection starts with a model only including the intercept, and then iteratively adds variables to create a larger model.
Start by fitting a regression using only the intercept.
For all predictors not currently in the model, check the $p$-value for those predictors if they are added to the model.
If any of these $p$-values are below $\alpha$, add the one with the smallest $p$-value below $\alpha$.
Refit the model with the new variable added, then return to step 2.
Stop when no new variables would yield $p$-values below $\alpha$.
When we have categorical predictors, we generally consider adding either all of the categories or none of the categories to the model.
Remember: we encode categoricals with $d-1$ dummy variables if there are $d$ categories.
We should not treat these as $d-1$ separate variables, but rather group them all together.
When deciding whether to add categoricals, we should look at the $p$-values for the partial $\mathcal{F}$-test where the categories are included in the full model and excluded in the reduced model.
In the diabetes data set, the models are of the following sizes:
Backwards elimination, $\alpha = 0.05$: use 6 predictor variables.
Forward selection, $\alpha = 0.05$: use 7 predictor variables.
As you can see, the two approaches are not guaranteed to produce the same model.
Even if forward stepwise and backwards elimination both terminate with a model of size $m$, the variables in those models are not guaranteed to be the same.
Reason: multicollinearity, and its impact on the interpretation of slope coefficients.
A few drawbacks of this approach are:
The approach is not guaranteed to provide the model with the best out-of-sample performance; models tend to be too small.
It is impacted by multicollinearity.
Additionally, you cannot take the $p$-values in the chosen model at face value.
This ends up being generally true after model selection, even if you are not using $p$-values to select.
Issue: a massive multiple-comparisons problem. The act of selecting a model compares many $p$-values along the way.
Even if there were truly no signal and all our covariates had no effect, would we expect to return a model with only an intercept?
Watch out for this. People conduct inference after model selection without modification all the time.
Alternative approaches to model selection imagine optimizing an objective function that encourages model fit while penalizing model complexity.
We know that if we consider
\begin{align*} \underset{\text{all models}}{\arg\min}\;\; \sum_{i=1}^n(y_i-\hat{y}_i)^2, \end{align*}In that case, we return the ordinary least squares regression using all $p$ predictors.
Consider instead
\begin{align*} \underset{\text{all models}}{\arg\min}\;\; Criterion, \end{align*}where the criterion:
Decreases as the quality of the fit to the observed data increases.
Increases in the complexity of the model ($m$, the number of included predictors).
For a given model $\mathcal{M}_m$ with $m$ predictors and sum of squared errors $SSE$, the adjusted $R^2$ is defined as
\[ R^2_{adj}(\mathcal{M}_m) = 1 - \frac{SSE(\mathcal{M}_m)/(n-m-1)}{SST/(n-1)} \]c.f. the unadjusted $R^2$:
\[ R^2(\mathcal{M}_m) = 1 - \frac{SSE(\mathcal{M}_m)}{SST}. \]Benefits as a measure of improvement:
As we know, $R^2$ never decreases as we increase the number of covariates, even if they are irrelevant.
$R^2_{adj}$ can decrease as we add more variables. The inclusion of degrees of freedom penalizes models with larger $m$, which makes it more suitable for model selection.
Watch out: $R^2_{adj}$ can be negative.
Consequence: $\hat{\sigma}_\varepsilon(\mathcal{M}_m) = \text{sd}(y)\sqrt{1-R_{adj}^2(\mathcal{M}_m)}$.
Note: maximizing $R^2_{adj}$ is equivalent to minimizing $\hat{\sigma}^2_\varepsilon(\mathcal{M}_m) = SSE(\mathcal{M}_m)/(n-m-1)$.
Consider then choosing a model through
\begin{align*} \underset{\text{all models}}{\arg\min}\;\; SSE(\mathcal{M}_m)/(n-m-1) \end{align*}Promotes quality model fit (numerator).
Penalizes model complexity (denominator decreases as model size $m$ increases).
For a model $\mathcal{M}_m$ with $m$ predictors, Mallows's $C_p$ takes the form
\begin{align*} C_p(\mathcal{M}_m) &= \frac{SSE(\mathcal{M}_m)}{\hat{\sigma}^2_{\varepsilon}(\mathcal{M}_p)} + 2(m+1)-n, \end{align*}where $\hat{\sigma}^2_{\varepsilon}(\mathcal{M}_p)$ is estimated from the model with all $p$ predictor variables.
Intuition: $C_p$ provides an estimate of the scaled sum of the reducible error at the observed values of the covariates:
\begin{align*} \text{Sum of Reducible Error} &= \frac{1}{\sigma^2_{\varepsilon}}\sum_{i=1}^n(\hat{y}_i - x_i^\top\beta)^2,\\ &= \frac{1}{\sigma^2_{\varepsilon}}\sum_{i=1}^n(\hat{f}(x_i) - f(x_i))^2, \end{align*}where $\hat{y}_i = \hat{f}(x_i)$ are the fitted values from model $\mathcal{M}_m$.
In a good fit, $C_p$ should be near or below $m+1$.
Under the stronger linear model, theoretical derivations can be used to derive additional criteria that take complexity into account. For a model $\mathcal{M}_m$ with $m$ predictors:
Akaike Information Criterion
\begin{align*} AIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m \end{align*}Bayesian Information Criterion
\begin{align*} BIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m \end{align*}Consider finding the model $\mathcal{M}_{m}$ minimizing either of these criteria:
Akaike Information Criterion
\begin{align*} AIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m \end{align*}Bayesian Information Criterion
\begin{align*} BIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m \end{align*}For $n=8$, $\log_e(n) = 2.079$.
For $n \geq 8$, $BIC$ penalizes larger models with more predictor variables more than $AIC$.
Here are our four suggested criteria to minimize:
Based on Adjusted $R^2$: $SSE(\mathcal{M}_m)/(n-m-1)$.
Mallows's $C_p$: $SSE(\mathcal{M}_m)/\hat{\sigma}^2_\varepsilon(\mathcal{M}_p) + 2(m+1)-n$.
$AIC$: $n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m$.
$BIC$: $n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m$.
For a fixed model size $m$, all of these criteria would agree on the same "best model."
The model of size $m$ that yields the smallest $SSE$.
The difference is the nature of the penalty for complexity, namely the penalty on $m$.
For a chosen criterion, if we could evaluate all possible models:
For a given model size $m$, evaluate all $\binom{p}{m}$ models of size $m$ and find the set of $m$ predictors yielding the best value for the criterion. Call it $Criterion_m^*$.
All four criteria we've introduced would agree on the best model of size $m$.
Across values of $m$, find the minimal value for $Criterion_m^*$.
This is where the differences between criteria show up. Different penalties on $m$ yield different models.
This general idea is called Best Subset Selection.
After penalizing complexity using our four criteria, we arrive at models of the following size:
Adjusted $R^2$: 8 predictor variables.
Mallows's $C_p$: 6 predictor variables.
Akaike Information Criterion ($AIC$): 6 predictor variables.
Bayesian Information Criterion ($BIC$): 5 predictor variables.
This ordering is fairly typical. Adjusted $R^2$ chooses larger models, $BIC$ chooses smaller models, and $AIC$/Mallows's $C_p$ falls in between. In fact, the two are asymptotically equivalent for model selection.
Suppose we consider using polynomial regression to accommodate potential nonlinearities with respect to our $p=10$ predictor variables.
Consider a second-degree polynomial: linear, square, and interaction terms.
10 baseline variables (Age, Sex, BMI, and so on).
45 interaction variables such as BMI $\times$ Age, and so on.
9 squared variables (we do not include the square of sex. What's the square of a binary?).
All covariates are then $z$-scored, so each column has mean zero and standard deviation one.
We have $p = 10+45+9 = 64$ covariates.
In this problem, there are now $2^{64} = 1.8\times 10^{19}$ possible models.