STATS 413: Lecture 21

  1. Test-based methods

  2. Criterion-Based methods

Model Selection

Through our discussion of nonlinear regression, we understand the importance of choosing the right level of "complexity."

  • Too complex: the resulting fit might have high variance and can start fitting to noise.

  • Too simple: it cannot fully account for true trends in the data.

In linear regression, complexity arises as the number of predictor variables that we choose to include in our model.

  • We want to include any predictor variable that is truly meaningful for predicting the response; otherwise there would be bias.

  • We want to ignore irrelevant predictors; they just add noise to our predictions and increase variance.

Model Selection

Q: How can we use our observed data to help select the right number of predictor variables?

Q: How can we choose the right size for our linear model?

General Setup

Imagine that we have $p$ covariates, and that linearity holds:

\begin{align*} \mathbf{E}(y \mid x) &= x^\top \beta, \end{align*}

where $\beta = (\beta_0,\beta_1,\dots,\beta_p)^\top$ as before.

Of the covariates $x_1,\dots,x_p$, we imagine that only some of them have nonzero slope coefficients.

General Setup

  • If $\beta_j \neq 0$, we'd like to use that variable for prediction.

  • If $\beta_j = 0$, we'd like to eliminate this variable from our model. It is irrelevant.

    • Keeping it in our regression would add noise to predictions without helping us uncover true trends.

Objective: find the subset $\mathcal{M}_m \subseteq \{1,\dots,p\}$ of $m \leq p$ indices for covariates with nonzero slope coefficients.

How Many Possible Models are There?

We proceed imagining we'll always include an intercept coefficient. How many possible models are there?

  • How many distinct subsets of $\{1,\dots,p\}$ are there?

How Many Possible Models are There?

We proceed imagining we'll always include an intercept coefficient. How many possible models are there?

  • How many distinct subsets of $\{1,\dots,p\}$ are there?

Answer: $2^p$.

  • For each variable: do I include it or not (2 options).
  • Make this choice $p$ times: $2^p$.

How Many Possible Models are There?

Explodes quickly as $p$ grows.

  • $p=5$: $2^5=32$ models.
  • $p=20$: $2^{20} \approx$ 1 million models.
  • $p=79$: $2^{79} \approx 6\times 10^{23}$ models (Avogadro's number).

Even for moderate $p$, there is a massive space of possible models to sift through.

Finding the "Best Model" For a Given Model Size

For a given model size $0 \leq m \leq p$, we could imagine looking at all $\binom{p}{m}$ models with those $m$ predictor variables. For each possible subset:

  1. Run a regression of $y$ on those $m$ predictor variables.

  2. Calculate the $SSE$ for that model, or equivalently the $R^2$.

Finding the "Best Model" For a Given Model Size

Of all models of size $m$, we could then choose the model that minimizes the $SSE$, or equivalently maximizes the $R^2$. Call the minimal value $SSE_m^*$.

  • For a particular model size $m$, using $R^2$ or $SSE$ is not so bad.

  • The deficiencies in $R^2$ have to do with comparing models of different complexities, with differing numbers of covariates.

Model Selection

If we did the previous step for each model size $m$, we'd have a collection of best sums of squared error $SSE_0^*,\dots,SSE_p^*$.

  • We also know that $SSE_0^* \geq SSE_1^* \geq \dots \geq SSE_p^*$. Why?

  • When building models, we need to account for the fact that in-sample fit will improve as the number of predictors increases.

We'll now describe some common approaches to variable selection.

  • Hypothesis Test-Based Methods.

  • Criterion-Based Methods.

Diabetes Dataset

The dependent variable is a quantitative measure of disease progression one year after baseline, and we have data on $n=442$ patients.

  • For now we'll consider 10 baseline variables thought to be predictive of disease progression.

  • Age, Sex, BMI, and so on.

  • All covariates have been standardized, so each column has mean zero and standard deviation one.

Diabetes Dataset

In this problem, there are $2^{10} = 1024$ possible models.

  • Here, $p$ is small enough that we can actually enumerate all possible models.

  • Here, $p$ is also small relative to the number of features in many modern regression applications ("big data").

Using Hypothesis Tests

When running linear regression, recall that a test of the null that $\beta_j=0$ had the following interpretation:

After controlling for the other predictor variables besides $x_j$, $\{x_\ell : \ell \neq j\}$, is there evidence that $x_j$ substantially improves the predictive power of our model?

  • In other words, do we need to include $x_j$ once we've already included the other $p-1$ variables?

  • If the null, $\beta_j = 0$, is true, the answer is no.

  • Under the stronger linear model, we can conduct this hypothesis test using a $t$-statistic.

Reminder: Partial $\mathcal{F}$-Tests

Remember that testing $\beta_j=0$ is a particular instance of a partial $\mathcal{F}$-Test.

  • Compare predictive performance of a full model to that of a reduced model where some of the variables have been removed.

The $F$-Statistic

\begin{align*} F_{stat} &= \frac{(SSE_{red}-SSE_{full})/(df_{red} - df_{full})}{SSE_{full}/df_{full}}, \end{align*}

where $df$ denotes the residual degrees of freedom in each model.

Reminder: Partial $\mathcal{F}$-Tests

Fail to reject: no evidence that the more complicated full model provides an improvement in predictive power relative to the reduced model.

  • If I fail to reject the null, it seems natural to proceed using the reduced model instead of the full model.

Common in practice. Called stepwise regression using $p$-values.

Backwards Elimination with Tests

Backwards elimination starts with a linear model including all predictors, and then tries to reduce the model's size.

  1. Start by fitting a regression using all $p$ predictor variables.

  2. Check to see if any $p$-values fall above your significance level $\alpha$.

  3. If some variables have $p$-values larger than $\alpha$, remove the single predictor variable with the largest $p$-value.

  4. Re-fit your regression using the remaining $p-1$ predictors and return to step 2.

  5. Stop when you have a model where all variables have $p$-values below $\alpha$.

Forward Selection with Tests

Forward selection starts with a model only including the intercept, and then iteratively adds variables to create a larger model.

  1. Start by fitting a regression using only the intercept.

  2. For all predictors not currently in the model, check the $p$-value for those predictors if they are added to the model.

  3. If any of these $p$-values are below $\alpha$, add the one with the smallest $p$-value below $\alpha$.

  4. Refit the model with the new variable added, then return to step 2.

  5. Stop when no new variables would yield $p$-values below $\alpha$.

Model Building with Categorical Vars

When we have categorical predictors, we generally consider adding either all of the categories or none of the categories to the model.

  • Remember: we encode categoricals with $d-1$ dummy variables if there are $d$ categories.

  • We should not treat these as $d-1$ separate variables, but rather group them all together.

  • When deciding whether to add categoricals, we should look at the $p$-values for the partial $\mathcal{F}$-test where the categories are included in the full model and excluded in the reduced model.

Results: Diabetes Data

In the diabetes data set, the models are of the following sizes:

  • Backwards elimination, $\alpha = 0.05$: use 6 predictor variables.

  • Forward selection, $\alpha = 0.05$: use 7 predictor variables.

As you can see, the two approaches are not guaranteed to produce the same model.

  • Even if forward stepwise and backwards elimination both terminate with a model of size $m$, the variables in those models are not guaranteed to be the same.

  • Reason: multicollinearity, and its impact on the interpretation of slope coefficients.

Drawbacks of Test-Based Methods

A few drawbacks of this approach are:

  • The approach is not guaranteed to provide the model with the best out-of-sample performance; models tend to be too small.

  • It is impacted by multicollinearity.

Additionally, you cannot take the $p$-values in the chosen model at face value.

Drawbacks of Test-Based Methods

This ends up being generally true after model selection, even if you are not using $p$-values to select.

  • Issue: a massive multiple-comparisons problem. The act of selecting a model compares many $p$-values along the way.

  • Even if there were truly no signal and all our covariates had no effect, would we expect to return a model with only an intercept?

  • Watch out for this. People conduct inference after model selection without modification all the time.

Criterion-Based Methods

Alternative approaches to model selection imagine optimizing an objective function that encourages model fit while penalizing model complexity.

We know that if we consider

\begin{align*} \underset{\text{all models}}{\arg\min}\;\; \sum_{i=1}^n(y_i-\hat{y}_i)^2, \end{align*}

In that case, we return the ordinary least squares regression using all $p$ predictors.

Criterion-Based Methods

Consider instead

\begin{align*} \underset{\text{all models}}{\arg\min}\;\; Criterion, \end{align*}

where the criterion:

  • Decreases as the quality of the fit to the observed data increases.

  • Increases in the complexity of the model ($m$, the number of included predictors).

Adjusted-$R^2$

For a given model $\mathcal{M}_m$ with $m$ predictors and sum of squared errors $SSE$, the adjusted $R^2$ is defined as

\[ R^2_{adj}(\mathcal{M}_m) = 1 - \frac{SSE(\mathcal{M}_m)/(n-m-1)}{SST/(n-1)} \]

c.f. the unadjusted $R^2$:

\[ R^2(\mathcal{M}_m) = 1 - \frac{SSE(\mathcal{M}_m)}{SST}. \]

Adjusted-$R^2$

Benefits as a measure of improvement:

  • As we know, $R^2$ never decreases as we increase the number of covariates, even if they are irrelevant.

  • $R^2_{adj}$ can decrease as we add more variables. The inclusion of degrees of freedom penalizes models with larger $m$, which makes it more suitable for model selection.

  • Watch out: $R^2_{adj}$ can be negative.

Adjusted $R^2$

\begin{align*} R^2_{adj}(\mathcal{M}_m) &= 1 - \frac{SSE(\mathcal{M}_m)/(n-m-1)}{SST/(n-1)} \\ &= 1-\hat{\sigma}^2_\varepsilon(\mathcal{M}_m)/(\text{sd}(y))^2 \end{align*}
  • Consequence: $\hat{\sigma}_\varepsilon(\mathcal{M}_m) = \text{sd}(y)\sqrt{1-R_{adj}^2(\mathcal{M}_m)}$.

  • Note: maximizing $R^2_{adj}$ is equivalent to minimizing $\hat{\sigma}^2_\varepsilon(\mathcal{M}_m) = SSE(\mathcal{M}_m)/(n-m-1)$.

Adjusted $R^2$

Consider then choosing a model through

\begin{align*} \underset{\text{all models}}{\arg\min}\;\; SSE(\mathcal{M}_m)/(n-m-1) \end{align*}
  • Promotes quality model fit (numerator).

  • Penalizes model complexity (denominator decreases as model size $m$ increases).

Mallows's $C_p$

For a model $\mathcal{M}_m$ with $m$ predictors, Mallows's $C_p$ takes the form

\begin{align*} C_p(\mathcal{M}_m) &= \frac{SSE(\mathcal{M}_m)}{\hat{\sigma}^2_{\varepsilon}(\mathcal{M}_p)} + 2(m+1)-n, \end{align*}

where $\hat{\sigma}^2_{\varepsilon}(\mathcal{M}_p)$ is estimated from the model with all $p$ predictor variables.

Mallows's $C_p$

Intuition: $C_p$ provides an estimate of the scaled sum of the reducible error at the observed values of the covariates:

\begin{align*} \text{Sum of Reducible Error} &= \frac{1}{\sigma^2_{\varepsilon}}\sum_{i=1}^n(\hat{y}_i - x_i^\top\beta)^2,\\ &= \frac{1}{\sigma^2_{\varepsilon}}\sum_{i=1}^n(\hat{f}(x_i) - f(x_i))^2, \end{align*}

where $\hat{y}_i = \hat{f}(x_i)$ are the fitted values from model $\mathcal{M}_m$.

  • In a good fit, $C_p$ should be near or below $m+1$.

Information-Based Criteria

Under the stronger linear model, theoretical derivations can be used to derive additional criteria that take complexity into account. For a model $\mathcal{M}_m$ with $m$ predictors:

Akaike Information Criterion

\begin{align*} AIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m \end{align*}

Bayesian Information Criterion

\begin{align*} BIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m \end{align*}

Information-Based Criterion

Consider finding the model $\mathcal{M}_{m}$ minimizing either of these criteria:

Akaike Information Criterion

\begin{align*} AIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m \end{align*}

Bayesian Information Criterion

\begin{align*} BIC(\mathcal{M}_m) &= n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m \end{align*}
  • For $n=8$, $\log_e(n) = 2.079$.

  • For $n \geq 8$, $BIC$ penalizes larger models with more predictor variables more than $AIC$.

Comparing the Criteria

Here are our four suggested criteria to minimize:

  • Based on Adjusted $R^2$: $SSE(\mathcal{M}_m)/(n-m-1)$.

  • Mallows's $C_p$: $SSE(\mathcal{M}_m)/\hat{\sigma}^2_\varepsilon(\mathcal{M}_p) + 2(m+1)-n$.

  • $AIC$: $n\log_e\{SSE(\mathcal{M}_m)/n\} + 2m$.

  • $BIC$: $n\log_e\{SSE(\mathcal{M}_m)/n\} + \log_e(n)m$.

Comparing the Criteria

For a fixed model size $m$, all of these criteria would agree on the same "best model."

  • The model of size $m$ that yields the smallest $SSE$.

The difference is the nature of the penalty for complexity, namely the penalty on $m$.

Best Subset Selection with Complexity Penalization

For a chosen criterion, if we could evaluate all possible models:

  1. For a given model size $m$, evaluate all $\binom{p}{m}$ models of size $m$ and find the set of $m$ predictors yielding the best value for the criterion. Call it $Criterion_m^*$.

  • All four criteria we've introduced would agree on the best model of size $m$.

  1. Across values of $m$, find the minimal value for $Criterion_m^*$.

  • This is where the differences between criteria show up. Different penalties on $m$ yield different models.

This general idea is called Best Subset Selection.

Results: Diabetes Data

After penalizing complexity using our four criteria, we arrive at models of the following size:

  • Adjusted $R^2$: 8 predictor variables.

  • Mallows's $C_p$: 6 predictor variables.

  • Akaike Information Criterion ($AIC$): 6 predictor variables.

  • Bayesian Information Criterion ($BIC$): 5 predictor variables.

This ordering is fairly typical. Adjusted $R^2$ chooses larger models, $BIC$ chooses smaller models, and $AIC$/Mallows's $C_p$ falls in between. In fact, the two are asymptotically equivalent for model selection.

Extended Diabetes Data

Suppose we consider using polynomial regression to accommodate potential nonlinearities with respect to our $p=10$ predictor variables.

  • Consider a second-degree polynomial: linear, square, and interaction terms.

  • 10 baseline variables (Age, Sex, BMI, and so on).

  • 45 interaction variables such as BMI $\times$ Age, and so on.

Extended Diabetes Data

  • 9 squared variables (we do not include the square of sex. What's the square of a binary?).

  • All covariates are then $z$-scored, so each column has mean zero and standard deviation one.

  • We have $p = 10+45+9 = 64$ covariates.

In this problem, there are now $2^{64} = 1.8\times 10^{19}$ possible models.

Walltime For Extended Diabetes Data