Mallow's Cp derivation

This post supplements the model selection slides. Please see the slides for the setup.

In this post, we show that Mallow’s \(C_p\) on p.20 (of the slides) is an estimate of the test error conditional on the inputs. Equivalently, Mallow’s \(C_p\) is an estimate of the test error in the fixed design/input (matrix) setting. To keep things simple, we assume \(y\) and \(y'\) come from a linear model (but this assumption is not necessary):

\[\begin{aligned} y &= X\beta_* + \eps \\ y' &= X\beta_* + \eps', \end{aligned}\]

where \(y,y'\in\reals^n\) are the training and test output vectors respectively, \(X\in\reals^{n\times d}\) is the (fixed) input matrix, \(\beta_*\in\reals^d\) is the (unknown) regression coefficient vector, and \(\eps,\eps'\in\reals^n\) are error vectors with zero mean (\(\Ex[\,\eps\,] = \Ex\big[\eps'\big] = 0\)) and spherical (co)variance (\(\Ex\big[\eps\eps^\top\big] = \Ex\big[\eps'\eps'^\top\big] = \sigma^2 I_n\)). We wish to show that

\[\textstyle C_p \approx \frac1n\Ex\big[\|y' - X\widehat{\beta}\|_2^2\big],\]

where \(X\in\reals^{n\times d}\) is the (fixed) input matrix, , and \(\widehat{\beta}\) is the least squares estimate of the regression coefficient vector.

We start by showing the gap between the expected training error and test error of the OLS estimator \(\widehat{\beta}\) is \(\frac{2d}{n}\sigma^2\). We have

\[\begin{aligned} &\underbrace{\textstyle\frac1n\Ex\big[\|y' - X\widehat{\beta}\|_2^2\big]}_\text{test error} - \underbrace{\textstyle\Ex\big[\frac1n\|y - X\widehat{\beta}\|_2^2\big]}_\text{expected training error} \\ &\quad=\textstyle\frac1n\Ex\big[\|y'\|_2^2 - 2y'^\top X\widehat{\beta} + \|X\widehat{\beta}\|_2^2\big] - \Ex\big[\frac1n\|y\|_2^2- \frac{2}{n}y^\top X\widehat{\beta} + \frac1n\|X\widehat{\beta}\|_2^2\big] \\ &\quad=\textstyle \frac{2}{n}\Ex\big[(y-y')^\top X\widehat{\beta}\big]. \end{aligned}\]

We plug in the linear model and the expression of the OLS estimator in terms of \(y = X\beta^* + \eps\) to obtain

\[\begin{aligned} &\textstyle\frac{2}{n}\Ex\big[(y-y')^\top X\widehat{\beta}\big] \\ &\quad=\textstyle \frac{2}{n}\Ex\big[(\eps - \eps')^\top X(X^\top X)^{-1}X^\top(X\beta_* + \eps)\big] \\ &\quad=\textstyle \frac{2}{n}(\cancel{\Ex\big[\eps^\top\big]} X(X^\top X)^{-1}X^\top X\beta_* + \Ex\big[\eps^\top X(X^\top X)^{-1}X^\top\eps\big] \\ &\qquad+\cancel{\Ex\big[\eps'^\top\big]}X(X^\top X)^{-1}X^\top X\beta_* + \cancel{\Ex\big[\eps'^\top\big]}X(X^\top X)^{-1}X^\top\cancel{\Ex[\,\eps\,]}), \end{aligned}\]

where the first, third, and fourth terms vanish in the final step because \(\eps\) and \(\eps'\) are independent and their means are zero. We evaluate the second term with the trace trick:

\[\def\tr{\mathrm{tr}} \begin{aligned}\textstyle \frac{2}{n}\Ex\big[\eps^\top X(X^\top X)^{-1}X^\top\eps\big] &=\textstyle \frac{2}{n}\Ex\big[\tr(\eps^\top X(X^\top X)^{-1}X^\top\eps)\big] \\ &=\textstyle \frac{2}{n}\Ex\big[\tr(X(X^\top X)^{-1}X^\top\eps\eps^\top)\big] \\ &=\textstyle \frac{2}{n}\tr(\Ex\big[(X(X^\top X)^{-1}X^\top\eps\eps^\top\big]) \\ &=\textstyle \frac{2}{n}\tr(X(X^\top X)^{-1}X^\top\Ex\big[\eps\eps^\top\big]) \\ &=\textstyle \frac{2}{n}\sigma^2\tr(X(X^\top X)^{-1}X^\top) & \text{(spherical error (co)variance)}\\ &=\textstyle \frac{2}{n}\sigma^2\tr((X^\top X)^{-1}X^\top X) \\ &=\textstyle \frac{2}{n}\sigma^2\tr(I_d), \end{aligned}\]

Thus we conclude

\[\begin{aligned}\textstyle \frac1n\Ex\big[\|y' - X\widehat{\beta}\|_2^2\big] &=\textstyle \Ex\big[\frac1n\|y - X\widehat{\beta}\|_2^2\big] + \frac{2d}{n}\sigma^2. \end{aligned}\]

This suggests we estimate the test error (on the left side) by plugging in estimates of the expected training error and \(\sigma^2\) (on the right side). In particular, if we plug in the RSS for the expected training error (and replace \(\sigma^2\) with an estimator \(\widehat{\sigma}^2\)), we obtain Mallow’s \(C_p\).

Posted on August 09, 2024 from Ann Arbor, MI.