Degrees of freedom in (statistical) learning

In this post, we work in the fixed design/input (matrix) setting described in the Mallow’s \(C_p\) derivation. Please see the derivation of the problem setup.

Optimism and Efron’s covariance formula

The optimism of a prediction rule \(\widehat{f}\) is the difference between its expected test error conditional on the inputs and its (expected) training error:

\[\textstyle \def\opt{\mathrm{opt}} \opt(\widehat{f}) \triangleq \frac1n\Ex\big[\|y' - \widehat{f}(X)\|_2^2\big] - \frac1n\Ex\big[\|y - \widehat{f}(X)\|_2^2\big].\]

The term optimism was coined by Brad Efron (who also invented the Bootstrap) who studied degrees of freedom throughout his career. He showed that

\[\begin{aligned} \opt(\widehat{f}) &=\textstyle \frac2n\Ex\big[(y - y')^\top\widehat{f}(X)\big] \\ &=\textstyle \frac2n\Ex\big[\sum_{i=1}^n(y_i - y_i')\widehat{f}(x_i)\big] \\ &=\textstyle \frac2n\sum_{i=1}^n\Ex\big[(y_i - f_*(x_i) + f_*(x_i) - y_i')\widehat{f}(x_i)\big] \\ &=\textstyle \frac2n\sum_{i=1}^n\Ex\big[(y_i - f_*(x_i))\widehat{f}(x_i)\big] - \frac2n\cancel{\Ex\big[(y_i' - f_*(x_i))\widehat{f}(x_i)\big]}, \end{aligned}\]

where the second term on the right side of the final expression vanishes because \(\widehat{f}\) and \(y'\) are independent. We note that the remaining term on the right side of the final expression is \(\cov[y_i,\widehat{f}(x_i)]\), so we have

\[\textstyle \opt(\widehat{f}) = \frac2n\sum_{i=1}^n\cov[y_i,\widehat{f}(x_i)].\]

This expression of optimism is called Efron’s covariance formula, and it is very general: it is valid for any design matrix, any vector of (mean zero) error terms, and any prediction rule. The intuition behind the formula is simple: the correlation between the \(y_i\)’s and \(\widehat{f}(x_i)\) is a measure of the flexibility of \(\widehat{f}\) (more flexible \(\widehat{f}\)’s can better approximate the \(y_i\)’s, leading to higher correlations. Intuitively, we expect the optimism to be higher for more flexible prediction rules because they tend to overfit the training data more; Efron’s covariance formula justifies this intuition mathematically.

Degrees of freedom

The degrees of freedom of a prediction rule \(\widehat{f}\) is defined as

\[\textstyle \def\df{\mathrm{df}} \df(\widehat{f}) \triangleq \frac{1}{\sigma^2}\sum_{i=1}^n\cov[y_i,\widehat{f}(x_i)].\]

To see why it is called degrees of freedom, consider the special case of least squares. Recall the predictions from the least squares estimate of the regression coefficient vector is a linear map of the training outputs: \(\widehat{f}(X) = Hy\), where \(H = X(X^\top X)^{-1}X^\top\) is the hat matrix. We can directly compute the degrees of freedom of least squares:

\[\begin{aligned}\textstyle \frac{1}{\sigma^2}\sum_{i=1}^n\cov\big[y_i,\big[Hy\big]_i\big] &=\textstyle \frac{1}{\sigma^2}\sum_{i=1}^n\Ex\big[(y_i - f_*(x_i))\big[Hy\big]_i\big] \\ &=\textstyle \frac{1}{\sigma^2}\Tr\big(\Ex\big[(y - f_*(x))(Hy)^\top\big]\big) \\ &=\textstyle \frac{1}{\sigma^2}\Tr\big(\Ex\big[(y - f_*(x))y^\top\big]H^\top\big) \\ &=\textstyle \frac{1}{\sigma^2}\Tr(\sigma^2 I_nH^\top) \\ &= \Tr(H); \end{aligned}\]

where we recalled the error vector has spherical covariance in the fourth step. Thus the degrees of freedom of least squares is the trace of the hat matrix:

\[\begin{aligned} \Tr(H) &= \Tr(X(X^\top X)^{-1}X^\top) \\ &= \Tr((X^\top X)^{-1}X^\top X) \\ &= \Tr(I_d) \\ &= d; \end{aligned}\]

ie the degrees of freedom of least squares is the dimension of the regression coefficient vector. The first part of the preceding calculation (that shows the degrees of freedom of least squares is the trace of the hat matrix) generalizes to linear smoothers: recall (from problem set 2) that linear smoothers are prediction rules of the form \(\widehat{f}(X) = Sy\) for some smoothing matrix \(S\in\reals^{n\times n}\). Some prominent examples of linear smoothers are \(K\)-nearest neighbors, ridge regression, and smoothing splines.

Going back to the definition of degrees of freedom, we note that it is closely related to the right side of this definition (together with Efron’s covariance formula) implies

\[\textstyle \frac1n\Ex\big[\|y' - \widehat{f}(X)\|_2^2\big] = \frac1n\Ex\big[\|y - \widehat{f}(X)\|_2^2\big] + \frac{2}{n}\sigma^2\df(\widehat{f}). \tag{dfc}\]

This suggests compute \(\df(\widehat{f})\) as a way of correcting the training error to estimate of the test error (in fixed design settings). Mallow’s \(C_p\) estimate (of the test error) is a special case of this idea: specializing (dfc) to least squares yields Mallow’s \(C_p\).

Stein’s unbiased risk estimate (SURE)

Unfortunately, it is often hard to compute the degrees of freedom of a prediction rule directly, so we often settle for estimates of the degrees of freedom. Stein’s unbiased risk estimate is a general way of obtaining unbiased estimates of degrees of freedom in problems where

the predictions \(\widehat{f}(X)\) depend smoothly on the training outputs \(y\), and
the error terms are Gaussian.

Its namesake is Charles Stein, who was a Efron’s colleague at Stanford. In the rest of this section, we assume \(\widehat{y}\triangleq\widehat{f}(X)\) is (weakly) differentiable with respect to the \(y_i\)’s and the \(\eps_i\)’s and \(\eps_i'\)’s are \(N(0,\sigma^2)\).

SURE relies on the following fact called Stein’s lemma (also named after Charles Steins). In its simplest form (for \(Z\sim N(0,1)\)), Stein’s lemma is basically an instance of integration-by-parts:

\[\begin{aligned} \Ex[g'(Z)] &=\textstyle \int g'(z)\varphi(z)dz \\ &=\textstyle g(z)\varphi(z)\big|_{-\infty}^\infty - \int g(z)\varphi'(z)dz \\ &=\textstyle -\int g(z)\varphi'(z)dz \\ &= \Ex\big[g'(Z)] \end{aligned}\]

for any (weakly) differentiable \(g:\reals\to\reals\). Stein’s lemma can be extended (from standard Gaussian random variables) to Gaussian random vectors with spherical covariance: for \(Z\sim N(\mu,\sigma^2I_n)\) and any (weakly) differentiable \(g:\reals^n\to\reals^n\), Stein’s lemma is

\[\textstyle \frac{1}{\sigma^2}\sum_{i=1}^n\Ex\big[(Z_i - \mu_i)g_i(X)^\top\big] = \Ex\big[\sum_{i=1}^n\frac{\partial}{\partial X_i}g_i(X)\big],\]

where \(g_i\) is the \(i\)-th component of \(g\). We recognize the right side as the (expected) divergence of \(g\). Going back to the problem if estimating degrees of freedom, we have

\[\begin{aligned} \df(\widehat{f}) &=\textstyle \frac{1}{\sigma^2}\sum_{i=1}^n\cov[y_i,\widehat{y}_i] \\ &=\textstyle \frac{1}{\sigma^2}\sum_{i=1}^n\Ex\big[(y_i - f_*(x_i))\widehat{y}_i^\top\big] \\ &=\textstyle \Ex\big[\sum_{i=1}^n\frac{\partial\widehat{y}_i}{\partial y_i}\big], \end{aligned}\]

where we used Stein’s lemma in the final step. This is Stein’s unbiased estimate of the degrees of freedom. Plugging it into (dfc) yields Stein’s unbiased risk estimate:

\[\textstyle \frac1n\Ex\big[\|y' - \widehat{y}_i\|_2^2\big] = \frac1n\Ex\big[\|y - \widehat{y}_i\|_2^2\big] + \frac{2}{n}\sigma^2\Ex\big[\sum_{i=1}^n\frac{\partial\widehat{y}_i}{\partial y_i}\big].\]

To wrap up, we show that the trace of the smoothing matrix is an unbiased estimate of the optimism of linear smoothers with SURE. Recall that linear smoothers are prediction rules of the form \(\widehat{y} = Sy\) for some smoothing matrix \(S\in\reals^{n\times n}\). SURE implies

\[\begin{aligned}\textstyle \frac1n\Ex\big[\|y' - \widehat{y}_i\|_2^2\big] &=\textstyle \frac1n\Ex\big[\|y - \widehat{y}_i\|_2^2\big] + \frac{2}{n}\sigma^2\Ex\big[\sum_{i=1}^nS_{i,i}\big] \\ &=\textstyle \frac1n\Ex\big[\|y - \widehat{y}_i\|_2^2\big] + \frac{2}{n}\sigma^2\Tr(S). \end{aligned}\]

Posted on October 21, 2024 from Ann Arbor, MI.