AIC and BIC derivations

In this post, we present derivations of Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC). In their present form, the derivations assume familiarity with convergence of random variables and the asymptotic properties of maximum likelihood estimates, so they presuppose some background in mathematical statistics.

Akaike’s Information Criterion (AIC)

AIC estimates the expected deviance of the maximum likelihood estimate \(\widehat{\theta}\) on fresh data:

\[\textstyle d(\widehat{\theta})\triangleq\Ex\big[-2\sum_{i=1}^n\log f(Z_i';\widehat{\theta})\big]\]

where \(\cD'\triangleq\{Z_1',\dots,Z_n'\}\) consists of \(n\) fresh independent samples from the true pdf \(f(z;\theta_0)\), and the expectation is taken over \(\cD'\) with \(\widehat{\theta}\) held fixed. The obvious plug-in estimate of \(d(\widehat{\theta})\) is the (in-)sample deviance of the fitted model:

\[\textstyle \widehat{d}(\widehat{\theta}) = -2\sum_{i=1}^n\log f(Z_i;\widehat{\theta}),\]

where \(\cD\triangleq\{Z_1,\dots,Z_n\}\) consists of the \(n\) samples the model is fit to. Unfortunately, \(\widehat{d}(\widehat{\theta})\) is a downward biased estimate of \(d(\widehat{\theta})\) because \(\widehat{\theta}\) is the argmin of \(\widehat{d}(\theta)\). AIC corrects the bias in the (in-)sample deviance to obtain an asymptotically unbiased estimate of the expected deviance.
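The downward bias is easy to see numerically. The sketch below is only an illustration under assumptions of our own choosing: the toy model \(N(\theta, 1)\) with known unit variance, \(\theta_0 = 0\), \(n = 20\), and hypothetical helper names. In this model \(\widehat{\theta}\) is the sample mean and \(d(\widehat{\theta})\) has a closed form, so we can average the gap \(d(\widehat{\theta}) - \widehat{d}(\widehat{\theta})\) over many simulated data sets.

```python
import math
import random


def deviance(sample, theta):
    """In-sample deviance: -2 * log-likelihood of N(theta, 1) on `sample`."""
    n = len(sample)
    return n * math.log(2 * math.pi) + sum((z - theta) ** 2 for z in sample)


def expected_fresh_deviance(theta_hat, theta0, n):
    """Closed form of d(theta_hat) for the N(theta, 1) toy model.

    For a fresh Z' ~ N(theta0, 1): E[(Z' - theta_hat)^2] = 1 + (theta_hat - theta0)^2.
    """
    return n * math.log(2 * math.pi) + n * (1.0 + (theta_hat - theta0) ** 2)


def mean_optimism(n=20, reps=20000, theta0=0.0, seed=0):
    """Monte Carlo average of d(theta_hat) - d_hat(theta_hat) over `reps` data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta0, 1.0) for _ in range(n)]
        theta_hat = sum(sample) / n  # MLE of the mean
        total += expected_fresh_deviance(theta_hat, theta0, n) - deviance(sample, theta_hat)
    return total / reps


gap = mean_optimism()
print(f"average of d(theta_hat) - in-sample deviance: {gap:.3f}")
```

In this particular model the expected gap can be computed by hand and equals exactly 2 (there is one free parameter), so the printed average should land near 2 up to Monte Carlo error. Other models only satisfy this asymptotically.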

We split the bias of the (in-)sample deviance into two differences:

\[d(\widehat{\theta}) - \Ex\big[\widehat{d}(\widehat{\theta})\big] = \underbrace{d(\widehat{\theta}) - d(\theta_0)}_{I} + \underbrace{d(\theta_0) - \Ex\big[\widehat{d}(\widehat{\theta})\big]}_{II}.\]

We study the two differences separately. We (Taylor) expand \(d(\widehat{\theta})\) around \(\theta_0\) to obtain:

\[\begin{aligned} I &=\textstyle\cancel{\nabla d(\theta_0)}^\top(\widehat{\theta} - \theta_0) + \frac{1}{2}(\widehat{\theta} - \theta_0)^\top\nabla^2d(\theta_0)(\widehat{\theta} - \theta_0) + o_P(\|\widehat{\theta} - \theta_0\|_2^2)\\ &=\textstyle\cancel{\frac12}(\widehat{\theta} - \theta_0)^\top\Ex\big[-\cancel{2}\sum_{i=1}^n\partial_\theta^2\log f(Z_i';\theta)\mid_{\theta = \theta_0}\big](\widehat{\theta} - \theta_0) + o_P(\|\widehat{\theta} - \theta_0\|_2^2) \\ &=\textstyle n(\widehat{\theta} - \theta_0)^\top\Ex\big[-\partial_\theta^2\log f(Z;\theta)\mid_{\theta = \theta_0}\big](\widehat{\theta} - \theta_0) + o_P(\|\widehat{\theta} - \theta_0\|_2^2), \end{aligned}\]

where the first term in the Taylor expansion vanishes because \(\theta_0\) is the argmin of \(d(\theta)\). We recognize the matrix in the second term as the Fisher information matrix:

\[I(\theta_0) \triangleq -\Ex\big[\partial_\theta^2\{\log f(Z;\theta)\}_{\theta = \theta_0}\big].\]
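As a quick sanity check on this definition (an illustration of ours, not part of the derivation), take the exponential model \(f(z;\lambda) = \lambda e^{-\lambda z}\) for \(z \ge 0\), with the rate \(\lambda\) playing the role of \(\theta\):

\[\log f(z;\lambda) = \log\lambda - \lambda z,\qquad \partial_\lambda^2\log f(z;\lambda) = -\frac{1}{\lambda^2},\qquad I(\lambda_0) = \frac{1}{\lambda_0^2}.\]

Here the second derivative does not depend on \(z\), so the expectation is trivial. In most models it has to be computed under \(f(z;\theta_0)\).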

At this point, we recall the asymptotic distribution of maximum likelihood estimates:

\[n^{\frac12}(\widehat{\theta} - \theta_0)\dto N(0,I(\theta_0)^{-1}).\]

Thus

\[\begin{aligned} I &=\textstyle n(\widehat{\theta} - \theta_0)^\top I(\theta_0)(\widehat{\theta} - \theta_0) + o_P(\|\widehat{\theta} - \theta_0\|_2^2) \\ &= \|n^{\frac12}I(\theta_0)^{\frac12}(\widehat{\theta} - \theta_0)\|_2^2 + o_P(\|\widehat{\theta} - \theta_0\|_2^2) \end{aligned}\]
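The asymptotic normality of \(\widehat{\theta}\) invoked above can be checked by simulation. The sketch below is a minimal illustration under assumptions of our own: an exponential model with rate \(\lambda_0 = 2\), for which the MLE is \(\widehat{\lambda} = 1/\bar{Z}\) and \(I(\lambda_0) = 1/\lambda_0^2\), and a hypothetical helper name. If the limit holds, the variance of \(n^{\frac12}(\widehat{\lambda} - \lambda_0)\) should be close to \(I(\lambda_0)^{-1} = \lambda_0^2 = 4\).

```python
import math
import random
import statistics


def scaled_mle_errors(n=200, reps=4000, rate0=2.0, seed=1):
    """Draw `reps` data sets of size n from Exp(rate0) and return sqrt(n) * (rate_hat - rate0)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        total = sum(rng.expovariate(rate0) for _ in range(n))
        rate_hat = n / total  # MLE of the rate: reciprocal of the sample mean
        errors.append(math.sqrt(n) * (rate_hat - rate0))
    return errors


errors = scaled_mle_errors()
empirical_var = statistics.pvariance(errors)
asymptotic_var = 2.0 ** 2  # I(rate0)^{-1} = rate0^2 for the exponential model

print(f"empirical variance of sqrt(n) * (rate_hat - rate0): {empirical_var:.3f}")
print(f"asymptotic variance I(rate0)^(-1): {asymptotic_var:.3f}")
```

At finite \(n\) the two numbers agree only approximately. The MLE of the rate is slightly biased upward and its variance slightly exceeds the asymptotic value, so a small discrepancy is expected.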

October 01, 2024