STATS 606

Minimum distance estimation

A natural way to estimate a parameter \(\theta\) is to find the value that minimizes the distance between the empirical distribution of the data \(P_n\triangleq\frac1n\sum_{i=1}^n\delta_{X_i}\) and the distribution in the statistical model with parameter \(\theta\):

\[\textstyle \min_{\theta\in\Theta} D(P_n,P_\theta),\]

where \(\Theta\) is the parameter space of the (statistical) model (ie the set of possible parameter values) and \(D\) is a distance between (probability) distributions. This idea is (unsurprisingly) called minimum distance estimation, and many common statistical estimation methods are actually special cases of it. In the rest of this post, we provide two examples.
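
To make the recipe concrete, here is a minimal numerical sketch of a minimum distance estimator. The specifics (a Gaussian location model \(P_\theta\) and the Cramér-von Mises distance between the empirical CDF and the model CDF) are illustrative assumptions of mine, not choices made above.

```python
# Minimal sketch of minimum distance estimation. Assumptions (mine, for
# illustration): the model is the Gaussian location family P_theta = N(theta, 1)
# and D is the Cramer-von Mises distance between the empirical CDF and the
# model CDF.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=500)  # data drawn from P_{theta = 2}
X_sorted = np.sort(X)
n = len(X_sorted)
ecdf = np.arange(1, n + 1) / n                # empirical CDF at the order statistics

def cvm_distance(theta):
    # squared Cramer-von Mises discrepancy between P_n and P_theta
    model_cdf = norm.cdf(X_sorted, loc=theta, scale=1.0)
    return np.mean((ecdf - model_cdf) ** 2)

theta_hat = minimize_scalar(cvm_distance, bounds=(-10.0, 10.0), method="bounded").x
print(theta_hat)  # close to the true location 2.0
```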

Maximum likelihood: The most prominent example of minimum distance estimation is maximum likelihood estimation: as we shall see, maximizing the likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence. The KL divergence between two distributions \(P\) and \(Q\) is

\[\def\KL{\text{KL}} \begin{aligned} \KL(P\| Q) &\triangleq\textstyle \Ex_P\big[\log\frac{dP}{dQ}(X)\big] = \Ex_P\big[-\log\frac{dQ}{dP}(X)\big]; \end{aligned}\]

ie it is the expected log-likelihood ratio. Strictly speaking, the KL divergence is not a distance because it is not symmetric in its arguments (\(\KL(P\|Q) \ne \KL(Q\|P)\) in general), but it does have other distance-like properties. In particular, it is non-negative, and it is zero if and only if \(P = Q\):

\[\def\KL{\text{KL}} \begin{aligned} \KL(P\| Q) &=\textstyle \Ex_P\big[-\log\frac{dQ}{dP}(X)\big] \\ &\ge \textstyle -\log\Ex_P\big[\frac{dQ}{dP}(X)\big] &\text{(Jensen's inequality)} \\ &=0 &\textstyle\text{($\Ex_P\big[\frac{dQ}{dP}(X)\big] = 1$)}, \end{aligned}\]
with equality in Jensen's inequality if and only if \(\frac{dQ}{dP}\) is constant \(P\)-almost surely, ie if and only if \(P = Q\).
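
As a quick numerical sanity check (the two discrete distributions below are toy choices of mine, not part of the argument above), the KL divergence is non-negative, asymmetric, and zero when its arguments coincide:

```python
# Toy check of the properties above for discrete distributions (the particular
# probability vectors are arbitrary illustrative choices).
import numpy as np

def kl(p, q):
    # KL(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete P, Q with full support
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
print(kl(P, Q), kl(Q, P))  # both positive and unequal: non-negative but not symmetric
print(kl(P, P))            # 0.0: the divergence vanishes when P = Q
```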

There is actually a whole family of divergences like the KL divergence, collectively called \(f\)-divergences. Two other prominent \(f\)-divergences are the Hellinger distance and the total variation (TV) distance.

The cost function for minimum KL divergence estimation is

\[\textstyle \KL(P_n\| P_\theta) = \frac1n\sum_{i=1}^n-\log\frac{p_\theta(X_i)}{1/n} = -\log n -\frac1n\sum_{i=1}^n\log p_\theta(X_i),\]

which is exactly the average negative log-likelihood if we ignore the \(-\log n\) term that does not depend on \(\theta\). We see that the maximum likelihood estimator is the minimum KL divergence estimator.
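
Here is a short sketch of this equivalence; the Gaussian location model below is my own illustrative choice. Minimizing the average negative log-likelihood (the KL cost without the constant) recovers the familiar MLE, which for this model is the sample mean.

```python
# Sketch: minimum KL divergence estimation = maximum likelihood estimation.
# Assumption (mine): the model is the Gaussian location family P_theta = N(theta, 1),
# whose MLE is the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(loc=-1.0, scale=1.0, size=1000)

def neg_log_lik(theta):
    # (1/n) * sum_i -log p_theta(X_i); the -log n term is dropped since it is constant in theta
    return -np.mean(norm.logpdf(X, loc=theta, scale=1.0))

theta_hat = minimize_scalar(neg_log_lik, bounds=(-10.0, 10.0), method="bounded").x
print(theta_hat, X.mean())  # the two agree up to optimizer tolerance
```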

Generative Adversarial Networks: Generative adversarial networks (GANs) are neural nets that generate samples (usually images) by (deterministically) transforming samples from a simple base distribution (eg standard Gaussian random vectors). To train them, we minimize an integral probability metric (IPM) between the empirical distribution of the data and the distribution of the generated samples. IPMs are distances on probability distributions of the form

\[\def\IPM{\text{IPM}} \IPM_{\cF}(P, Q) = \max_{f\in\cF}\int_{\cX}f(x)d(P - Q)(x),\]

where \(\cF\) is a collection of test functions (not to be confused with test functions in functional analysis): different choices of test functions lead to different IPMs. We call the \(f\)’s test functions because each \(f\) “tests” for differences between \(P\) and \(Q\) in a (fixed) direction. IPMs are convex in their arguments because they are maxima of linear functions: \(\int_{\cX}f(x)d(P - Q)(x)\) is linear in \(P\) and \(Q\) (for a fixed \(f\)), so they are convenient to minimize. Three prominent examples of IPMs (a numerical sketch of the third appears after the list) are

  1. the optimal transport AKA Wasserstein distance: \(\cF\) is the set of 1-Lipschitz functions with respect to the transport cost function,
  2. the aforementioned TV distance: \(\cF\) is the set of functions whose sup norm is at most 1,
  3. the (maximum) mean discrepancy: \(\cF\) is the unit ball of the function space spanned by a collection of feature maps.
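
Here is the promised sketch of the third example. The kernel, bandwidth, and samples are my own illustrative choices; the only fact used is the standard closed form \(\Ex[k(X,X')] + \Ex[k(Y,Y')] - 2\Ex[k(X,Y)]\) for the squared mean discrepancy over the unit ball generated by a kernel \(k\).

```python
# Empirical (maximum) mean discrepancy between two samples with a Gaussian
# kernel. The kernel, bandwidth, and data below are illustrative assumptions.
# For the unit ball generated by a kernel k, the squared MMD has the closed
# form  E[k(X, X')] + E[k(Y, Y')] - 2 E[k(X, Y)].
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # pairwise k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=(500, 1))
y = rng.normal(loc=0.5, scale=1.0, size=(500, 1))
print(mmd2(x, x))  # ~0: identical samples
print(mmd2(x, y))  # > 0: the test functions detect the mean shift
```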

To train GANs today, practitioners solve the min-max problem

\[\textstyle \def\base{\text{base}} \min_{G\in\cG}\max_{f\in\cF}\int_{\cX}f(x)d(P_n - G_{\#}P_\base)(x), \tag{GAN}\]

where \(G_{\#}P_\base\) is the pushforward of the base distribution \(P_\base\) under the generator \(G\) (ie the distribution of \(G(Z)\), \(Z\sim P_\base\)). I qualified the preceding statement with “today” because the first GANs did not (implicitly) minimize an IPM; they actually minimized a symmetrized version of the KL divergence (the Jensen-Shannon divergence). Going back to (GAN), we see that the discriminator plays the role of the test function in the IPM. In practice, the generator and discriminator are neural nets, and optimization over the generator and discriminator classes boils down to optimizing the weights of the neural nets. Later in the course, we shall see algorithms for solving such min-max problems.
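
To close, here is a toy numerical sketch of (GAN). Every modeling choice is a drastic simplification of mine, not from the discussion above: the generator is \(G_\theta(z) = \theta + z\), and the 1-Lipschitz discriminator class is the set of linear functions \(f_w(x) = wx\) with \(|w|\le 1\), for which the inner maximum has a closed form.

```python
# Toy version of the (GAN) min-max problem. Assumptions (mine): real data is
# Gaussian, the generator is G_theta(z) = theta + z, and the discriminator
# class is {f_w(x) = w * x : |w| <= 1}, a set of 1-Lipschitz functions.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=3.0, scale=1.0, size=2000)  # "real" data, ie P_n
theta, lr = 0.0, 0.05                          # generator parameter and step size

for _ in range(500):
    Z = rng.normal(size=256)                   # samples from P_base
    fake = theta + Z                           # samples from G_# P_base
    # inner max over the linear critic class is available in closed form:
    # max_{|w| <= 1} w * (mean(X) - mean(fake)) is attained at w = sign(mean(X) - mean(fake))
    w = np.sign(X.mean() - fake.mean())
    # outer min: one (sub)gradient descent step on theta; the objective's
    # gradient in theta is -w, so the descent step adds lr * w
    theta = theta + lr * w

print(theta, X.mean())  # theta ends up within roughly one step size of the data mean
```

Real GANs admit no such closed form for the inner maximization, which is why training alternates (stochastic) gradient updates on the two networks; the min-max algorithms mentioned above target exactly this setting.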

Posted on February 24, 2025 from Ann Arbor, MI.