Consider the entropy maximization problem
\[\begin{aligned} &\max\nolimits_{p\in\cP_H} && H(p) = -\Ex_p\big[\log p(X)\big] \\ &\text{subject to} &&\Ex_p\big[\varphi(X)\big] = \mu \end{aligned}, \tag{maxEnt}\]where \(\cP_H\) is the set of probability distributions dominated by a base/reference measure \(H\) and \(\varphi:\cX\to\reals^d\) is a vector of sufficient statistics. Intuitively, (maxEnt) looks for the probability distribution that (i) agrees with the measurements/observations \(\mu\) and (ii) is the most uncertain. The entropy maximization objective is motivated by the principle of maximum entropy: by maximizing uncertainty, it minimizes subjectivity, injecting no information into the argmax beyond the measurements.
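For intuition, consider \(\cX = \reals\) and \(\varphi(x) = (x, x^2)\): then (maxEnt) asks for the most uncertain density with a prescribed mean and second moment, and the familiar answer, which the derivation below recovers, is a Gaussian.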
To solve the entropy maximization problem, we first derive an ansatz for the argmax heuristically and then verify that it is indeed the argmax. To keep things simple, we assume the base measure \(H\) is the Lebesgue measure on \(\cX\). With the constraints implicit in the definition of \(\cP_H\) written out, (maxEnt) is
\[\begin{aligned} &\max\nolimits_{p:\cX\to\reals} &&\textstyle-\int_{\cX}p(x)\log p(x)dx \\ &\text{subject to} &&\textstyle\{\int_{\cX}p(x)\varphi_j(x)dx = \mu_j&&:\theta_j\}_{j=1}^d \\ &&&\textstyle\int_{\cX}p(x)dx = 1&&:\theta_0 \\ &&&\textstyle\{p(x)\ge 0&&:\lambda(x) \ge 0\}_{x\in\cX} \end{aligned}, \tag{maxEnt}\]Its Lagrangian is
\[\begin{aligned} L(p,\lambda,\theta,\theta_0) &=\textstyle -\int_{\cX}p(x)\log p(x)dx + \sum_{j=1}^d\theta_j(\int_{\cX}p(x)\varphi_j(x)dx - \mu_j) \\ &\quad+\textstyle\theta_0\big(\int_{\cX}p(x)dx - 1\big) + \int_{\cX}\lambda(x)p(x)dx, \end{aligned}\]so its argmax (with respect to \(p\)) satisfies
\[\textstyle 0 = \frac{\partial}{\partial p(x)}L(p^\star,\lambda,\theta,\theta_0) = -\log p^\star(x) - \cancelto{1}{\frac{p^\star(x)}{p^\star(x)}} + \sum_{j=1}^d\theta_j\varphi_j(x) + \theta_0 + \lambda(x).\]To differentiate the Lagrangian with respect to \(p\), we heuristically treat each value \(p(x)\) as a separate coordinate and differentiate with respect to the scalar \(p(x)\), one \(x\in\cX\) at a time.
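Term by term, this amounts to the (formal) identities
\[\textstyle \frac{\partial}{\partial p(x)}\Big({-\int_{\cX}p(y)\log p(y)dy}\Big) = -\log p(x) - 1, \qquad \frac{\partial}{\partial p(x)}\int_{\cX}p(y)\varphi_j(y)dy = \varphi_j(x),\]with the normalization and non-negativity terms contributing \(\theta_0\) and \(\lambda(x)\), respectively.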
Rearranging, we obtain
\[\textstyle p^\star(x) = \frac{1}{e}\exp(\theta^\top\varphi(x) + \theta_0 + \lambda(x)).\]Since the right-hand side is strictly positive, the non-negativity constraints are inactive, so complementary slackness gives \(\lambda(x) = 0\); the constant \(e^{\theta_0 - 1}\) is then pinned down by the normalization constraint. This leads us to the ansatz
\[\begin{aligned}\textstyle p^\star(x) = \exp(\theta^\top\varphi(x) - A(\theta)), \\\textstyle A(\theta) \triangleq \log\int_{\cX}\exp(\theta^\top\varphi(x))dx, \end{aligned}\]where \(\theta\) is chosen so that the moment constraint \(\Ex_{p^\star}\big[\varphi(X)\big] = \mu\) holds. Finally, we check that this ansatz is indeed the argmax. For any feasible \(p\), we have
\[\def\KL{\text{KL}} \begin{aligned} H(p) &=\textstyle -\int_{\cX}p(x)\log\frac{p(x)}{p^\star(x)}dx - \int_{\cX}p(x)\log p^\star(x)dx \\ &=\textstyle -\int_{\cX}p(x)\log\frac{p(x)}{p^\star(x)}dx - \int_{\cX}p(x)(\theta^\top\varphi(x) - A(\theta))dx \\ &=\textstyle -\KL(p\|p^\star) - \int_{\cX}p^\star(x)(\theta^\top\varphi(x) - A(\theta))dx \\ &=\textstyle -\KL(p\|p^\star) + H(p^\star), \end{aligned}\]where the third step uses the fact that both \(p\) and \(p^\star\) satisfy the moment constraint, \(\Ex_p[\varphi(X)] = \mu = \Ex_{p^\star}[\varphi(X)]\):
\[\begin{aligned}\textstyle \int_{\cX}p(x)(\theta^\top\varphi(x) - A(\theta))dx &=\textstyle \theta^\top\int_{\cX}p(x)\varphi(x)dx - A(\theta) \\ &=\textstyle \theta^\top\int_{\cX}p^\star(x)\varphi(x)dx - A(\theta) \\ &=\textstyle \int_{\cX}p^\star(x)(\theta^\top\varphi(x) - A(\theta))dx. \end{aligned}\]Thus far, we have shown that the entropy of any feasible \(p\) is the entropy of \(p^\star\) less the Kullback-Leibler (KL) divergence between \(p\) and \(p^\star\). The KL divergence is non-negative and zero iff \(p = p^\star\), so we conclude that \(p^\star\) is the argmax of (maxEnt).
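As a concrete instance of this recipe, take \(\cX = [0,\infty)\), \(\varphi(x) = x\), and a measurement \(\mu > 0\). The ansatz gives \(A(\theta) = \log\int_0^\infty e^{\theta x}dx = -\log(-\theta)\) for \(\theta < 0\), and the moment constraint \(\Ex_{p^\star}[X] = -1/\theta = \mu\) yields \(\theta = -1/\mu\), so
\[\textstyle p^\star(x) = \frac{1}{\mu}e^{-x/\mu},\]the exponential distribution with mean \(\mu\): among all densities on \([0,\infty)\) with mean \(\mu\), it is the one with maximal entropy.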
We recognize \(p^\star\) as a member of the exponential family with sufficient statistics \(\varphi(x)\) (and Lebesgue base measure). As the measurements \(\mu\) (the expected values of the sufficient statistics) vary, the argmax of (maxEnt) is achieved by different members of the exponential family, i.e., by different values of the natural parameter \(\theta\).
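To make the correspondence between \(\mu\) and \(\theta\) tangible, here is a minimal numerical sanity check. It assumes a finite alphabet with counting base measure (so sums replace integrals); the alphabet, the choice \(\varphi(x) = x\), and the measurement \(\mu\) below are arbitrary illustrative choices, and the check is only accurate up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative (made-up) setup: finite alphabet, phi(x) = x, target moment mu.
xs = np.arange(6)                 # alphabet X = {0, 1, ..., 5}
phi = xs.astype(float)            # sufficient statistic phi(x) = x
mu = 1.7                          # measurement E_p[phi(X)]

def neg_entropy(p):
    """Negative Shannon entropy (the objective to minimize)."""
    p = np.clip(p, 1e-12, None)   # avoid log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # normalization
    {"type": "eq", "fun": lambda p: p @ phi - mu},    # moment constraint
]
p0 = np.full(len(xs), 1.0 / len(xs))                  # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(xs), constraints=constraints)
p_star = res.x

# If p_star is in the exponential family, log p_star(x) should be affine in phi(x):
# log p_star(x) = theta * phi(x) - A(theta).  Fit a line and inspect the residual.
theta, negA = np.polyfit(phi, np.log(p_star), 1)
residual = np.max(np.abs(np.log(p_star) - (theta * phi + negA)))
print(f"theta ~ {theta:.4f}, A(theta) ~ {-negA:.4f}, max affine residual ~ {residual:.2e}")
```

Up to optimizer tolerance, the fitted slope plays the role of the natural parameter \(\theta\) and the intercept that of \(-A(\theta)\); re-running with a different \(\mu\) moves \(\theta\) accordingly.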
The preceding derivation generalizes to inequality constraints on the moments of \(p\), and the argmax is again a member of the exponential family. More concretely,
\[\begin{aligned} p^\star(x) = \exp(\theta_1^\top\varphi_1(x) + \theta_2^\top\varphi_2(x) - A(\theta)), \\\textstyle A(\theta) \triangleq \log\int_{\cX}\exp(\theta_1^\top\varphi_1(x) + \theta_2^\top\varphi_2(x))dx, \end{aligned}\]such that \(\Ex_{p^\star}\big[\varphi_1(X)\big] = \mu_1\), \(\Ex_{p^\star}\big[\varphi_2(X)\big] \preceq \mu_2\), \(\theta_2 \preceq 0\), and (by complementary slackness) \(\theta_2^\top\big(\Ex_{p^\star}\big[\varphi_2(X)\big] - \mu_2\big) = 0\), solves the entropy maximization problem
\[\begin{aligned} &\max\nolimits_{p\in\cP_H} && H(p) = -\Ex_p\big[\log p(X)\big] \\ &\text{subject to} &&\Ex_p\big[\varphi_1(X)\big] = \mu_1 \\ &&&\Ex_p\big[\varphi_2(X)\big] \preceq \mu_2 \end{aligned}.\]We leave the details as an exercise.
Posted on February 24, 2025 from Ann Arbor, MI.