DATASCI 415

The method of Lagrange multipliers

In this post, we review how to solve equality constrained optimization problems by hand. Consider the following optimization problem:

\[\begin{aligned} &\min\nolimits_{x\in\reals^n} && f_0(x) \\ &\subjectto && \{f_i(x) = 0:\lambda_i\}_{i=1}^m, \end{aligned} \tag{P}\]

where \(f_0:\reals^n\to\reals\) is the cost function to be minimized, the \(f_i:\reals^d\to\reals\)’s are the equality constraint functions, and the \(\lambda_i\in\reals\)’s are Lagrange multipliers associated with the constraints. Without loss of generality, we assume the right side of the (equality) constraints are zero because we can always (re)define the \(f_i\)’s so that the right side is zero. The set of feasible \(x\)’s (ie the \(x\)’s that satisfy the constraints) is called the feasible set.

The general recipe to solve (P) is:

  1. find the stationary point with respect to \(x\) of the Lagrangian

    $$\textstyle L(x,\lambda) \triangleq f_0(x) + \sum_{i=1}^m\lambda_i f_i(x) $$

    by (partially) differentiating \(L\) with respect to \(x\) and finding the zeros of

    $$\textstyle \partial_xL(x,\lambda) = \partial_xf_0(x) + \sum_{i=1}^m\lambda_i\partial_xf_i(x). $$ To keep things simple, we assume that \(\partial_xL\) has a unique stationary point (with respect to \(x\)). Note that the stationary \(x\) generally depends on \(\lambda\).
  2. Plug the stationary \(x\) into the constraints to find \(\lambda\) such that the stationary \(x\) satisfies the constraints.

To demonstrate the preceding recipe, we solve the constrained form of ridge regression:

\[\begin{aligned} &\min\nolimits_{\beta\in\reals^d} &&\textstyle\frac12\|y - X\beta\|_2^2 \\ &\subjectto && \|\beta\|_2^2 \le t, \end{aligned}\]

where \(X\in\reals^{n\times d}\) is the design matrix, \(y\in\reals^n\) is the output vector, and \(s\ge 0\) is a regularization parameter. Before we use the preceding recipe, we simplify the problem. Let \(\widehat{\beta}\) be the least squares estimate:

\[\def\LS{\mathrm{LS}} \widehat{\beta}_\LS = (X^\top X)^{-1}X^\top y.\]

If \(t > \|\widehat{\beta}_\LS\|_2^2\), then the squared (Euclidean) norm constraint is extraneous because the constrained minimum of the cost function coincides with with the unconstrained minimum. On the other hand, if \(t\le \|\widehat{\beta}_\LS\|_2^2\), then the minimum is attained on the boundary of the feasible set; ie. problems in which squared norm of the argmin is exactly \(t\). In the rest of this post, we focus on the latter (\(t\le \|\widehat{\beta}_\LS\|_2^2\)) case by studying the ridge regression problem with an equality constraint on \(\|\beta\|_2^2\):

\[\begin{aligned} &\min\nolimits_{\beta\in\reals^d} &&\textstyle\frac12\|y - X\beta\|_2^2 \\ &\subjectto && \|\beta\|_2^2 = t, \end{aligned} \tag{P'}\]

The Lagrangian of (P’) is

\[\textstyle L(\beta,\lambda) = \frac12\|y - X\beta\|_2^2 + \lambda(\|\beta\|_2^2 - t).\]

Its gradient with respect to \(\beta\) is

\[\textstyle \partial_\beta L(\beta,\lambda) = X^\top(X\beta - y) + 2\lambda\beta.\]

We set \(\partial_\beta L(\beta,\lambda) = 0\) and solve for \(\beta\) to find the stationary \(\beta\) (in terms of \(\lambda\)):

\[\widehat{\beta}_\lambda\triangleq(X^\top X + 2\lambda)^{-1}X^\top y.\]

Finally, we plug \(\widehat{\beta}_\lambda\) into the constraint to find \(\lambda\) such that \(\widehat{\beta}_\lambda\) satisfies the (squared) norm constraint. This final part of the derivation is specific to the constrained ridge regression problem, so you can safely skip the details without affecting your understanding of the general recipe. We include the details here for completeness. Let \(X = U\Sigma V^\top\) be the singular value decomposition of \(X\). We simplify \(\|\widehat{\beta}_\lambda\|_2^2\) to obtain:

\[\begin{aligned} \|\widehat{\beta}_\lambda\|_2^2 &= \|(X^\top X + 2\lambda)^{-1}X^\top y\|_2^2 \\ &= y^\top X(X^\top X + 2\lambda)^{-2}X^\top y \\ &= y^\top V\Sigma (\Sigma^2 + 2\lambda)^{-2}\Sigma V^\top y \\ &= \frac{\sum_{j=1}^d\sigma_j^2v_j^\top y}{\sum_{j=1}^d\sigma_j^2 + 2d\lambda}, \end{aligned}\]

where the \(\sigma_j\)’s are the singular values and the \(v_j\)’s are the left singular vectors of \(X\). We see that setting \(\lambda = \frac{1}{2d}(\frac1t\sum_{j=1}^d\sigma_j^2v_j^\top y - \sum_{j=1}^d\sigma_j^2)\) ensures the norm constraint is satisfied.

Posted on September 30, 2024 from Ann Arbor, MI.