Duality

STATS 606: Computation and Optimization Methods in Statistics

University of Michigan

including slides by Stephen Boyd and Lieven Vandenberghe

Lagrangian duality summary

$$ \begin{aligned} &\min\nolimits_{x\in\reals^n} &&f_0(x) \\ &\subjectto &&\{f_i(x) \le 0:\lambda_i\ge 0\}_{i=1}^m \\ & &&\{h_i(x) = 0:\nu_i\in\reals\}_{i=1}^p \end{aligned} $$

Lagrangian:

$$\textstyle L(x,\lambda,\nu) \triangleq f_0(x) + \sum_{i=1}^m\lambda_if_i(x) + \sum_{i=1}^p\nu_ih_i(x) $$

The Lagrangian encodes the original problem:

$$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\ &= \begin{cases} f_0(x) &\text{if }\{f_i(x) \le 0\}_{i=1}^m\text{ and }\{h_i(x) = 0\}_{i=1}^p \\ \infty &\text{otherwise} \end{cases} \end{aligned} $$

Lagrangian duality summary

primal function & problem:

$$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\ p^* &\triangleq \inf\nolimits_xf(x) \\ &= \inf\nolimits_{x\in\reals^n}\sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \end{aligned} $$

dual function & problem:

$$ \begin{aligned} g(\lambda,\nu) &\triangleq \inf\nolimits_x L(x,\lambda,\nu) \\ d^* &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}g(\lambda,\nu) \\ &= \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}\inf\nolimits_{x\in\reals^n} L(x,\lambda,\nu) \end{aligned} $$
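To make the primal and dual definitions concrete, the sketch below evaluates them on a one-dimensional toy problem, $\min_x x^2$ subject to $1 - x \le 0$. The toy problem, the grids, and all variable names are illustrative assumptions, not part of the slides. Here $L(x,\lambda) = x^2 + \lambda(1 - x)$, and minimizing over $x$ gives $g(\lambda) = \lambda - \lambda^2/4$.

```python
import numpy as np

# Toy problem (illustrative assumption): minimize x^2 subject to 1 - x <= 0.
def lagrangian(x, lam):
    return x**2 + lam * (1.0 - x)

def dual_function(lam):
    # inf_x L(x, lam) is attained at x = lam / 2, which gives lam - lam^2 / 4.
    return lam - lam**2 / 4.0

feasible_xs = np.linspace(1.0, 3.0, 2001)   # grid over the feasible set x >= 1
all_xs = np.linspace(-3.0, 3.0, 6001)       # grid used for the infimum over x
lams = np.linspace(0.0, 5.0, 5001)          # grid over lambda >= 0

# Primal value: minimize f_0 over the feasible grid.
p_star = np.min(feasible_xs**2)

# Dual value: maximize the dual function over lambda >= 0.
d_star = np.max(dual_function(lams))

# Brute-force check of the closed form for g at one value of lambda.
g_brute = np.min(lagrangian(all_xs, 1.5))

print("p* ~", p_star, " d* ~", d_star)
```

On this convex toy problem the two values agree up to grid resolution; in general only $d^* \le p^*$ is guaranteed.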

Ex: optimal transport dual

$$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) \\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \\ & &&\Pi\succeq 0 &&:\Lambda\in\reals_+^{m\times n} \end{aligned} $$

$\Pi_{i,j}$ is the mass transported from $x_i$ to $y_j$
$C_{i,j}$ is the transport cost from $x_i$ to $y_j$
$p$, $q$ are the PMFs of $P$, $Q$ respectively

As long as $p,q\succ 0$, Slater's CQ is satisfied because $pq^\top$ is a strictly feasible point.

Ex: optimal transport dual

OT Lagrangian:

$$ \begin{aligned} L(\Pi,f,g,\Lambda) &\triangleq \Tr(C^\top\Pi) +f^\top(p - \Pi 1_n) + g^\top(q - \Pi^\top 1_m) \\ &\quad{\color{red}-} \Tr(\Lambda^\top\Pi) \end{aligned} $$

OT dual function:

$$g(f,g,\Lambda) = \begin{cases}f^\top p + g^\top q & 0 = C - f1_n^\top - 1_mg^\top - \Lambda\\ -\infty & \text{otherwise}\end{cases}$$

OT dual problem:

$$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} &&f^\top p + g^\top q \\ &\subjectto && f1_n^\top + 1_mg^\top \preceq C \end{aligned} $$
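A quick numerical check of weak duality for the OT pair of problems can be sketched as follows. The random cost matrix, the uniform PMFs, and the particular feasible points (the product coupling $pq^\top$ and the dual point $f_i = \min_j C_{i,j}$, $g = 0$) are assumptions chosen for illustration; neither point is claimed to be optimal.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 5
C = rng.uniform(0.0, 1.0, size=(m, n))   # hypothetical cost matrix
p = np.full(m, 1.0 / m)                   # uniform PMFs (illustrative)
q = np.full(n, 1.0 / n)

# Primal-feasible point: the product coupling p q^T.
Pi = np.outer(p, q)
primal_feasible = (
    np.allclose(Pi.sum(axis=1), p)
    and np.allclose(Pi.sum(axis=0), q)
    and np.all(Pi >= 0.0)
)

# Dual-feasible point: g = 0 and f_i = min_j C_ij, so f_i + g_j <= C_ij.
g = np.zeros(n)
f = C.min(axis=1)
dual_feasible = np.all(f[:, None] + g[None, :] <= C + 1e-12)

primal_value = np.sum(C * Pi)   # Tr(C^T Pi)
dual_value = f @ p + g @ q

print("dual value", dual_value, "<= primal value", primal_value)
```

Any dual-feasible pair gives a lower bound on the cost of any feasible coupling, which is what the comparison illustrates.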

Ex: SVM dual

SVM problem:

$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1:\alpha_i \ge 0\}_{i=1}^n \end{aligned} $$

$\{(X_i,Y_i)\}_{i=1}^n$: training data
$\operatorname{sign}(\beta_0 + \beta^\top x)$: linear classifier (labels $Y_i\in\{-1,+1\}$)
$\alpha_i$: Lagrange multipliers

Slater's CQ holds if and only if the two classes are strictly linearly separable in the training data.

Ex: SVM dual

SVM Lagrangian:

$$\textstyle L(\beta_0,\beta,\alpha) \triangleq \frac12\|\beta\|_2^2 + \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) $$

SVM dual function:

$$ g(\alpha) = \begin{cases}\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j &\alpha^\top Y = 0 \\ -\infty &\text{otherwise}\end{cases} $$

SVM dual problem:

$$ \begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j \\ &\subjectto &&\alpha^\top Y = 0. \end{aligned} $$
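As a small illustration, the dual objective can be evaluated through the Gram matrix of the vectors $Y_iX_i$. The two-point dataset below is a made-up example, and the one-dimensional grid search only works because the constraint $\alpha^\top Y = 0$ forces $\alpha_1 = \alpha_2$ on this dataset; it is a sketch, not a general solver.

```python
import numpy as np

# Hypothetical two-point training set with labels in {-1, +1}.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
Y = np.array([1.0, -1.0])

def svm_dual_objective(alpha, X, Y):
    Z = Y[:, None] * X    # rows are Y_i X_i
    Q = Z @ Z.T           # Q_ij = Y_i Y_j X_i^T X_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# alpha^T Y = 0 forces alpha = (a, a) here, so scan over a >= 0.
a_grid = np.linspace(0.0, 2.0, 2001)
values = np.array([svm_dual_objective(np.array([a, a]), X, Y) for a in a_grid])
a_star = a_grid[np.argmax(values)]
alpha_star = np.array([a_star, a_star])

# Recover beta = sum_i alpha_i Y_i X_i and compare primal and dual values.
beta = (alpha_star * Y) @ X
primal_value = 0.5 * beta @ beta
dual_value = values.max()

print("alpha ~", alpha_star, " beta ~", beta)
```

On this dataset the dual objective reduces to $2a - 2a^2$, so the grid maximizer should sit near $a = \frac12$ with matching primal and dual values.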

Ex: optimal transport

OT KKT conditions:

$$ \begin{aligned} C - f1_n^\top - 1_mg^\top - \Lambda = 0&&\text{(stationarity)} \\ \Pi1_n = p, \Pi^\top 1_m = q, \Pi\succeq 0 &&\text{(primal feasibility)} \\ \Lambda\succeq 0 &&\text{(dual feasibility)} \\\textstyle \Tr(\Lambda^\top\Pi) = 0 &&\text{(comp. slackness)} \end{aligned} $$

Ex: optimal transport

OT dual problem:

$$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} && f^\top p + g^\top q \\ &\subjectto && f 1_n^\top + 1_mg^\top \preceq C \equiv\{f_i + g_j \le C_{i,j}\}_{i,j = 1}^{m,n} \end{aligned} $$

The $c$-concavity argument shows that an optimal dual pair $(f,g)$ can be chosen to satisfy

$$ \begin{aligned} f_i &= \min\nolimits_j\{C_{i,j} - g_j\},\quad i\in[m], \\ g_j &= \min\nolimits_i\{C_{i,j} - f_i\},\quad j\in[n]. \end{aligned} $$

Complementary slackness + stationarity:

$$ \begin{aligned} \Pi_{i,j} > 0 &\Rightarrow \Lambda_{i,j} = 0 \\ &\equiv f_i + g_j = C_{i,j} \end{aligned} $$
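The relations above can be checked by hand on a tiny instance. The $2\times 2$ cost matrix, the starting point $f = 0$, and the candidate coupling below are assumptions picked so that the optimum is easy to verify; a single pass of the two $\min$ updates happens to suffice here, which is not claimed in general.

```python
import numpy as np

# Hypothetical 2x2 instance.
C = np.array([[1.0, 3.0], [4.0, 2.0]])
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])

# One pass of the c-concavity relations, starting from f = 0.
f = np.zeros(2)
g = np.min(C - f[:, None], axis=0)   # g_j = min_i (C_ij - f_i)
f = np.min(C - g[None, :], axis=1)   # f_i = min_j (C_ij - g_j)

# Stationarity defines Lambda; dual feasibility requires Lambda >= 0.
Lam = C - f[:, None] - g[None, :]

# Candidate coupling: keep all mass on the diagonal.
Pi = np.diag(p)

primal_value = np.sum(C * Pi)    # Tr(C^T Pi)
dual_value = f @ p + g @ q
slackness = np.sum(Lam * Pi)     # Tr(Lambda^T Pi)

print("primal", primal_value, "dual", dual_value, "slackness", slackness)
```

All four KKT conditions hold for this pair, so the candidate coupling is optimal for this instance, and mass sits only on entries where $f_i + g_j = C_{i,j}$.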

Ex: Sinkhorn algorithm

entropy regularized OT problem (finite $\cX$ and $\cY$):

$$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) + \eps \Tr(\Pi^\top\log\Pi) \\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \end{aligned} $$

The regularizer $\eps\Tr(\Pi^\top\log\Pi)$ is $-\eps$ times the entropy of $\Pi$ ($\log$ and $\exp$ act entrywise). The non-negativity constraint is implicit in the domain of log.

entropy regularized OT KKT conditions:

Setting the gradient of the Lagrangian $\Tr(C^\top\Pi) + \eps\Tr(\Pi^\top\log\Pi) + f^\top(p - \Pi1_n) + g^\top(q - \Pi^\top1_m)$ with respect to $\Pi$ to zero gives $C + \eps(\log\Pi + 1_m1_n^\top) - f1_n^\top - 1_mg^\top = 0$, so

$$ \begin{aligned}\textstyle \Pi = e^{-1}\diag(\exp(\frac{f}{\eps}))\exp(-\frac{C}{\eps})\diag(\exp(\frac{g}{\eps})) &&\text{(stationarity)} \\ \begin{aligned} \Pi1_n = p \\ \Pi^\top1_m = q \end{aligned} &&\text{(primal feasibility)} \end{aligned} $$

Ex: Sinkhorn algorithm

Figure (panels): OT problem and OT map; OT problem and entropy regularized OT map.

Ex: Sinkhorn algorithm

From stationarity,

$$\textstyle \Pi = \diag(u)\exp(-\frac1\eps C)\diag(v)\text{ for some }u\in\reals^m,v\in\reals^n\text{ with positive entries}. $$

From primal feasibility, we see that $u$, $v$ satisfy

$$ \begin{aligned}\textstyle \diag(u)\exp(-\frac1\eps C)\diag(v)1_n = p, \\\textstyle \diag(v)\exp(-\frac1\eps C)^\top\diag(u)1_m = q. \end{aligned} $$

Sinkhorn algorithm: solve each equation for one scaling vector with the other held fixed (divisions are elementwise):

$$ \begin{aligned}\textstyle u_{t+1} \gets \frac{p}{\exp(-\frac1\eps C)v_t}\\\textstyle v_{t+1} \gets \frac{q}{\exp(-\frac1\eps C)^\top u_{t+1}} \end{aligned} $$
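A minimal NumPy sketch of the alternating scaling iteration is given below. It assumes the standard convention in which the regularized objective is $\Tr(C^\top\Pi) + \eps\Tr(\Pi^\top\log\Pi)$, so the kernel is $K = \exp(-C/\eps)$ and each update rescales $u$ or $v$ to match one marginal exactly. The function name `sinkhorn`, the fixed iteration count, and the random test instance are illustrative assumptions; for very small $\eps$ a log-domain variant would be needed to avoid underflow.

```python
import numpy as np

def sinkhorn(C, p, q, eps, n_iter=200):
    """Alternately rescale rows and columns of K = exp(-C / eps)."""
    K = np.exp(-C / eps)          # entrywise exponential (assumed sign convention)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)           # enforce the row marginals:    u * (K v)   = p
        v = q / (K.T @ u)         # enforce the column marginals: v * (K^T u) = q
    return u[:, None] * K * v[None, :]   # Pi = diag(u) K diag(v)

rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(4, 5))   # hypothetical cost matrix
p = np.full(4, 0.25)                      # uniform PMFs (illustrative)
q = np.full(5, 0.2)

Pi = sinkhorn(C, p, q, eps=0.5)
row_error = np.max(np.abs(Pi.sum(axis=1) - p))
col_error = np.max(np.abs(Pi.sum(axis=0) - q))
print("row marginal error:", row_error, " column marginal error:", col_error)
```

The column marginals are matched exactly after the last update, while the row marginals are matched only in the limit, so the row error is the natural stopping criterion in a less bare-bones version.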

Ex: support vectors

SVM problem:

$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{1 - Y_i(\beta_0 + \beta^\top X_i) \le 0:\alpha_i \ge 0\}_{i=1}^n \end{aligned} $$

SVM KKT conditions:

$$ \begin{aligned} \begin{aligned}\textstyle 0 = \sum_{i=1}^n\alpha_iY_i \\\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i \end{aligned}&&\text{(stationarity)} \\ \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n &&\text{(primal feasibility)} \\ \alpha\succeq 0 &&\text{(dual feasibility)} \\\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) &&\text{(comp. slackness)} \end{aligned} $$

Ex: support vectors

From stationarity

$$\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i,\quad\alpha\in\reals_+^n, $$

we see that only the $(X_i,Y_i)$'s such that $\alpha_i > 0$ (directly) affect $\beta$.

From complementary slackness

$$\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)), $$

we see that $\alpha_i > 0$ implies $Y_i(\beta_0 + \beta^\top X_i) = 1$: by primal and dual feasibility every term in the sum is non-positive, so each term must vanish.

Training samples such that $\alpha_i > 0$ (which therefore satisfy $Y_i(\beta_0 + \beta^\top X_i) = 1$, i.e. lie exactly on the margin) are called support vectors because they "support" the decision boundary.
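The following sketch checks the KKT conditions on a made-up three-point dataset and reads off the support vectors. The data and the candidate multipliers `alpha` are assumptions chosen by hand (the code verifies them rather than solving for them), and the intercept is recovered from a support vector using $Y_i\in\{-1,+1\}$.

```python
import numpy as np

# Hypothetical training set; the third point lies well inside its class.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 0.0]])
Y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])   # candidate multipliers (assumed, checked below)

# Stationarity: beta = sum_i alpha_i Y_i X_i.
beta = (alpha * Y) @ X

# Support vectors are the samples with alpha_i > 0.
support = alpha > 0

# On a support vector Y_i (beta_0 + beta^T X_i) = 1, so beta_0 = Y_i - beta^T X_i.
i = int(np.argmax(support))
beta0 = Y[i] - beta @ X[i]

margins = Y * (beta0 + X @ beta)

kkt_ok = bool(
    np.isclose(alpha @ Y, 0.0)                        # stationarity in beta_0
    and np.all(margins >= 1.0 - 1e-12)                # primal feasibility
    and np.all(alpha >= 0.0)                          # dual feasibility
    and np.isclose(alpha @ (1.0 - margins), 0.0)      # complementary slackness
)

print("support vectors:", np.flatnonzero(support), " margins:", margins)
```

Dropping the third sample would leave $\beta$ and $\beta_0$ unchanged, since its multiplier is zero.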

Ex: LASSO dual

The LASSO problem

$$\textstyle \min_{\beta\in\reals^d}\frac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 $$

has no constraints, so its dual function is a constant (equal to the optimal value of the problem) and carries no useful information.

Consider the equivalent linearly-constrained problem

$$ \begin{aligned} &\min\nolimits_{\widehat{y}\in\reals^n,\beta\in\reals^d} &&\textstyle\frac12\|y-\widehat{y}\|_2^2 + \lambda\|\beta\|_1 \\ &\subjectto &&\widehat{y} = X\beta &&:r\in\reals^n \end{aligned} $$

This is a standard trick to obtain a non-trivial dual function/problem for unconstrained problems.

Slater's condition is satisfied: the only constraint is affine, and the problem is feasible (take any $\beta$ and $\widehat{y} = X\beta$).

Ex: LASSO dual

The dual function is

$$ \begin{aligned} g(r) &\textstyle= \min_{\beta,\widehat{y}}\frac12\|y-\widehat{y}\|_2^2 + \lambda\|\beta\|_1 + r^\top(\widehat{y} - X\beta) \\ &\textstyle= \frac12\|y\|_2^2 - \sup_{\widehat{y}}\{(y-r)^\top\widehat{y} - \frac12\|\widehat{y}\|_2^2\} \\ &\textstyle\quad - \sup_\beta\{(X^\top r)^\top\beta - \lambda\|\beta\|_1\} \\ &\textstyle= \frac12\|y\|_2^2 - \frac12\|y - r\|_2^2 - I_{\lambda\bB_\infty^d}(X^\top r). \end{aligned} $$

The dual problem is

$$ \begin{aligned} &\max\nolimits_{r\in\reals^n} &&\textstyle\frac12\|y\|_2^2 - \frac12\|y - r\|_2^2, \\ &\subjectto &&\|X^\top r\|_\infty \le \lambda. \end{aligned} $$

Ex: LASSO dual

The dual problem is equivalent to projecting $y$ onto the pre-image of the unit $\ell_\infty$ norm ball under $\frac1\lambda X^\top$, i.e. onto $\{r\in\reals^n:\|X^\top r\|_\infty\le\lambda\}$:

$$ \begin{aligned} &\min\nolimits_{r\in\reals^n} &&\textstyle\frac12\|y - r\|_2^2, \\ &\subjectto &&\|X^\top r\|_\infty \le \lambda. \end{aligned} $$

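The optimal value of this version of the dual problem is not equal to the optimal value of the LASSO problem: if $r^*$ denotes the projection, the LASSO optimal value is $\frac12\|y\|_2^2 - \frac12\|y - r^*\|_2^2$, whereas this version has optimal value $\frac12\|y - r^*\|_2^2$.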
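The LASSO dual can be checked numerically in the special case $X = I$, where the LASSO solution is given by soft-thresholding and the feasible set $\{r:\|X^\top r\|_\infty\le\lambda\}$ is just the $\ell_\infty$ ball of radius $\lambda$. The choice $X = I$, the data `y`, and the value of `lam` are assumptions made so that everything has a closed form; the residual $r = y - X\beta$ is used as the dual candidate.

```python
import numpy as np

y = np.array([3.0, -0.5, 1.0])   # hypothetical response vector
lam = 1.0
X = np.eye(3)                    # orthonormal design (illustrative assumption)

# With X = I the LASSO solution is soft-thresholding of y at level lam.
beta = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Dual candidate: the residual r = y - X beta.
r = y - X @ beta
dual_feasible = bool(np.max(np.abs(X.T @ r)) <= lam + 1e-12)

primal_value = 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))
dual_value = 0.5 * y @ y - 0.5 * np.sum((y - r) ** 2)

# Value of the projection form of the dual: (1/2) ||y - r||^2.
projection_value = 0.5 * np.sum((y - r) ** 2)

# With X = I, projecting y onto the l_inf ball of radius lam is a clip.
r_projected = np.clip(y, -lam, lam)

print("primal", primal_value, "dual", dual_value, "projection form", projection_value)
```

The primal and dual values coincide on this instance, while the projection form returns a different number, which matches the remark about its optimal value.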