STATS 606: Computation and Optimization Methods in Statistics
University of Michigan
including slides by Stephen Boyd and Lieven Vandenberghe
Lagrangian:
$$\textstyle L(x,\lambda,\nu) \triangleq f_0(x) + \sum_{i=1}^m\lambda_if_i(x) + \sum_{i=1}^p\nu_ih_i(x) $$The Lagrangian encodes the original problem:
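$$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\ &= \begin{cases} f_0(x) &\begin{cases}\{f_i(x) \le 0:\lambda_i\ge 0\}_{i=1}^m \\ \{h_i(x) = 0:\nu_i\in\reals\}_{i=1}^p\end{cases} \\ \infty &\text{otherwise} \end{cases} \end{aligned} $$
(The supremum of $\lambda_if_i(x)$ over $\lambda_i\ge 0$ is $0$ if $f_i(x)\le 0$ and $\infty$ otherwise; likewise, the supremum of $\nu_ih_i(x)$ over $\nu_i\in\reals$ is $0$ if $h_i(x) = 0$ and $\infty$ otherwise.)
primal function & problem: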
$$ \begin{aligned} f(x) &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \\ p^* &\triangleq \inf\nolimits_xf(x) \\ &= \inf\nolimits_{x\in\reals^n}\sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p} L(x,\lambda,\nu) \end{aligned} $$
dual function & problem:
$$ \begin{aligned} g(\lambda,\nu) &\triangleq \inf\nolimits_x L(x,\lambda,\nu) \\ d^* &\triangleq \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}g(\lambda,\nu) \\ &= \sup\nolimits_{\lambda\in\reals_+^m,\nu\in\reals^p}\inf\nolimits_{x\in\reals^n} L(x,\lambda,\nu) \end{aligned} $$
For the optimal transport (OT) problem with marginals $p$ and $q$, Slater's CQ is satisfied as long as $p,q\succ 0$, because $pq^\top$ is a strictly feasible point.
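As a quick numerical sanity check of the strict-feasibility claim, the sketch below builds the independent coupling $pq^\top$ and verifies the OT constraints $\Pi 1_n = p$, $\Pi^\top 1_m = q$ together with entrywise positivity. The vectors `p` and `q` are hypothetical values chosen for illustration (assumed strictly positive and summing to one); they are not data from the slides.

```python
import numpy as np

# Hypothetical marginals (assumed strictly positive and summing to 1).
p = np.array([0.2, 0.3, 0.5])          # m = 3
q = np.array([0.1, 0.4, 0.25, 0.25])   # n = 4

# Independent coupling: Pi[i, j] = p[i] * q[j].
Pi = np.outer(p, q)

row_ok = np.allclose(Pi.sum(axis=1), p)    # Pi 1_n = p
col_ok = np.allclose(Pi.sum(axis=0), q)    # Pi^T 1_m = q
strictly_positive = bool((Pi > 0).all())   # every entry is > 0

print(row_ok, col_ok, strictly_positive)
```

If any entry of `p` or `q` were zero, the corresponding row or column of `Pi` would vanish and the coupling would be feasible but no longer strictly feasible.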
OT Lagrangian:
$$ \begin{aligned} L(\Pi,f,g,\Lambda) &\triangleq \Tr(C^\top\Pi) +f^\top(p - \Pi 1_n) + g^\top(q - \Pi^\top 1_m) \\ &\quad{\color{red}-} \Tr(\Lambda^\top\Pi) \end{aligned} $$OT dual function:
$$g(f,g,\Lambda) = \begin{cases}f^\top p + g^\top q & 0 = C - f1_n^\top - 1_mg^\top - \Lambda\\ -\infty & \text{otherwise}\end{cases}$$OT dual problem:
$$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} &&f^\top p + g^\top q \\ &\subjectto && f1_n^\top + 1_mg^\top \preceq C \end{aligned} $$SVM problem:
$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1:\alpha_i \ge 0\}_{i=1}^n \end{aligned} $$
Slater's CQ is equivalent to the two classes being strictly separable in the training data.
SVM Lagrangian:
$$\textstyle L(\beta_0,\beta,\alpha) \triangleq \frac12\|\beta\|_2^2 + \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) $$SVM dual function:
$$ g(\alpha) = \begin{cases}\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j &\alpha^\top Y = 0 \\ -\infty &\text{otherwise}\end{cases} $$SVM dual problem:
$$ \begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j \\ &\subjectto &&\alpha^\top Y = 0. \end{aligned} $$OT problem:
$$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) \\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \\ & &&\Pi\succeq 0 &&:\Lambda\in\reals_+^{m\times n} \end{aligned} $$OT KKT conditions:
$$ \begin{aligned} C - f1_n^\top - 1_mg^\top - \Lambda = 0&&\text{(stationarity)} \\ \Pi1_n = p, \Pi^\top 1_m = q, \Pi\succeq 0 &&\text{(primal feasibility)} \\ \Lambda\succeq 0 &&\text{(dual feasibility)} \\\textstyle \Tr(\Lambda^\top\Pi) = 0 &&\text{(comp. slackness)} \end{aligned} $$
OT dual problem:
$$ \begin{aligned} &\max\nolimits_{f\in\reals^m,g\in\reals^n} && f^\top p + g^\top q \\ &\subjectto && f 1_n^\top + 1_mg^\top \preceq C \equiv\{f_i + g_j \le C_{i,j}\}_{i,j = 1}^{m,n} \end{aligned} $$The $c$-concavity argument shows that
$$ \begin{aligned} f_i = \min\nolimits_j\{C_{i,j} - g_j\},\quad i\in[m], \\ g_j = \min\nolimits_i\{C_{i,j} - f_i\},\quad j\in[n]. \end{aligned} $$
Complementary slackness + stationarity:
$$ \begin{aligned} \Pi_{i,j} > 0 &\Rightarrow \Lambda_{i,j} = 0 \\ &\equiv f_i + g_j = C_{i,j} \end{aligned} $$entropy regularized OT problem (finite $\cX$ and $\cY$):
$$ \begin{aligned} &\min\nolimits_{\Pi\in\reals^{m\times n}} &&\Tr(C^\top\Pi) + \eps \Tr(\Pi^\top\log\Pi) \\ &\subjectto &&\Pi1_n = p &&: f\in\reals^m \\ & &&\Pi^\top 1_m = q &&: g\in\reals^n \end{aligned} $$
The regularizer $\Tr(\Pi^\top\log\Pi)$ is the negative entropy of $\Pi$ ($\log$ acts entrywise), and the non-negativity constraint is implicit in the domain of $\log$.
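A common way to solve this regularized problem numerically is the Sinkhorn algorithm, which alternately rescales the rows and columns of a positive kernel until both marginal constraints hold. The following is a minimal NumPy sketch, not a reference implementation. It assumes the usual convention in which negative entropy is penalized, so the kernel is $K = \exp(-C/\eps)$ entrywise. The function name `sinkhorn`, the fixed iteration count, and the toy `C`, `p`, `q` are illustrative assumptions.

```python
import numpy as np

def sinkhorn(C, p, q, eps=0.5, n_iter=200):
    """Alternately rescale rows and columns of K = exp(-C / eps)."""
    K = np.exp(-C / eps)           # entrywise kernel (assumed sign convention)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)            # enforce row sums:    diag(u) K v = p
        v = q / (K.T @ u)          # enforce column sums: diag(v) K^T u = q
    return u[:, None] * K * v[None, :]   # Pi = diag(u) K diag(v)

# Hypothetical toy instance with two source and two target points.
p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])

Pi = sinkhorn(C, p, q)
print(Pi.sum(axis=1), Pi.sum(axis=0))    # should approximately match p and q
```

For small `eps` the entries of `K` can underflow and convergence slows down; practical implementations usually work in the log domain, which this sketch omits for brevity.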
entropy regularized OT KKT conditions:
$$ \begin{aligned}\textstyle \Pi = e^{-1}\diag(\exp(\frac{f}{\eps}))\exp(-\frac{C}{\eps})\diag(\exp(\frac{g}{\eps})) &&\text{(stationarity)} \\ \begin{aligned} \Pi1_n = p \\ \Pi^\top1_m = q \end{aligned} &&\text{(primal feasibility)} \end{aligned} $$
Here $\exp$ and $\log$ act entrywise. Stationarity of the Lagrangian of the objective $\Tr(C^\top\Pi) + \eps\Tr(\Pi^\top\log\Pi)$ reads $C + \eps(\log\Pi + 1_m1_n^\top) - f1_n^\top - 1_mg^\top = 0$, which is solved for $\Pi$ above.
Figures: OT map; entropy regularized OT map.
From stationarity,
$$\textstyle \Pi = \diag(u)\exp(-\frac1\eps C)\diag(v)\text{ for some }u\in\reals_+^m,v\in\reals_+^n. $$
From primal feasibility, we see that $u$ and $v$ satisfy
$$ \begin{aligned}\textstyle \diag(u)\exp(-\frac1\eps C)\diag(v)1_n = p, \\\textstyle \diag(v)\exp(-\frac1\eps C)^\top\diag(u)1_m = q. \end{aligned} $$
Sinkhorn algorithm (divisions are entrywise; note that $\diag(v)1_n = v$ and $\diag(u)1_m = u$):
$$ \begin{aligned}\textstyle u_{t+1} \gets \frac{p}{\exp(-\frac1\eps C)v_t}\\\textstyle v_{t+1} \gets \frac{q}{\exp(-\frac1\eps C)^\top u_{t+1}} \end{aligned} $$
SVM problem:
$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{1 - Y_i(\beta_0 + \beta^\top X_i) \le 0:\alpha_i \ge 0\}_{i=1}^n \end{aligned} $$SVM KKT conditions:
$$ \begin{aligned} \begin{aligned}\textstyle 0 = \sum_{i=1}^n\alpha_iY_i \\\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i \end{aligned}&&\text{(stationarity)} \\ \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n &&\text{(primal feasibility)} \\ \alpha\succeq 0 &&\text{(dual feasibility)} \\\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)) &&\text{(comp. slackness)} \end{aligned} $$From stationarity
$$\textstyle \beta = \sum_{i=1}^n\alpha_iX_iY_i,\quad\alpha\in\reals_+^n, $$we see that only the $(X_i,Y_i)$'s such that $\alpha_i > 0$ (directly) affect $\beta$.
From complementary slackness
$$\textstyle 0 = \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)), $$
we see that $\alpha_i > 0$ implies $Y_i(\beta_0 + \beta^\top X_i) = 1$: every summand is non-positive by primal and dual feasibility, so each summand must vanish.
Training samples such that $\alpha_i > 0$ (and hence $Y_i(\beta_0 + \beta^\top X_i) = 1$) are called support vectors because they "support" the decision boundary.
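The SVM KKT conditions can be checked numerically on a small example. The sketch below uses a hypothetical four-point data set and a candidate primal-dual pair worked out by hand for it (both are illustrative assumptions, not from the slides). It verifies stationarity, feasibility and complementary slackness, reads off the support vectors, and compares the primal and dual objective values.

```python
import numpy as np

# Hypothetical linearly separable toy data (rows of X are the X_i).
X = np.array([[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-2.0, 1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])

# Candidate primal-dual pair, worked out by hand for this toy data.
beta0 = 0.0
alpha = np.array([0.5, 0.0, 0.5, 0.0])
beta = (alpha * Y) @ X                   # stationarity: beta = sum_i alpha_i Y_i X_i

margins = Y * (beta0 + X @ beta)         # Y_i (beta_0 + beta^T X_i)

stationarity = bool(np.isclose(alpha @ Y, 0.0))        # sum_i alpha_i Y_i = 0
primal_feasible = bool((margins >= 1 - 1e-12).all())   # all margins >= 1
dual_feasible = bool((alpha >= 0).all())               # alpha >= 0
comp_slack = bool(np.isclose(alpha @ (1 - margins), 0.0))

support = np.flatnonzero(alpha > 0)      # indices of the support vectors
primal_value = 0.5 * beta @ beta
dual_value = alpha.sum() - 0.5 * beta @ beta

print(stationarity, primal_feasible, dual_feasible, comp_slack)
print(support, margins[support], primal_value, dual_value)
```

Because the problem is convex, a pair that passes all four checks is optimal, and the equal primal and dual values illustrate strong duality on this toy instance. The points with $\alpha_i = 0$ could be moved (without violating their margin constraints) and $\beta$ would not change.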
The LASSO problem
$$\textstyle \min_{\beta\in\reals^d}\frac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 $$
has no constraints, so its Lagrangian has no multipliers and its dual function is a constant (the optimal value itself), which is uninformative.
Consider the equivalent linearly-constrained problem
$$ \begin{aligned} &\min\nolimits_{\widehat{y}\in\reals^n,\beta\in\reals^d} &&\textstyle\frac12\|y-\widehat{y}\|_2^2 + \lambda\|\beta\|_1 \\ &\subjectto &&\widehat{y} = X\beta &&:r\in\reals^n \end{aligned} $$This is a standard trick to obtain a non-trivial dual function/problem for unconstrained problems.
Slater's condition is satisfied (as long as $X \ne 0$).
The dual function is
$$ \begin{aligned} g(r) &\textstyle= \min_{\beta,\widehat{y}}\frac12\|y-\widehat{y}\|_2^2 + \lambda\|\beta\|_1 + r^\top(\widehat{y} - X\beta) \\ &\textstyle= \frac12\|y\|_2^2 - \sup_{\widehat{y}}\{(y-r)^\top\widehat{y} - \frac12\|\widehat{y}\|_2^2\} \\ &\textstyle\quad - \sup_\beta\{(X^\top r)^\top\beta - \lambda\|\beta\|_1\} \\ &\textstyle= \frac12\|y\|_2^2 - \frac12\|y - r\|_2^2 - I_{\lambda\bB_\infty^d}(X^\top r). \end{aligned} $$The dual problem is
$$ \begin{aligned} &\max\nolimits_{r\in\reals^n} &&\textstyle\frac12\|y\|_2^2 - \frac12\|y - r\|_2^2, \\ &\subjectto &&\|X^\top r\|_\infty \le \lambda. \end{aligned} $$
Since $\frac12\|y\|_2^2$ is a constant, the dual problem is equivalent to projecting $y$ onto the pre-image of the unit $\ell_\infty$ norm ball under $\frac1\lambda X^\top$:
$$ \begin{aligned} &\min\nolimits_{r\in\reals^n} &&\textstyle\frac12\|y - r\|_2^2, \\ &\subjectto &&\|X^\top r\|_\infty \le \lambda. \end{aligned} $$
The optimal value of this version of the dual problem is not equal to the optimal value of the LASSO problem: by strong duality, the LASSO optimal value equals $\frac12\|y\|_2^2$ minus the optimal value of this projection problem.
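The relation between the LASSO and its dual can be illustrated numerically. The sketch below solves a small random LASSO instance with proximal gradient descent (ISTA), a solver swapped in here for illustration rather than taken from the slides. It then forms the dual candidate $r = y - X\beta$ and checks dual feasibility $\|X^\top r\|_\infty \le \lambda$ and a vanishing duality gap. The data, the value of `lam`, and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Hypothetical random instance with n > d.
rng = np.random.default_rng(0)
n, d, lam = 20, 5, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# ISTA: gradient step on the quadratic term, then soft-thresholding.
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
beta = np.zeros(d)
for _ in range(5000):
    grad = X.T @ (X @ beta - y)
    beta = soft_threshold(beta - step * grad, step * lam)

r = y - X @ beta                                  # dual candidate (residual)
primal = 0.5 * r @ r + lam * np.abs(beta).sum()   # LASSO objective
proj_value = 0.5 * (y - r) @ (y - r)              # projection objective at r
dual = 0.5 * y @ y - proj_value                   # dual objective at r
dual_infeasibility = np.abs(X.T @ r).max() - lam  # <= 0 up to rounding

print(primal, dual, dual_infeasibility)
```

At convergence `primal` and `dual` agree up to numerical error, while `proj_value` differs from both, in line with the remark that the projection form has a different optimal value.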