Convex optimization problems

STATS 606: Computation and Optimization Methods in Statistics

University of Michigan

including slides by Stephen Boyd and Lieven Vandenberghe

Ex: optimal transport

Given

  1. transport cost function $c:\cX\times\cY\to\reals$
  2. two distributions $P\in\Delta(\cX)$, $Q\in\Delta(\cY)$

find a transport plan $\pi:\cX\times\cY\to\reals$ ($\pi(x,y)$ is the mass transported from $x$ to $y$) that

  1. transports $P$ to $Q$:

    $$\textstyle\int_{\cY}\pi(x,y)dy = p(x),\quad\int_{\cX}\pi(x,y)dx = q(y)$$
  2. minimizes the total transport cost $\int_{\cX\times\cY}c(x,y)\pi(x,y)dxdy$

Ex: optimal transport

If $\cX$ and $\cY$ are finite sets, then the optimal transport problem is an LP:

$$ \begin{aligned} &\min\nolimits_{\Pi\in\reals_+^{m\times n}} &&\Tr(C^\top\Pi) \\ &\subjectto &&\Pi1_n = p \\ & &&\Pi^\top 1_m = q \end{aligned} \tag{OT} $$
  • $\Pi_{i,j}$ is the mass transported from $x_i$ to $y_j$
  • $C_{i,j}$ is the transport cost from $x_i$ to $y_j$
  • $p$, $q$ are the PMFs of $P$, $Q$ respectively
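As a rough illustration, (OT) can be handed to a generic LP solver. The sketch below assumes SciPy is available and uses `scipy.optimize.linprog`; the helper name `solve_discrete_ot` and the toy data are made up for this example. $\Pi$ is flattened row by row, so the two marginal constraints become one block of equality constraints.

```python
import numpy as np
from scipy.optimize import linprog


def solve_discrete_ot(C, p, q):
    """Solve (OT) as an LP; returns the plan Pi (m x n) and its cost."""
    C = np.asarray(C, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m, n = C.shape
    # Flatten Pi row by row, so entry (i, j) sits at index i * n + j.
    row_sums = np.kron(np.eye(m), np.ones((1, n)))  # encodes Pi 1_n = p
    col_sums = np.kron(np.ones((1, m)), np.eye(n))  # encodes Pi^T 1_m = q
    res = linprog(
        c=C.ravel(),                 # Tr(C^T Pi) = sum_ij C_ij Pi_ij
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([p, q]),
        bounds=(0, None),            # Pi >= 0 entrywise
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(m, n), res.fun


# Toy instance (hypothetical): points on the real line with cost |x - y|.
x = np.array([0.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
C = np.abs(x[:, None] - y[None, :])
p = np.array([0.5, 0.5])
q = np.array([0.25, 0.5, 0.25])
Pi, cost = solve_discrete_ot(C, p, q)
```

For this toy instance the optimal cost is $0.5$; the optimal plan is not unique, so only the cost and the marginals of `Pi` are determined.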

Ex: optimal transport

[Figure: two panels, "OT problem" and "OT map"]

Ex: support vector machines

Given $\cD\triangleq\{(X_i,Y_i)\}_{i=1}^n\subset\reals^p\times\{-1,1\}$, find the max margin (linear) classifier.

Ex: support vector machines

To (strictly) separate $$ \begin{aligned} \cD_1\triangleq\{X_i\mid(X_i,Y_i)\in\cD,Y_i = 1\},\\ \cD_{-1}\triangleq\{X_i\mid(X_i,Y_i)\in\cD,Y_i = -1\} \end{aligned} $$

with a hyperplane, we require

$$\{Y_i(\beta_0 + \beta^\top X_i) > 0\}_{i=1}^n.$$

Since scaling $(\beta_0,\beta)$ by a positive constant does not change the hyperplane and there are finitely many samples, the preceding constraints are equivalent (after rescaling) to

$$\{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n.$$

Ex: support vector machines

Fact: The (Euclidean) distance between the hyperplanes

$$ \begin{aligned} \cH_1\triangleq\{x\mid\beta_0 + \beta^\top x = 1\},\\ \cH_{-1}\triangleq\{x\mid\beta_0 + \beta^\top x = -1\} \end{aligned} $$

is $\dist(\cH_1,\cH_{-1}) = \frac{2}{\|\beta\|_2}$.
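The distance formula can be checked numerically by constructing a point on $\cH_1$ and moving along $-\beta$ until it lands on $\cH_{-1}$. This is only a sanity check; the function name `margin_width` and the example values are arbitrary.

```python
import numpy as np


def margin_width(beta0, beta):
    """Distance between {beta0 + beta^T x = 1} and {beta0 + beta^T x = -1}."""
    beta = np.asarray(beta, dtype=float)
    sq_norm = beta @ beta
    # A point on H_1: the multiple of beta satisfying beta0 + beta^T x = 1.
    x_plus = (1.0 - beta0) * beta / sq_norm
    # Its orthogonal projection onto H_{-1}: step along -beta until the
    # affine function drops from 1 to -1.
    x_minus = x_plus - 2.0 * beta / sq_norm
    assert np.isclose(beta0 + beta @ x_plus, 1.0)
    assert np.isclose(beta0 + beta @ x_minus, -1.0)
    return np.linalg.norm(x_plus - x_minus)


beta0, beta = 0.5, np.array([3.0, 4.0])  # arbitrary example values
width = margin_width(beta0, beta)        # equals 2 / ||beta||_2 = 2 / 5
```

Because the step from `x_plus` to `x_minus` is parallel to $\beta$, which is normal to both hyperplanes, its length is the distance between them.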

Thus to separate $\cD_1$ and $\cD_{-1}$ with the maximum margin, we solve the (hard-margin) SVM problem:

$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n \end{aligned}. \tag{SVM} $$

Ex: support vector machines

If $\cD_1$ and $\cD_{-1}$ are not linearly separable, then (SVM) is infeasible!

To restore feasibility, we add slack variables $\{\xi_i\}_{i=1}^n\subset\reals_+$ to (SVM):

$$ \begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p,\xi\in\reals_+^n} &&\textstyle\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\xi_i \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1 - \xi_i\}_{i=1}^n \end{aligned}. $$

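This is the soft-margin SVM problem. For fixed $(\beta_0,\beta)$, the smallest feasible slack is $\xi_i = \max\{0,1 - Y_i(\beta_0 + \beta^\top X_i)\}$, so the problem is often written as a regularized (empirical) risk minimization problem: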

$$ \begin{aligned} \textstyle\min_{\beta_0\in\reals,\beta\in\reals^p}\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\ell(Y_i(\beta_0 + \beta^\top X_i)), \\ \ell(z) \triangleq \max\{0,1-z\}. \end{aligned} $$
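To see why the two formulations agree, one can fix $(\beta_0,\beta)$, take the smallest feasible slacks, and compare the two objectives. The sketch below does this on a made-up three-point data set; the function names are illustrative only.

```python
import numpy as np


def soft_margin_objective(beta0, beta, xi, X, Y, C):
    """Objective of the slack formulation; (beta0, beta, xi) must be feasible."""
    margins = Y * (beta0 + X @ beta)
    assert np.all(xi >= 0) and np.all(margins >= 1 - xi - 1e-12)
    return 0.5 * beta @ beta + C * xi.sum()


def hinge_objective(beta0, beta, X, Y, C):
    """Objective of the regularized risk formulation with the hinge loss."""
    margins = Y * (beta0 + X @ beta)
    return 0.5 * beta @ beta + C * np.maximum(0.0, 1.0 - margins).sum()


# Made-up data: three points in R^2 with labels in {-1, 1}.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
Y = np.array([1.0, 1.0, -1.0])
beta0, beta, C = 0.0, np.array([1.0, 0.0]), 10.0

margins = Y * (beta0 + X @ beta)          # [2.0, 0.5, 1.0]
xi_min = np.maximum(0.0, 1.0 - margins)   # smallest feasible slacks: [0, 0.5, 0]
obj_slack = soft_margin_objective(beta0, beta, xi_min, X, Y, C)
obj_hinge = hinge_objective(beta0, beta, X, Y, C)
```

Any larger feasible slack vector only increases the slack objective, so minimizing over $\xi$ first recovers the hinge-loss objective.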

Ex: Principal component analysis

Given $\{x_i\}_{i=1}^n$, find the $k$-dimensional subspace that best approximates $\{x_i\}_{i=1}^n$:

$$ \begin{aligned} &\min\nolimits_{P\in\symm^n}&&\textstyle\frac12\|X - XP\|_F^2 = \frac12\sum_{i=1}^n\|x_i^\top - x_i^\top P\|_2^2\\ &\subjectto &&P\text{ is a projector} \\ & &&\rank(P) = k \end{aligned} $$

This problem (despite its non-convexity) has a closed-form solution in terms of the singular value decomposition (SVD) of $X$:

$$P_* = V_kV_k^\top,$$

where $X = U\Sigma V^\top$ is the SVD of $X$ (singular values in decreasing order) and $V_k$ consists of the first $k$ columns of $V$, i.e. the top $k$ right singular vectors.
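A minimal NumPy sketch of this closed form, on synthetic data generated only for illustration. Rows of `X` play the role of the $x_i^\top$, so `P` is square with side equal to the dimension of each $x_i$. The helper name `pca_projector` is hypothetical.

```python
import numpy as np


def pca_projector(X, k):
    """Return P = V_k V_k^T built from the top-k right singular vectors of X."""
    # full_matrices=False gives the thin SVD; singular values come sorted in
    # decreasing order, so the first k rows of Vt are the top-k directions.
    _, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T
    return V_k @ V_k.T, sigma


rng = np.random.default_rng(0)     # synthetic data, for illustration only
X = rng.standard_normal((50, 5))   # rows are the data points x_i^T
k = 2
P, sigma = pca_projector(X, k)

# Squared approximation error; it equals the sum of the discarded squared
# singular values.
residual = np.linalg.norm(X - X @ P, "fro") ** 2
```

The resulting `P` is symmetric, idempotent, and has trace $k$, so it is feasible for the PCA problem.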

Ex: Principal component analysis

$$ \begin{aligned} &\left\{\begin{aligned} &\min\nolimits_{P\in\symm^n}&&\textstyle\frac12\|X - XP\|_F^2 \\ &\subjectto &&P\text{ is a projector} \\ & &&\rank(P) = k \end{aligned}\right\} \\ &\quad\equiv \left\{\begin{aligned} &\max\nolimits_{P\in\symm^n}&&\Tr(X^\top XP) \\ &\subjectto &&P\text{ is a projector} \\ & &&\rank(P) = k \end{aligned}\right\} \end{aligned} $$

The equivalence holds because $\|X - XP\|_F^2 = \Tr(X^\top X) - \Tr(X^\top XP)$ for any projector $P$.

The feasible set is the (non-convex) set

$$\{P\in\symm^n\mid \lambda_i(P)\in\{0,1\},\Tr(P) = k\}.$$

Ex: Principal component analysis

We relax the PCA problem by replacing the feasible set with its convex hull:

$$ \begin{aligned} \cF_k &\triangleq \{P\in\symm^n\mid \lambda_i(P)\in{\color{red}[0,1]},\Tr(P) = k\} \\ &= \{P\in\symm^n\mid 0\preceq P\preceq I_n,\Tr(P) = k\}. \end{aligned} $$

This set is called the $k$-th order Fantope.

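Remarkably, the relaxed problem has the same optimal solution as the original PCA problem! The objective is linear in $P$, so it attains its optimum over $\cF_k$ at an extreme point, and the extreme points of $\cF_k$ are exactly the rank-$k$ projectors.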
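The following sketch illustrates the relaxation on a $2\times 2$ example with $k = 1$: the midpoint of two projectors belongs to $\cF_k$ without being a projector, and a linear objective evaluated there lies between the values at the two projectors. The helper names and the matrix `A` (standing in for $X^\top X$) are illustrative assumptions.

```python
import numpy as np


def in_relaxed_set(P, k, tol=1e-9):
    """Check 0 <= P <= I (in the semidefinite order) and Tr(P) = k."""
    eig = np.linalg.eigvalsh((P + P.T) / 2)
    return bool(eig.min() >= -tol and eig.max() <= 1 + tol
                and abs(np.trace(P) - k) <= tol)


def is_projector(P, tol=1e-9):
    """Check that P is symmetric and idempotent."""
    return bool(np.allclose(P, P.T, atol=tol)
                and np.allclose(P @ P, P, atol=tol))


# Two rank-1 projectors in R^2 and their midpoint (illustrative values).
P1 = np.diag([1.0, 0.0])
P2 = np.diag([0.0, 1.0])
P_mid = 0.5 * (P1 + P2)  # eigenvalues (0.5, 0.5): in F_1, but not a projector

# A linear objective Tr(A P): the midpoint value is the average of the
# endpoint values, so it can never beat both of them.
A = np.array([[2.0, 0.0], [0.0, 1.0]])  # stands in for X^T X
values = [float(np.trace(A @ P)) for P in (P1, P_mid, P2)]  # [2.0, 1.5, 1.0]
```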

Ex: Fastest mixing random walk

Given a graph $\cG\triangleq\{\cV,\cE\}$, find symmetric (edge) weights $W_{i,j}\in[0,1]$ so that the weighted random walk $(X_t)_{t=0}^\infty$

$$ \Pr\{X_{t+1} = v_j\mid X_t = v_i\} = W_{i,j} $$

mixes as quickly as possible.

The matrix $W\in[0,1]^{n\times n}$ ($n\triangleq|\cV|$) satisfies

$$ \begin{aligned} W1_n = 1_n\text{ (it is stochastic)}, \\ W = W^\top\text{ (it is symmetric, hence doubly stochastic)}, \\ W_{i,j} = 0\text{ whenever }(v_i,v_j)\notin\cE. \end{aligned} $$

Ex: Fastest mixing random walk

The mixing time of $X_t$ depends on the second largest eigenvalue modulus (SLEM) of $W$:

$$\mu(W) \triangleq \max\nolimits_{i = 2,\dots,n}|\lambda_i(W)|.$$

Let $\pi_t\in\Delta^{n-1}$ be the distribution of $X_t$ (i.e. $\Pr\{X_t = v_i\} = [\pi_t]_i$). Then $\pi_t$ satisfies the recursion

$$\pi_t^\top = \pi_{t-1}^\top W = \dots = \pi_0^\top W^t.$$

The smaller the SLEM, the faster the random walk mixes:

$$\textstyle\frac12\|\pi_T - \frac1n1_n\|_1 \le \frac12\sqrt{n}\mu(W)^T.$$
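The recursion and the bound can be checked by simulation. The sketch below uses a lazy random walk on a 5-cycle, which is an illustrative choice of graph and weights rather than anything prescribed by the slides, and starts the walk at a single vertex.

```python
import numpy as np


def slem(W):
    """Second largest eigenvalue modulus of a symmetric stochastic matrix."""
    moduli = np.sort(np.abs(np.linalg.eigvalsh(W)))  # ascending order
    return moduli[-2]                                # the largest modulus is 1


# Lazy walk on a 5-cycle: stay put with probability 1/2, move to each
# neighbour with probability 1/4.
n = 5
shift = np.roll(np.eye(n), 1, axis=1)
W = 0.5 * np.eye(n) + 0.25 * (shift + shift.T)

mu = slem(W)
pi = np.zeros(n)
pi[0] = 1.0                      # start at vertex v_1
uniform = np.full(n, 1.0 / n)

tv_distances, bounds = [], []
for T in range(30):
    tv_distances.append(0.5 * np.abs(pi - uniform).sum())
    bounds.append(0.5 * np.sqrt(n) * mu**T)
    pi = pi @ W                  # pi_{T+1}^T = pi_T^T W
```

In this example the total variation distance stays below the bound at every step, and both decay geometrically at rate $\mu(W)$.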

Ex: Fastest mixing random walk

Fastest mixing Markov chain (FMMC) problem [[Boyd et al](https://epubs.siam.org/doi/10.1137/S0036144503423264)]:

$$ \begin{aligned} &\min\nolimits_{W\in[0,1]^{n\times n}} &&\mu(W) \\ &\subjectto && W1_n = 1_n,\quad W = W^\top \\ & &&W_{i,j} = 0\text{ for any }(v_i,v_j)\notin\cE. \end{aligned} $$

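$\mu$ is a convex function on the feasible set: for symmetric stochastic $W$, $\textstyle\mu(W) = \|W - \frac1n1_n1_n^\top\|_2$ (subtracting $\frac1n1_n1_n^\top$ removes the eigenvalue $1$ with eigenvector $1_n$), which is the spectral norm of an affine function of $W$.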

SDP form of FMMC problem:

$$ \begin{aligned} &\min\nolimits_{W\in[0,1]^{n\times n},\,t\in\reals} &&t \\ &\subjectto &&\textstyle-tI_n \preceq W - \frac1n1_n1_n^\top \preceq tI_n \\ & && W1_n = 1_n,\quad W = W^\top \\ & &&W_{i,j} = 0\text{ for any }(v_i,v_j)\notin\cE. \end{aligned} $$
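As a sanity check rather than an SDP solver, the sketch below evaluates the SLEM as a spectral norm and tests the matrix inequality directly through eigenvalues. It uses a path graph on three vertices with a single shared edge weight `a`, and assumes self-loops belong to $\cE$ so that diagonal entries may be nonzero. The helper names and this one-parameter family are assumptions made for illustration.

```python
import numpy as np


def slem_as_norm(W):
    """Spectral norm of W - (1/n) 1 1^T."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, 2)


def lmi_holds(W, t, tol=1e-9):
    """Check -t I <= W - (1/n) 1 1^T <= t I via eigenvalues."""
    n = W.shape[0]
    eig = np.linalg.eigvalsh(W - np.ones((n, n)) / n)
    return bool(eig.min() >= -t - tol and eig.max() <= t + tol)


def path_walk(a):
    """Symmetric stochastic W on the path v1 - v2 - v3 with edge weight a."""
    return np.array([[1 - a, a, 0.0],
                     [a, 1 - 2 * a, a],
                     [0.0, a, 1 - a]])


# Scan the edge weight; a <= 1/2 keeps W entrywise nonnegative.
grid = np.linspace(0.0, 0.5, 51)
mus = np.array([slem_as_norm(path_walk(a)) for a in grid])
a_best, mu_best = grid[mus.argmin()], mus.min()
```

Within this family the eigenvalues of $W$ are $1$, $1-a$, and $1-3a$, so the SLEM is $1-a$ on $[0,\frac12]$ and is smallest at $a = \frac12$, where it equals $\frac12$. The pair $(W,t)$ satisfies the matrix inequality exactly when $t\ge\mu(W)$, which is what the epigraph form encodes.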