SVM problem derivation

This post supplements the support vector machines slides. Please see the slides for the setup.

The margin maximization problem

In matrix notation, the SVM problem in the slides is

\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p,M\ge 0} && M \\ &\subjectto &&\|\beta\|_2 = 1 \\ &&&\{Y_i(\beta_0 + \beta^\top X_i) \ge M\}_{i=1}^n \end{aligned},\]

where the constraint \(\|\beta\|_2 = 1\) ensures \(Y_i(\beta_0 + \beta^\top X_i)\) is the margin of the \(i\)-th training sample (WRT to the decision boundary \(\{x\in\reals^d\mid\beta_0 + \beta^\top x = 0\}\)) and \(M\) is a lower bound on the margins of all \(n\) training samples. If we fix \(\beta_0,\beta\) and maximize with respect to \(M\), then the optimal \(M\) is the minimum margin.

To formulate this problem in a more familiar way, we drop the constraint \(\|\beta\|_2 = 1\) and change the margin constraints to

\[\textstyle Y_i(\frac{\beta_0 + \beta^\top X_i}{\|\beta\|_2}) \ge M.\]

We divide the left side of the margin constraints by \(\|\beta\|_2\) so that the left side remains the margin to the decision boundary \(\{x\in\reals^d\mid\beta_0 + \beta^\top x = 0\}\) (even when \(\|\beta\|_2 \ne 1\)). The margin constraints are equivalently

\[Y_i(\beta_0 + \beta^\top X_i) \ge M\|\beta\|_2.\]

Let \(M' \triangleq M\|\beta\|_2\). In terms of \(\beta_0,\beta,M'\), the margin maximization problem is

\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p,M'\ge 0} &&\textstyle\frac{M'}{\|\beta\|_2} \\ &\subjectto &&\{Y_i(\beta_0 + \beta^\top X_i) \ge M'\}_{i=1}^n \end{aligned}.\]

This optimization problem is overparameterized because \((\beta_0,\beta,M')\) and \((\alpha\beta_0,\alpha\beta,\alpha M')\) for some \(\alpha > 0\) correspond to the same decision boundary. To identify each decision boundary with a unique tuple, we pick a tuple from each set of equivalent tuples (tuples that correspond to the same decision boundary) to represent each decision boundary. For mathematical convenience, we pick the tuple such that \(M' = 1\). This allows us to simplify the margin maximization problem

\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac{1}{\|\beta\|_2} \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n \end{aligned}.\]

Finally, we note that maximizing \(\frac{1}{\|\beta\|_2}\) is equivalent to minimizing \(\frac12\|\beta\|_2^2\), which leads to the more familiar form of the margin maximization problem

\[\begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n \end{aligned}.\]

Soft-margin maximization for non-separable data

If the data is non-separable, then the vanilla margin maximization optimization problem is infeasible. To handle such non-separable data, we relax the (hard) margin constraints to “soft” margin conditions by introducing slack variables \(\xi_1,\dots,\xi_n \ge 0\):

\[\begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p,\xi\in\reals_+^n} &&\textstyle\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\xi_i \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1 - \xi_i\}_{i=1}^n \end{aligned},\]

where \(C > 0\) is tuning parameter. The slack variables keep the optimization problem feasible even when the data is non-separable by allowing violations of the original margin constraints. That said, we wish to keep such violations to a minimum, so we minimize the sum of the violations.

We can simplify the soft margin maximization problem by minimizing with respect to the slack variables. For fixed \(\beta_0,\beta\), if the original margin constraint is feasible (i.e. \(Y_i(\beta_0 + \beta^\top X_i) \ge 1\)), then the optimal \(\xi_i\) is 0. On the other hand, if the original margin constraint is violated (i.e. \(Y_i(\beta_0 + \beta^\top X_i) < 1\)), then the optimal \(\xi_i\) is \(1 - Y_i(\beta_0 + \beta^\top X_i)\) because this is the smallest \(\xi_i\) that ensures the soft margin constraint is feasible. In summary, the optimal slack variables (for fixed \(\beta_0,\beta\)) are

\[\xi_i = \max\{0,1-Y_i(\beta_0 + \beta^\top X_i)\}.\]

Plugging the optimal values of the slack variables into the soft margin maximization problem, we obtain

\[\begin{aligned} \textstyle\min_{\beta_0\in\reals,\beta\in\reals^p}\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\ell(Y_i(\beta_0 + \beta^\top X_i))\\ \ell(z) \triangleq \max\{0,1-z\}, \end{aligned}\]

where \(\ell\) is known as the hinge loss function. We see that the soft margin minimization problem penalizes margins that are smaller than 1 (including negative margins). This is intuitive because small margins are undesirable: a small non-negative margin indicates the corresponding sample is close to the decision boundary, while a negative margin indicates the corresponding sample on the wrong side of the decision boundary.

We see that the soft margin maximization problem is a regularized (empirical) risk minimization problem: the first term is a model complexity term (the same model complexity as ridge regression), while the second term is a goodness-of-fit term. Here \(C\) plays the same role as a regularization parameter: it trades-off goodness-of-fit with model complexity.

Posted on August 09, 2024 from Ann Arbor, MI.