This post supplements the support vector machines slides. Please see the slides for the setup.
In matrix notation, the SVM problem in the slides is
\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p,M\ge 0} && M \\ &\subjectto &&\|\beta\|_2 = 1 \\ &&&\{Y_i(\beta_0 + \beta^\top X_i) \ge M\}_{i=1}^n \end{aligned},\]where the constraint \(\|\beta\|_2 = 1\) ensures \(Y_i(\beta_0 + \beta^\top X_i)\) is the margin of the \(i\)-th training sample (WRT to the decision boundary \(\{x\in\reals^d\mid\beta_0 + \beta^\top x = 0\}\)) and \(M\) is a lower bound on the margins of all \(n\) training samples. If we fix \(\beta_0,\beta\) and maximize with respect to \(M\), then the optimal \(M\) is the minimum margin.
To formulate this problem in a more familiar way, we drop the constraint \(\|\beta\|_2 = 1\) and change the margin constraints to
\[\textstyle Y_i(\frac{\beta_0 + \beta^\top X_i}{\|\beta\|_2}) \ge M.\]We divide the left side of the margin constraints by \(\|\beta\|_2\) so that the left side remains the margin to the decision boundary \(\{x\in\reals^d\mid\beta_0 + \beta^\top x = 0\}\) (even when \(\|\beta\|_2 \ne 1\)). The margin constraints are equivalently
\[Y_i(\beta_0 + \beta^\top X_i) \ge M\|\beta\|_2.\]Let \(M' \triangleq M\|\beta\|_2\). In terms of \(\beta_0,\beta,M'\), the margin maximization problem is
\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p,M'\ge 0} &&\textstyle\frac{M'}{\|\beta\|_2} \\ &\subjectto &&\{Y_i(\beta_0 + \beta^\top X_i) \ge M'\}_{i=1}^n \end{aligned}.\]This optimization problem is overparameterized because \((\beta_0,\beta,M')\) and \((\alpha\beta_0,\alpha\beta,\alpha M')\) for some \(\alpha > 0\) correspond to the same decision boundary. To identify each decision boundary with a unique tuple, we pick a tuple from each set of equivalent tuples (tuples that correspond to the same decision boundary) to represent each decision boundary. For mathematical convenience, we pick the tuple such that \(M' = 1\). This allows us to simplify the margin maximization problem
\[\begin{aligned} &\max\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac{1}{\|\beta\|_2} \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n \end{aligned}.\]Finally, we note that maximizing \(\frac{1}{\|\beta\|_2}\) is equivalent to minimizing \(\frac12\|\beta\|_2^2\), which leads to the more familiar form of the margin maximization problem
\[\begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1\}_{i=1}^n \end{aligned}.\]If the data is non-separable, then the vanilla margin maximization optimization problem is infeasible. To handle such non-separable data, we relax the (hard) margin constraints to “soft” margin conditions by introducing slack variables \(\xi_1,\dots,\xi_n \ge 0\):
\[\begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p,\xi\in\reals_+^n} &&\textstyle\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\xi_i \\ &\subjectto && \{Y_i(\beta_0 + \beta^\top X_i) \ge 1 - \xi_i\}_{i=1}^n \end{aligned},\]where \(C > 0\) is tuning parameter. The slack variables keep the optimization problem feasible even when the data is non-separable by allowing violations of the original margin constraints. That said, we wish to keep such violations to a minimum, so we minimize the sum of the violations.
We can simplify the soft margin maximization problem by minimizing with respect to the slack variables. For fixed \(\beta_0,\beta\), if the original margin constraint is feasible (i.e. \(Y_i(\beta_0 + \beta^\top X_i) \ge 1\)), then the optimal \(\xi_i\) is 0. On the other hand, if the original margin constraint is violated (i.e. \(Y_i(\beta_0 + \beta^\top X_i) < 1\)), then the optimal \(\xi_i\) is \(1 - Y_i(\beta_0 + \beta^\top X_i)\) because this is the smallest \(\xi_i\) that ensures the soft margin constraint is feasible. In summary, the optimal slack variables (for fixed \(\beta_0,\beta\)) are
\[\xi_i = \max\{0,1-Y_i(\beta_0 + \beta^\top X_i)\}.\]Plugging the optimal values of the slack variables into the soft margin maximization problem, we obtain
\[\begin{aligned} \textstyle\min_{\beta_0\in\reals,\beta\in\reals^p}\frac12\|\beta\|_2^2 + C\sum_{i=1}^n\ell(Y_i(\beta_0 + \beta^\top X_i))\\ \ell(z) \triangleq \max\{0,1-z\}, \end{aligned}\]where \(\ell\) is known as the hinge loss function. We see that the soft margin minimization problem penalizes margins that are smaller than 1 (including negative margins). This is intuitive because small margins are undesirable: a small non-negative margin indicates the corresponding sample is close to the decision boundary, while a negative margin indicates the corresponding sample on the wrong side of the decision boundary.
We see that the soft margin maximization problem is a regularized (empirical) risk minimization problem: the first term is a model complexity term (the same model complexity as ridge regression), while the second term is a goodness-of-fit term. Here \(C\) plays the same role as a regularization parameter: it trades-off goodness-of-fit with model complexity.
Posted on August 09, 2024 from Ann Arbor, MI.