DATASCI 415: Statistical Learning and Data Mining
University of Michigan
including slides by Mu Li and Alex Smola
(Multi-layer) Perceptrons
Multi-class classification
Optimization basics


regression outputs, predictions: (scalars) $y,\widehat{y}\in\reals$
($K$-class) classification
outputs & predictions are probability distributions over the $K$-classes; i.e. they encode beliefs
regression loss: $\mathrm{SquareLoss}(y,\widehat{y})\triangleq(y-\widehat{y})^2$
($K$-class) classification loss:
$$\textstyle \mathrm{CrossEntropy}(y,\widehat{y})\triangleq -y^\top\log\widehat{y} = -\sum_{k=1}^K y_k\log\widehat{y}_k $$Unlike the square loss, the cross entropy loss is not symmetric; i.e.
$$\mathrm{CrossEntropy}(y_1,y_2) \ne \mathrm{CrossEntropy}(y_2,y_1).$$If $y$ is a one-hot encoding, then $\mathrm{CrossEntropy}(e_k,\widehat{y}) = -\log\widehat{y}_k$. Thus the cross entropy loss is also called the log loss.
regression training: $\min_wL(w)$
$$\begin{aligned} L(w) &\textstyle= \frac1n\sum_{i=1}^n\mathrm{SquareLoss}(y_i,f(x_i;w)) \\ &\textstyle= \frac1n\sum_{i=1}^n(y_i-f(x_i;w))^2 \end{aligned}$$(multi-class) classifier training: $\min_wL(w)$
$$\begin{aligned} L(w) &\textstyle= \frac1n\sum_{i=1}^n\mathrm{CrossEntropy}(y_i,f(x_i;w)) \\ &\textstyle=-\frac1n\sum_{i=1}^ny_i^\top\log f(x_i;w) \end{aligned}$$If the $i$-th example is from the $k$-th class, then $$\mathrm{CrossEntropy}(y_i,f(x_i))= -\log f_k(x_i;w),$$ so cross-entropy minimization "pushes" $f_k(x_i;w)$ towards 1.
MLP with 1 hidden layer:
$$\begin{aligned} h^{(1)} &\gets \sigma(W^{(1)}x + b^{(1)}) \\ o &\gets (w^{(2)})^\top h^{(1)} + b^{(2)} \end{aligned}$$MLP with L hidden layers:
$$\begin{aligned} h^{(1)} &\gets \sigma(W^{(1)}x + b^{(1)}) \\ &\;\;\vdots \\ h^{(L)} &\gets \sigma(W^{(L)}h^{(L-1)} + b^{(L)}) \\ o &\gets (w^{(L+1)})^\top h^{(L)} + b^{(L+1)} \end{aligned}$$
Inputs
Loop
until converged

Loop
until converged
learning rate too small

learning rate too big

Gradient descent is guaranteed to converge to a (local) minimum as long as the learning rates are small enough.
The training loss is an average over training examples:
$$\textstyle L(w) = \frac1n\sum_{i=1}^n\ell_i(w),\quad\ell_i(w)\triangleq\ell(y_i,f(x_i;w)).$$Computing $\nabla L(w)$ can be costly (taking hours or even days) because it entails making a pass over all training examples:
$$\textstyle\nabla L(w) = \frac1n\sum_{i=1}^n\nabla\ell_i(w).$$Idea: approximate $\nabla L(w)$ with a random batch of training examples
$$\textstyle \frac{1}{|\cB|}\sum_{i\in\cB}\nabla\ell_i(w)\approx \nabla L(w)$$Loop
until converged
The batch size is an important hyperparameter