Training deep neural nets

DATASCI 415: Statistical Learning and Data Mining

University of Michigan

including slides by Mu Li and Alex Smola

Training deep neural nets

Model capacity
Weight decay
Dropout
Batch normalization (BatchNorm)
Initialization

Weight decay demo

Dropout motivation

Idea: ensure model is robust to small changes in its input by perturbing the inputs during training

This is a form of capacity control: it ensures the model depends smoothly on its inputs.
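
A minimal sketch of one common way to perturb activations during training, assuming the standard "inverted dropout" scheme (the drop probability `p` and the function name `dropout` are illustrative choices, not taken from the slides). Each entry is zeroed with probability `p` and the survivors are rescaled by `1/(1-p)`, so the perturbed activations keep the same expected value.

```python
import numpy as np

def dropout(h, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    mask = rng.random(h.shape) >= p   # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)       # rescaling keeps the expectation equal to h

rng = np.random.default_rng(0)
h = np.ones((1000, 100))              # toy activations, all equal to 1
h_dropped = dropout(h, p=0.5, rng=rng)
```

At test time the perturbation is switched off and `h` is used unchanged.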

Batch normalization

$$\begin{aligned} \mu &\gets\textstyle\frac{1}{|\cB|}\sum_{i\in\cB}h_i \\ \sigma^2 &\gets\textstyle\frac{1}{|\cB|}\sum_{i\in\cB}(h_i - \mu)^2 \\ h_{i+1} &\gets\textstyle\gamma\frac1\sigma(h_i - \mu) + \beta \end{aligned}$$

$\beta$ and $\gamma$ are the trainable shift (mean) and scale (standard deviation) parameters
ideal batch size is 64–256
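
The update above can be sketched in NumPy as follows. This is a training-mode forward pass only, applied per feature over a mini-batch; the small constant `eps` is an assumption added for numerical stability and does not appear in the formula.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Normalize a mini-batch h of shape (batch, features), then scale and shift."""
    mu = h.mean(axis=0)                    # per-feature batch mean
    var = h.var(axis=0)                    # per-feature batch variance (1/|B| normalization)
    h_hat = (h - mu) / np.sqrt(var + eps)  # eps is an assumed stabilizer, not in the formula
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(128, 4))   # toy batch of 128 examples, 4 features
out = batch_norm_forward(h, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output feature has batch mean close to 0 and batch standard deviation close to 1.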

Batch normalization is a form of regularization by noise injection:

$\mu$ is a random shift,
$\sigma$ is a random scale.

Batch size cannot be too large (or $\mu$ and $\sigma$ are effectively fixed).
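
A small simulation of the batch-size effect, under the assumption that the activations of one unit are i.i.d. standard normal (a toy setup, not from the slides). It measures how much the batch mean and batch standard deviation fluctuate from one random mini-batch to the next.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=100_000)   # hypothetical pool of activations for one unit

def batch_stat_spread(batch_size, n_batches=2000):
    """Std. dev., across random mini-batches, of the batch mean and the batch std."""
    idx = rng.integers(0, activations.size, size=(n_batches, batch_size))
    batches = activations[idx]           # shape (n_batches, batch_size)
    return batches.mean(axis=1).std(), batches.std(axis=1).std()

spread_small = batch_stat_spread(16)     # noisy statistics: strong regularization
spread_large = batch_stat_spread(1024)   # nearly fixed statistics: little noise
```

Both spreads shrink roughly like $1/\sqrt{|\cB|}$, so large batches inject very little noise.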

Loss minimization

$$\textstyle\min_w L(w)$$

$L$ is the training loss
$w$ is a vector of parameters (e.g. NN weights)

training loop:

randomly select $\cB\subset[n]$
approximate gradient $g_t\gets\frac{1}{|\cB|}\sum_{i\in\cB}\nabla\ell_i(w_t)$
update $w_{t+1}\gets w_t - \eta_t g_t$

until converged
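
The loop above can be sketched on a toy least-squares problem. The synthetic data, the constant step size `eta`, the batch size, and the fixed step count (used here in place of a convergence check) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))                      # synthetic features
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)       # synthetic targets with small noise

def minibatch_grad(w, batch):
    """Average gradient of l_i(w) = 0.5 * (x_i @ w - y_i)**2 over the mini-batch."""
    residual = X[batch] @ w - y[batch]
    return X[batch].T @ residual / len(batch)

w = np.zeros(d)
eta = 0.1                                        # constant step size (assumed)
batch_size = 32
for t in range(2000):                            # fixed budget instead of a convergence test
    batch = rng.choice(n, size=batch_size, replace=False)  # randomly select a mini-batch
    g = minibatch_grad(w, batch)                 # approximate gradient g_t
    w = w - eta * g                              # update using g_t, not the full gradient
```

Each step touches only `batch_size` examples, which is what makes the stochastic gradient cheaper than the full gradient of $L$.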

OT problem

He initialization

Xavier initialization is designed for MLPs with identity and sigmoid-type activations. For MLPs with ReLU activations, He et al. suggest choosing the weight variance $\gamma_t^2$ so that

$$\frac{\gamma_t^2}{2}(\frac{n_t}{2} + \frac{n_{t-1}}{2}) = 1.$$

Gaussian weights: $w_{i,j}^{(t)}\sim N(0,\frac{4}{n_{t-1}+n_t})$
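uniform weights: $w_{i,j}^{(t)}\sim\unif(-(\frac{12}{n_{t-1}+n_t})^{\frac12},(\frac{12}{n_{t-1}+n_t})^{\frac12})$ (a uniform distribution on $(-a,a)$ has variance $a^2/3$, so this matches the Gaussian variance $\frac{4}{n_{t-1}+n_t}$)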
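
A minimal sketch of the Gaussian variant, using the averaged fan-in/fan-out variance $\frac{4}{n_{t-1}+n_t}$ from the condition above. The function name `he_normal` and the layer sizes are hypothetical; note that many libraries implement He initialization with the fan-in only, which is a different convention.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Gaussian weights with variance 4 / (n_in + n_out) (averaged fan-in/fan-out form)."""
    variance = 4.0 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(variance), size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_normal(n_in=512, n_out=256, rng=rng)   # hypothetical layer sizes
empirical_var = W.var()                       # sample variance of the drawn weights
target_var = 4.0 / (512 + 256)
```

The sample variance of `W` should land close to `target_var`, which is a quick sanity check when writing an initializer by hand.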