DATASCI 415: Statistical Learning and Data Mining
University of Michigan
including slides by Mu Li and Alex Smola
Model capacity
Weight decay
Dropout
Batch normalization (BatchNorm)
Initialization
Idea: make the model robust to small changes in its input by perturbing the inputs during training.
This is a form of capacity control: it ensures the model depends smoothly on its inputs.
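The idea above can be sketched in a few lines. This is only an illustration, assuming zero-mean Gaussian noise as the perturbation; the helper name `perturb_inputs` and the value of `noise_std` are hypothetical choices, not something the slides prescribe.

```python
import numpy as np

def perturb_inputs(x, noise_std, rng):
    """Return x plus zero-mean Gaussian noise (hypothetical helper)."""
    # Zero-mean noise leaves the expected input unchanged: E[x + eps] = x.
    return x + rng.normal(loc=0.0, scale=noise_std, size=x.shape)

rng = np.random.default_rng(0)
x_batch = np.ones((4, 3))  # toy minibatch: 4 examples, 3 features
x_noisy = perturb_inputs(x_batch, noise_std=0.1, rng=rng)
# During training the model would be fit on x_noisy instead of x_batch;
# at test time the inputs are used unperturbed.
```

Because a fresh perturbation is drawn at every step, the model cannot rely on the exact value of any single input, which is what pushes it toward smooth dependence on its inputs.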
$\beta$ and $\gamma$ are the trainable shift and scale parameters: they set the mean and standard deviation of the normalized output.
In practice, a batch size of roughly 64–256 works well.
Batch normalization is a form of regularization by noise injection:
Batch size cannot be too large (otherwise the minibatch statistics $\mu$ and $\sigma$ are effectively fixed and inject little noise).
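A minimal NumPy sketch of training-mode batch normalization may help make the noise-injection view concrete. It assumes the standard form (normalize with the minibatch mean and standard deviation, then scale by $\gamma$ and shift by $\beta$); the function names, the toy data, and the two batch sizes compared are illustrative assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a (batch, features) array."""
    mu = x.mean(axis=0)   # per-feature minibatch mean
    var = x.var(axis=0)   # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, np.sqrt(var)

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=(10_000, 1))  # toy dataset

def spread_of_batch_means(batch_size, n_batches=200):
    """Std. dev. of the minibatch mean across random minibatches."""
    means = []
    for _ in range(n_batches):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        _, mu, _ = batch_norm_train(data[idx], gamma=1.0, beta=0.0)
        means.append(mu[0])
    return float(np.std(means))

small = spread_of_batch_means(64)    # noticeably noisy statistics
large = spread_of_batch_means(4096)  # nearly constant statistics
```

Comparing `small` with `large` shows the minibatch mean fluctuating far less at the larger batch size, so the random shift and scale that act as a regularizer largely disappear.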
training loop:
until converged
Xavier initialization is designed for MLPs with identity or sigmoid-type activations. For MLPs with ReLU activations, He et al. suggest
$$\frac{\gamma_t^2}{2}\left(\frac{n_t}{2} + \frac{n_{t-1}}{2}\right) = 1.$$
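Solving the condition above for $\gamma_t$ gives $\gamma_t^2 = 4/(n_t + n_{t-1})$. The sketch below assumes that $\gamma_t$ is the standard deviation of zero-mean Gaussian weights in layer $t$, with $n_{t-1}$ inputs and $n_t$ outputs; the helper names and the layer sizes are hypothetical.

```python
import numpy as np

def he_std(n_in, n_out):
    """Solve (gamma^2 / 2) * (n_out / 2 + n_in / 2) = 1 for gamma."""
    return np.sqrt(4.0 / (n_in + n_out))

def init_weights(n_in, n_out, rng):
    # Assumption: zero-mean Gaussian weights with standard deviation gamma_t.
    return rng.normal(0.0, he_std(n_in, n_out), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = init_weights(512, 512, rng)

# Rough check: feed ReLU outputs of unit-variance pre-activations through W.
h = np.maximum(rng.normal(size=(1000, 512)), 0.0)
out = h @ W
out_var = float(out.var())  # should stay close to 1 under this scaling
```

The factor $1/2$ in the condition reflects that a ReLU zeroes out roughly half of a symmetric pre-activation, which halves its second moment; the larger weight variance compensates for that loss.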