Convolutional neural net basics

DATASCI 415: Statistical Learning and Data Mining

University of Michigan

including slides by Mu Li and Alex Smola

Convolutional neural nets

  1. convolution

  2. pooling

  3. residual connections

  4. handwritten digit recognition with LeNet

2D convolutions

$$\begin{aligned} 0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43. \end{aligned}$$

animation by Vincent Dumoulin
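
The arithmetic above can be reproduced with a short NumPy sketch. The 3×3 input and 2×2 kernel are inferred from the products shown, and the function name `corr2d` is an illustrative choice, not something defined on the slides.

```python
import numpy as np

def corr2d(X, W):
    """Slide kernel W over X and sum the elementwise products (no padding, stride 1)."""
    k_h, k_w = W.shape
    m_h, m_w = X.shape[0] - k_h + 1, X.shape[1] - k_w + 1
    Y = np.zeros((m_h, m_w))
    for i in range(m_h):
        for j in range(m_w):
            # Window of X under the kernel at position (i, j)
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * W).sum()
    return Y

X = np.arange(9.0).reshape(3, 3)   # assumed input: [[0,1,2],[3,4,5],[6,7,8]]
W = np.arange(4.0).reshape(2, 2)   # assumed kernel: [[0,1],[2,3]]
Y = corr2d(X, W)
print(Y.tolist())  # [[19.0, 25.0], [37.0, 43.0]]
```

Note that the kernel is not flipped here: what deep learning calls a convolution is, strictly speaking, a cross-correlation.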

2D convolution layers

$$Y = X\star W + b$$
  • $X\in\reals^{n_h\times n_w}$: $n_h\times n_w$ input matrix
  • $W\in\reals^{k_h\times k_w}$: convolution kernel matrix parameter
  • $b\in\reals$: bias parameter
  • $Y\in\reals^{m_h\times m_w}$: output matrix
    • $m_h\triangleq n_h-k_h+1$
    • $m_w\triangleq n_w-k_w+1$
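
As a rough sketch of the layer $Y = X\star W + b$ for a single-channel input, the following uses NumPy's `sliding_window_view` to gather every $k_h\times k_w$ window and then checks the output shape. The function name `conv2d_layer` and the sizes are hypothetical choices for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_layer(X, W, b):
    """Compute Y = X ⋆ W + b for one input matrix, one kernel, and a scalar bias."""
    windows = sliding_window_view(X, W.shape)          # shape (m_h, m_w, k_h, k_w)
    return np.einsum("ijab,ab->ij", windows, W) + b    # weighted sum over each window

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))    # n_h = 6, n_w = 8
W = rng.normal(size=(3, 2))    # k_h = 3, k_w = 2
b = 0.5
Y = conv2d_layer(X, W, b)
print(Y.shape)  # (4, 7), i.e. (6 - 3 + 1, 8 - 2 + 1)
```

In a trained network, `W` and `b` would be learned; here they are random placeholders.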

1D and 3D convolutions

1D convolutions

$$ y_i = \sum_{a=1}^hw_ax_{i+a} + b $$
  • audio
  • (text)
  • time series
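
A minimal sketch of the 1D case, assuming a made-up signal and a kernel of width $h=2$. `np.correlate` with `mode="valid"` computes the same sliding weighted sum as the formula, up to the choice of index offset (NumPy counts from 0).

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical signal
w = np.array([1.0, 2.0])                            # kernel weights, h = 2
b = 0.0                                             # bias

# mode="valid" keeps only the positions where the kernel fully overlaps x
y = np.correlate(x, w, mode="valid") + b
print(y.tolist())  # [2.0, 5.0, 8.0, 11.0, 14.0, 17.0]
```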

3D convolutions

$$\begin{aligned} y_{i,j,k} &=\sum_{a_1=1}^h\sum_{a_2=1}^w\sum_{a_3=1}^d w_{a_1,a_2,a_3}x_{i+a_1,j+a_2,k+a_3} \\ &\qquad+ b \end{aligned}$$
  • video
  • 3D (medical) images
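
The same idea extends to three dimensions. The sketch below assumes a tiny made-up volume and a kernel of ones, so each output entry is the sum of one $2\times2\times2$ block plus the bias. The variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # e.g. (frames, height, width)
W = np.ones((2, 2, 2))                                   # h x w x d kernel
bias = 1.0

windows = sliding_window_view(X, W.shape)                # shape (1, 2, 3, 2, 2, 2)
Y = np.einsum("ijkabc,abc->ijk", windows, W) + bias      # sum over each 3D window
print(Y.shape)  # (1, 2, 3)
```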

Convolution demo

Padding

The output of a conv layer is smaller than its input:

  • input size: $n_h\times n_w$
  • kernel size: $k_h\times k_w$
  • output size: $(n_h-k_h+1)\times(n_w-k_w+1)$

Output shape shrinks faster with larger kernels!

animation by Vincent Dumoulin
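
To see how quickly the output shrinks, the sketch below applies the size formula three times in a row for several kernel sizes. The 28-pixel input side and the three-layer depth are assumptions chosen for illustration.

```python
def output_size(n, k):
    """Output length along one dimension for input length n and kernel length k."""
    return n - k + 1

for k in (3, 5, 7):
    n = 28                      # hypothetical input side length
    sizes = []
    for _ in range(3):          # three stacked conv layers, no padding
        n = output_size(n, k)
        sizes.append(n)
    print(k, sizes)
# 3 [26, 24, 22]
# 5 [24, 20, 16]
# 7 [22, 16, 10]
```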

Padding

Padding adds zeros around input to increase output size.

$$0\times0+0\times1+0\times2+0\times3=0$$

animation by Vincent Dumoulin
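
A small sketch of zero padding with `np.pad`, assuming a 3×3 input holding 0 through 8 and a 2×2 kernel holding 0 through 3 (both are assumptions for illustration). After padding, the top-left window reproduces the computation above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(9.0).reshape(3, 3)   # assumed 3 x 3 input
W = np.arange(4.0).reshape(2, 2)   # assumed 2 x 2 kernel

X_pad = np.pad(X, pad_width=1)     # one row/column of zeros on every side: 5 x 5
Y = np.einsum("ijab,ab->ij", sliding_window_view(X_pad, W.shape), W)
print(Y.shape)   # (4, 4), up from (2, 2) without padding
print(Y[0, 0])   # 0.0
```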

Padding

Padding with $p_h$ rows in total (top and bottom combined) and $p_w$ columns in total (left and right combined) gives output size

$$(n_h - k_h + p_h + 1)\times (n_w - k_w + p_w + 1).$$
  • input size: $n_h\times n_w$
  • kernel size: $k_h\times k_w$

If $k_h$ and $k_w$ are odd, then we pad by

  • $\frac{p_h}{2}$ rows each on the top and bottom,
  • $\frac{p_w}{2}$ columns each on the left and right,

where $p_h = k_h - 1$, $p_w = k_w - 1$, to match input and output sizes.
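
As a quick check of this rule, the sketch below pads a hypothetical $6\times8$ input for several odd kernel sizes and confirms that the number of sliding windows, and hence the output size, equals the input size.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.ones((6, 8))                                  # hypothetical 6 x 8 input
for k_h, k_w in [(3, 3), (5, 3), (7, 5)]:            # odd kernel sizes
    p_h, p_w = k_h - 1, k_w - 1                      # total padding per dimension
    X_pad = np.pad(X, ((p_h // 2, p_h // 2), (p_w // 2, p_w // 2)))
    out_shape = sliding_window_view(X_pad, (k_h, k_w)).shape[:2]
    print((k_h, k_w), out_shape)                     # out_shape is (6, 8) each time
```

With an even kernel size, $k-1$ is odd, so the padding cannot be split evenly between the two sides.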

Stride

Stride is the number of rows and columns the kernel moves per step.

Ex: stride of 3 rows and 2 columns

$$\begin{aligned} 0\times0+0\times1+1\times2+2\times3=8\\ 0\times0+6\times1+0\times2+0\times3=6 \end{aligned}$$

animation by Vincent Dumoulin
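
The two sums above can be reproduced with a strided version of the sliding-window loop. The sketch assumes a 3×3 input holding 0 through 8, zero-padded by one on every side, and a 2×2 kernel holding 0 through 3. The function name `corr2d_strided` is illustrative.

```python
import numpy as np

def corr2d_strided(X, W, s_h, s_w):
    """Cross-correlate X with W, moving s_h rows and s_w columns per step."""
    k_h, k_w = W.shape
    m_h = (X.shape[0] - k_h) // s_h + 1
    m_w = (X.shape[1] - k_w) // s_w + 1
    Y = np.zeros((m_h, m_w))
    for i in range(m_h):
        for j in range(m_w):
            r, c = i * s_h, j * s_w                  # top-left corner of the window
            Y[i, j] = (X[r:r + k_h, c:c + k_w] * W).sum()
    return Y

X_pad = np.pad(np.arange(9.0).reshape(3, 3), 1)      # assumed padded input, 5 x 5
W = np.arange(4.0).reshape(2, 2)                     # assumed kernel
Y = corr2d_strided(X_pad, W, s_h=3, s_w=2)
print(Y.tolist())  # [[0.0, 8.0], [6.0, 8.0]]
```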

Stride

With a stride of $s_h$ rows and $s_w$ columns, the output size is

$$\lfloor(n_h - k_h + p_h + s_h)/s_h\rfloor\times\lfloor(n_w - k_w + p_w + s_w)/s_w\rfloor.$$
  • input size: $n_h\times n_w$
  • kernel size: $k_h\times k_w$
  • padding $p_h$ rows and $p_w$ columns
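
The size formula is easy to wrap in a helper that handles one dimension at a time. The name `conv_output_size` and the example sizes are assumptions for illustration; `p` is the total padding along that dimension, as in the formula.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length for input length n, kernel length k, total padding p, stride s."""
    return (n - k + p + s) // s   # integer division implements the floor

# 3 x 3 input, 2 x 2 kernel, total padding 2, stride 3 (rows) and 2 (columns)
print(conv_output_size(3, 2, p=2, s=3), conv_output_size(3, 2, p=2, s=2))  # 2 2

# No padding and stride 1 reduces to n - k + 1
print(conv_output_size(28, 5))  # 24
```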

LeNet demo