Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
This paper proposes the Parametric Rectified Linear Unit (PReLU) and a generalized initialization for models that use ReLU. Xavier initialization is a well-known scheme that tries to keep the variance of the responses and of the gradients constant across layers. It works well, but its derivation implicitly assumes linear functions, so a model with nonlinear functions (such as ReLU, tanh, softsign) calls for a different initialization, even though Xavier initialization often still works for such models. The basic idea of the ReLU initialization is the same as that of Xavier initialization: preserve the variance while propagating.
PReLU
PReLU is defined as:
$ \begin{equation} f(y_i)= \begin{cases}\tag{1} y_i, & \text{if}\ y_i > 0 \newline a_i y_i, & \text{if}\ y_i \le 0 \end{cases} \end{equation} $
Initialization
$ \begin{equation} Var[w_l] = \begin{cases} \frac{2}{(1+a^2)n^l}, & \text{feedforward case} \newline \frac{2}{(1+a^2)n^{l+1}}, & \text{backward case} \newline \frac{4}{(1+a^2)(n^l + n^{l+1})}, & \text{averaged case} \newline \end{cases} \end{equation} $ $ \begin{align} y_i &\text{: the input of the nonlinear activation f on the }i\text{th channel} \newline a_i &\text{: the learnable parameter} \newline w_l &\text{: weights at layer l} \newline n_l &\text{: the number of connections at layer l, details are at (5)} \end{align} $
PReLU with a fixed a_i, typically 0.01, is called leaky ReLU; if a_i is zero, it is plain ReLU. The parameter a_i is introduced to avoid zero gradients for negative inputs. In the channel-wise variant, the number of extra parameters equals the total number of channels. If a_i is shared by all channels of a layer, the number of extra parameters equals the number of layers.
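Equation (1) and its two special cases can be written in a few lines. This is a minimal scalar sketch, not code from the paper; the function names are chosen here for illustration.

```python
def prelu(y, a):
    """PReLU of equation (1): y for y > 0, a * y otherwise."""
    return y if y > 0 else a * y


def relu(y):
    """ReLU is PReLU with the coefficient fixed to zero."""
    return prelu(y, 0.0)


def leaky_relu(y, a=0.01):
    """Leaky ReLU is PReLU with a small fixed coefficient."""
    return prelu(y, a)


print(prelu(2.0, 0.25))   # positive inputs pass through unchanged
print(prelu(-2.0, 0.25))  # negative inputs are scaled by a
```

In PReLU proper, a is not a constant argument but a parameter learned per channel (or per layer).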
To optimize a_i, backpropagation is used:
$ \begin{equation}\tag{2} \frac{\delta L}{\delta a_i} = \sum\limits_{y_i} \frac{\delta L}{\delta f(y_i)} \frac{\delta f(y_i)}{\delta a_i} \end{equation} $ $L \text{: a cost function }$
Gradients are defined as:
$ \begin{equation} \frac{\delta f(y_i)}{\delta a_i} = \begin{cases}\tag{3} 0, & \text{if}\ y_i > 0 \newline y_i, & \text{if}\ y_i \le 0 \end{cases} \end{equation} $
The update rule with momentum is: $ \tag{4} \Delta a_i := \mu \Delta a_i + \epsilon \frac{\delta L}{\delta a_i} $
$ \begin{align} \mu &\text{: the momentum} \newline \epsilon &\text{: the learning rate} \end{align} $
Note that weight decay tends to push a_i towards zero, which would bias PReLU towards ReLU. For this reason, weight decay is not applied when updating a_i. The paper initializes a_i to 0.25.
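Equations (2) to (4) can be sketched for a single channel as follows. The input values are made up, the function names are illustrative, and the final step `a := a - delta` is an assumption about the sign convention, since (4) only defines the increment.

```python
def prelu_grad_a(ys, upstream):
    """Equations (2)-(3): sum over positions of dL/df(y) * df(y)/da."""
    return sum(g * (y if y <= 0 else 0.0) for y, g in zip(ys, upstream))


def momentum_step(a, delta, grad, lr=0.01, mu=0.9):
    """Equation (4), followed by an assumed descent step a := a - delta.

    No weight decay term is added, matching the note above.
    """
    delta = mu * delta + lr * grad
    return a - delta, delta


ys = [1.5, -2.0, -0.5]        # made-up pre-activations of one channel
upstream = [0.1, 0.2, -0.4]   # made-up values of dL/df(y)
grad = prelu_grad_a(ys, upstream)
a, delta = momentum_step(0.25, 0.0, grad)  # a starts at 0.25 as in the paper
print(grad, a, delta)
```

Only the positions with y <= 0 contribute to the gradient, because a_i has no effect on the output for positive inputs.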
Architecture: small model
The learned coefficients a_i tend to become smaller as the depth increases. This suggests that the model becomes more nonlinear at higher layers and tries to keep more information at lower layers.
Architecture: big model
The difference between model B and model C is the width of the layers.
Imagenet result
Replacing every ReLU in the model with PReLU improves accuracy by 1.2% on ImageNet 2012.
Initialization
All models have weights and biases that must be initialized before learning. The choice of initialization is important: it strongly affects the accuracy a model can reach. Gaussian, uniform, and Xavier initialization are commonly used. Xavier initialization is especially well known, but it assumes that the model uses only linear functions, so a model with ReLU needs a different initialization. This paper proposes an initialization for models with ReLU, derived below. Let k be the spatial size of the filter at a layer and c the number of input channels. At layer l, the number of connections n is: $ n_l = k_l^2 c_l \tag{5}$
$s^l = z^l w^l + b^l\tag{6}$ $z^{l+1} = f(s^l)\tag{7}$ Here s is the response at a pixel of the output map, w is the weight, and b is the bias; z is the activation value and f is the activation function. For backpropagation, write the response of the next layer and differentiate: $s^{l+1} = z^{l+1} w^{l+1} + b^{l+1} = f(s^l) w^{l+1} + b^{l+1}\tag{8}$ $\frac{\delta s^{l+1}}{\delta s^l} = \frac{\delta f}{\delta s^l} w^{l+1}\tag{9}$ $\frac{\delta s^{l}}{\delta w^l} = z^{l}\tag{10}$ $\frac{\delta L}{\delta s^l} = \frac{\delta L}{\delta s^{l+1}}\frac{\delta s^{l+1}}{\delta s^l} = \frac{\delta f}{\delta s^l} w^{l+1} \frac{\delta L}{\delta s^{l+1}}\tag{11}$ $\frac{\delta L}{\delta w^l} = \frac{\delta L}{\delta s^{l}}\frac{\delta s^{l}}{\delta w^l} = z^l \frac{\delta f}{\delta s^l} w^{l+1} \frac{\delta L}{\delta s^{l+1}}\tag{12}$ where L is the cost function.
If X and Y are independent:
$ \begin{align} Var[XY] &= [E[X]]^2 Var[Y] + [E[Y]]^2 Var[X] + Var[X]Var[Y] \newline &= E[X^2]E[Y^2] - [E[X]]^2 [E[Y]]^2 \tag{13} \end{align} $
$Var[aX \pm bY] = a^2 Var[X] + b^2 Var[Y] \pm 2ab\, Cov[X,Y]\tag{14}$ $ \begin{align} Cov[X,Y] &= E[[X - E[X]][Y-E[Y]]] \newline &= E[XY] - E[X]E[Y]\tag{15} \end{align} $
First, consider the variance in forward propagation. Assume the biases b are zero, w and s have zero-mean symmetric distributions, and the activation function f is ReLU. $E[z^{l}] = E[f(s^{l-1})] \neq 0 \tag{16}$ $E[{z^l}^2] = \frac{1}{2} Var[s^{l-1}] = \frac{1}{2} E[{s^{l-1}}^2] \tag{17}$
From (5), (6), (13), (17):
$ \begin{align} Var[s^l] &= n^l Var[z^lw^l + b^l] \newline &= n^l Var[z^lw^l] \newline &= n^l E[{z^l}^2] E[{w^l}^2] \newline &= n^l E[{z^l}^2] Var[w^l]\newline &= \frac{1}{2} n^l Var[s^{l-1}] Var[w^l] \tag{18} \end{align} $
Thus, $Var[s^L] = Var[s^1](\prod\limits_{l=2}^{L} \frac{1}{2} n^l Var[w^l]) \tag{19}$ To keep information flowing, $\frac{1}{2}n^l Var[w^l] = 1, \, \forall l. \tag{20}$
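The forward-propagation result can be checked numerically. The sketch below is not from the paper: it assumes Gaussian weights, fully-connected layers of equal width (so n is the same for every layer), zero biases, and small sizes chosen only to keep pure Python fast. The first function estimates E[f(s)^2] for a zero-mean Gaussian s, which should be close to half of Var[s]. The second measures how the variance of the responses changes through a stack of ReLU layers for a given value of n Var[w].

```python
import math
import random


def relu(x):
    return x if x > 0 else 0.0


def relu_second_moment(sigma, samples=100_000, seed=0):
    """Monte Carlo estimate of E[f(s)^2] for s ~ N(0, sigma^2), f = ReLU."""
    rng = random.Random(seed)
    return sum(relu(rng.gauss(0.0, sigma)) ** 2 for _ in range(samples)) / samples


def variance_ratio(n_times_var_w, width=64, depth=6, batch=32, seed=0):
    """Estimate Var[s^L] / Var[s^1] for a random ReLU stack with L = depth.

    Every layer has n = width inputs and weights drawn from
    N(0, n_times_var_w / n). Responses are zero-mean, so the mean
    square is used as the variance estimate.
    """
    rng = random.Random(seed)
    std = math.sqrt(n_times_var_w / width)

    def mean_square(rows):
        return sum(v * v for row in rows for v in row) / (len(rows) * width)

    def linear(rows):
        w = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        return [[sum(wi * zi for wi, zi in zip(w_row, z)) for w_row in w]
                for z in rows]

    inputs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    s = linear(inputs)                      # s^1
    first = mean_square(s)
    for _ in range(depth - 1):              # s^2 ... s^L
        s = linear([[relu(v) for v in row] for row in s])
    return mean_square(s) / first


print(relu_second_moment(2.0))   # compare with 0.5 * 2.0**2
print(variance_ratio(2.0))       # n Var[w] = 2, the condition in (20)
print(variance_ratio(1.0))       # n Var[w] = 1, ignoring the rectifier
```

With n Var[w] = 2 the ratio should stay on the order of 1, up to sampling noise that grows with depth at this small width. With n Var[w] = 1 each rectified layer halves the variance in expectation, so the responses shrink geometrically with depth.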
Second, consider the variance in backward propagation. From (6), (11): $ \begin{align} \frac{\delta L}{\delta z^{l+1}} &= \frac{\delta L}{\delta s^{l+1}} \frac{\delta s^{l+1}}{\delta z^{l+1}} \newline &= \frac{1}{w^{l+1}\frac{\delta f}{\delta s^l}} \frac{\delta L}{\delta s^l} \times w^{l+1} \newline &= \left(\frac{\delta f}{\delta s^l}\right)^{-1} \frac{\delta L}{\delta s^l} \tag{21} \end{align} $
From (6), (21),
$ \begin{align}\tag{22} \frac{\delta L}{\delta z^l} &= \frac{\delta L}{\delta s^l} \frac{\delta s^l}{\delta z^l} \newline &= w^l\frac{\delta L}{\delta s^l} \newline &= w^l \frac{\delta f}{\delta s^l}\frac{\delta L}{\delta z^{l+1}} \end{align} $
From (5), (22), $Var[\frac{\delta L}{\delta z^l}] = n^{l+1}Var[w^l \frac{\delta f}{\delta s^l}\frac{\delta L}{\delta z^{l+1}}]\tag{23}$ Assume, $\forall l, \; E[\frac{\delta f}{\delta s^l}] = \frac{1}{2}, Var[\frac{\delta f}{\delta s^l}] = \frac{1}{4}, E[w^l]=0, E[\frac{\delta L}{\delta z^l}]=0\tag{24}$ From (13), (23), (24):
$ \begin{align}\tag{25} Var[\frac{\delta L}{\delta z^l}] &= \frac{1}{4} n^{l+1}Var[w^l]Var[\frac{\delta L}{\delta z^{l+1}}] + \frac{1}{4} n^{l+1}Var[w^l]Var[\frac{\delta L}{\delta z^{l+1}}]\newline &=\frac{1}{2} n^{l+1}Var[w^l]Var[\frac{\delta L}{\delta z^{l+1}}] \end{align} $
Thus, $Var[\frac{\delta L}{\delta z^2}] = Var[\frac{\delta L}{\delta z^{L+1}}](\prod\limits_{l=2}^{L} \frac{1}{2} n^{l+1} Var[w_l]) \tag{26}$ To keep information flowing, $\frac{1}{2} n^{l+1} Var[w_l] = 1, \forall l. \tag{28}$ From (20), (28) for ReLU, $ \begin{equation} Var[w_l] = \begin{cases}\tag{29} \frac{2}{n^l}, & \text{feedforward case} \newline \frac{2}{n^{l+1}}, & \text{backward case} \newline \frac{4}{n^l + n^{l+1}}, & \text{averaged case} \newline \end{cases} \end{equation} $
For PReLU, $ \begin{equation} Var[w_l] = \begin{cases}\tag{30} \frac{2}{(1+a^2)n^l}, & \text{feedforward case} \newline \frac{2}{(1+a^2)n^{l+1}}, & \text{backward case} \newline \frac{4}{(1+a^2)(n^l + n^{l+1})}, & \text{averaged case} \newline \end{cases} \end{equation} $
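The step from the ReLU result to (30) can be sketched for the feedforward case, assuming s has a zero-mean symmetric distribution and a single fixed coefficient a. The positive half of s passes through unchanged and the negative half is scaled by a, so the factor 1/2 in the derivation becomes (1+a^2)/2:
$ \begin{align} E[f(s)^2] &= \frac{1}{2}E[s^2] + \frac{1}{2}a^2 E[s^2] = \frac{1+a^2}{2}Var[s] \newline \frac{1}{2}(1+a^2)\, n^l\, Var[w_l] &= 1 \quad\Rightarrow\quad Var[w_l] = \frac{2}{(1+a^2)n^l} \end{align} $
The backward case follows in the same way with $n^{l+1}$ in place of $n^l$, because the squared derivative of f is 1 on the positive half and a^2 on the negative half.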
As (30) shows, a=0 gives the ReLU case and a=1 gives the linear case. For deep rectifier networks this initialization works better than Xavier initialization: in the paper's experiments, a 30-layer model converges with it, whereas training stalls with Xavier initialization.
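Equations (5) and (30) translate directly into a helper that returns the standard deviation to use for zero-mean weights. This is a sketch with illustrative names (`connections`, `weight_std`, and the `mode` strings are not from the paper). The caller supplies both connection counts, since how n^{l+1} is obtained depends on the architecture.

```python
import math


def connections(k, c):
    """Equation (5): n = k^2 * c for a k x k filter over c input channels."""
    return k * k * c


def weight_std(n_fwd, n_bwd, a=0.0, mode="forward"):
    """Standard deviation of zero-mean weights satisfying equation (30).

    n_fwd plays the role of n^l and n_bwd the role of n^{l+1}.
    a is the PReLU coefficient: a = 0 is ReLU, a = 1 is the linear case.
    """
    if mode == "forward":
        n = n_fwd
    elif mode == "backward":
        n = n_bwd
    elif mode == "average":
        n = (n_fwd + n_bwd) / 2
    else:
        raise ValueError(f"unknown mode: {mode}")
    return math.sqrt(2.0 / ((1.0 + a * a) * n))


n = connections(3, 64)                 # a 3x3 filter over 64 input channels
print(n, weight_std(n, n))             # ReLU, feedforward case
print(weight_std(n, n, a=0.25))        # PReLU at its initial a = 0.25
```

Weights would then be drawn from a zero-mean Gaussian with this standard deviation, with biases set to zero as assumed in the derivation.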
Preprocess
Subtract the mean pixel
Horizontal flipping
Data-augmentation
RGB-shift
s is randomly sampled from the range [256, 512]
The shortest side of the picture is resized to s
A 224×224 patch is cropped out randomly
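The scale jittering, random crop, and flip listed above can be sketched as a function that picks the geometry of one training view. It works on image sizes only, so it needs no image library. Mean subtraction and the RGB shift are left out. The function name, the integer sampling of s, and the flip probability of 0.5 are assumptions made for this illustration.

```python
import random


def sample_training_view(width, height, rng, crop=224, s_range=(256, 512)):
    """Pick resize and crop parameters for one training view.

    Returns (new_width, new_height, left, top, flip): the image is resized
    so that its shorter side equals s, then a crop x crop patch is taken at
    (left, top) and mirrored horizontally when flip is True.
    """
    s = rng.randint(*s_range)                  # scale jitter, inclusive range
    scale = s / min(width, height)
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)        # random crop position
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5                  # assumed flip probability
    return new_w, new_h, left, top, flip


rng = random.Random(0)
print(sample_training_view(500, 375, rng))
```

Because s is at least 256, the resized image is always large enough to contain a 224×224 crop.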
Optimization
SGD: when the error rate plateaus, the learning rate is divided by 10
Learning rate: 0.01
Minibatch: 128
Momentum: 0.9
Weight decay: 0.0005
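The learning rate rule can be sketched as a small scheduler. The hyperparameter values are the ones listed above. The `PlateauSchedule` class and its `patience` parameter (how many evaluations without improvement count as a plateau) are hypothetical, since the notes do not say how stagnation is detected.

```python
HYPERPARAMS = {
    "learning_rate": 0.01,
    "minibatch": 128,
    "momentum": 0.9,
    "weight_decay": 0.0005,
}


class PlateauSchedule:
    """Divide the learning rate by 10 when the error stops improving."""

    def __init__(self, lr, patience=2, factor=0.1):
        self.lr = lr
        self.patience = patience   # hypothetical plateau criterion
        self.factor = factor
        self.best = float("inf")
        self.stale = 0

    def update(self, error):
        """Record a new error value and return the learning rate to use."""
        if error < self.best:
            self.best = error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


schedule = PlateauSchedule(HYPERPARAMS["learning_rate"])
for err in [0.50, 0.40, 0.40, 0.41, 0.39]:
    print(err, schedule.update(err))
```

In this run the rate drops once, after two consecutive evaluations fail to improve on the best error seen so far.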
Test evaluation
Dense evaluation and multi-view testing on feature maps are combined.
The resized picture is fed into the model, and in the feature map from the last convolutional layer each 14×14 window is pooled by an SPP layer. The pooled features are then fed into the following fully-connected layers, and the scores of all dense sliding windows are averaged. The same process is applied to horizontally flipped images and to images at multiple scales, and their scores are simply averaged.
Dropout
Dropout is applied to the first two fully-connected layers with probability 0.5
Table 5 shows the ImageNet single-model results, using 10-view evaluation at test time. It shows that the width of the layers also matters for accuracy. ImageNet single-model result: multi-scale, multi-view, and dense evaluation are used. ImageNet multi-model result: multi-scale, multi-view, and dense evaluation are used.










