Understanding the difficulty of training deep feedforward neural networks
Initialization often determines both the best accuracy a network can reach and how quickly it converges, so it has to be chosen carefully. The paper trains multilayer perceptrons with up to five hidden layers and analyzes the results. Interestingly, the choice of activation function and of initialization changes the behavior inside the network (the gradients and the distribution of activation values). The paper then proposes and analyzes normalized initialization, often called Xavier initialization after the first author, Xavier Glorot. It is designed to keep the variance of the activations and of the back-propagated gradients roughly constant across layers.
Date: 2010
Initializations
Default initialization
$W_{ij} \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$
U: uniform distribution
n: the size of the previous layer (the number of columns of W)
Normalized initialization
$W_{ij} \sim U[\frac{-\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}]$
U: uniform distribution
$n_i$: the size of layer $i$
$i$: the layer index
I think normalized initialization can also use a Gaussian with zero mean and the same variance: $W_{ij} \sim Gaussian(0, \frac{2}{n_i + n_{i+1}})$
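As a rough check of the two schemes above, the sketch below samples weights from both uniform ranges and compares the empirical variance with the values $\frac{1}{3n}$ and $\frac{2}{n_i + n_{i+1}}$. The function names, the layer sizes (300 and 200) and the seed are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def default_init(n_in, n_out, rng):
    """Sample an n_in x n_out matrix from U[-1/sqrt(n_in), 1/sqrt(n_in)]."""
    bound = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-bound, bound) for _ in range(n_out)] for _ in range(n_in)]


def normalized_init(n_in, n_out, rng):
    """Sample an n_in x n_out matrix from U[-b, b] with b = sqrt(6 / (n_in + n_out))."""
    bound = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-bound, bound) for _ in range(n_out)] for _ in range(n_in)]


def empirical_variance(matrix):
    """Population variance over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(0)   # fixed seed, only for reproducibility
n_in, n_out = 300, 200   # illustrative layer sizes (assumption)

var_default = empirical_variance(default_init(n_in, n_out, rng))
var_normalized = empirical_variance(normalized_init(n_in, n_out, rng))

# Each pair should agree up to sampling noise.
print(var_default, 1.0 / (3.0 * n_in))
print(var_normalized, 2.0 / (n_in + n_out))
```

The matrices are stored as n_in rows by n_out columns only to match the row-vector form $s = zW + b$ used in these notes; other layouts work the same way.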
Architecture
The networks have one to five hidden layers with one thousand hidden units per layer, and a softmax logistic regression as the output layer.
The cost function is the negative log-likelihood −log P(y|x)
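A minimal sketch of this cost for a single example, assuming the output layer produces a list of unnormalized scores (logits). The helper names are hypothetical, and the max-subtraction trick is a standard numerical safeguard rather than something the paper describes.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def negative_log_likelihood(logits, target):
    """-log P(y = target | x) when P is the softmax of the logits."""
    return -log_softmax(logits)[target]


# Ten equal scores give P(y|x) = 1/10, so the cost equals log(10).
print(negative_log_likelihood([0.0] * 10, target=3), math.log(10))
```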
Activation functions and their derivatives
Activation functions
$sigmoid(x) = \frac{1}{1 + \mathrm{e}^{-x}}$
$tanh(x) = \frac{\mathrm{e}^{x} - \mathrm{e}^{-x}}{\mathrm{e}^{x} + \mathrm{e}^{-x}}$
$softsign(x) = \frac{x}{1 + |x|}$
The derivatives
$\frac{\delta sigmoid}{\delta x} = \frac{\mathrm{e}^{-x}}{(1 + \mathrm{e}^{-x})^2}$
$\frac{\delta tanh}{\delta x} = 1 - tanh(x)^2$
$\frac{\delta softsign}{\delta x} = \frac{1}{(1 + |x|)^2}$
The tails of the softsign derivative decay quadratically (like a polynomial) rather than exponentially, so gradients keep flowing for larger inputs.
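The difference in the tails can be seen numerically. The sketch below implements the three activation functions and their derivatives as given above and prints the derivatives at a few points; the sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softsign(x):
    return x / (1.0 + abs(x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # same value as e^{-x} / (1 + e^{-x})^2


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def d_softsign(x):
    return 1.0 / (1.0 + abs(x)) ** 2


# Derivatives at a few arbitrary points: softsign decays far more slowly.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.2e}  "
          f"tanh'={d_tanh(x):.2e}  softsign'={d_softsign(x):.2e}")
```

At x = 10 the softsign derivative is 1/121, while the tanh derivative is already below 1e-8.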
Section 3
The sigmoid non-linearity has already been shown to slow down learning because its non-zero mean induces large singular values in the Hessian. Layers start to saturate from bottom to top. According to the paper, this is caused by the combination of random initialization and the fact that a hidden-unit output of 0 corresponds to a saturated sigmoid, and it correlates with slow convergence. Note that this phenomenon is not observed when the weights are initialized by unsupervised pre-training. The last hidden layer is quickly pushed to its lower saturation value, so gradients vanish on the way back and learning may not start in the lower layers. Because of these vanishing gradients in the lower layers, a five-layer network is too deep to start learning in this setting. The lower saturation is caused by the tendency of the softmax to rely more on its biases than on the activations of the previous layer: in softmax(b + Wh), b is learned quickly while h still varies a lot, so Wh is pushed towards 0 for stabilization by pushing h towards 0. A sigmoid whose output is near 0 is saturated and its derivative is close to zero, so gradients may not be propagated to the lower layers.
This saturation phenomenon needs to be explained in the future.
Different activation functions lead to different distributions of activation values.
Section 4
The conditional log-likelihood cost function works much better than the quadratic cost, which shows clearly more severe plateaus.
It has been found that, right after initialization, the back-propagated gradients get smaller as one moves from the output layer towards the input layer: their variance decreases as we go backwards through the network. With normalized initialization this phenomenon is not observed.
The normalized initialization is derived from the back-propagation equations as follows:
$s^i = z^i W^i + b^i$
$z^{i+1} = f(s^i)$
$s^{i+1} = z^{i+1} W^{i+1} + b^{i+1} = f(s^i) W^{i+1} + b^{i+1}$
$\frac{\delta s^{i+1}}{\delta s^i} = \frac{\delta f}{\delta s^i} W^{i+1}$
$\frac{\delta s^{i}}{\delta W^i} = z^{i}$
$\frac{\delta L}{\delta s^i} = \frac{\delta L}{\delta s^{i+1}}\frac{\delta s^{i+1}}{\delta s^i} = \frac{\delta f}{\delta s^i} W^{i+1} \frac{\delta L}{\delta s^{i+1}}$
$\frac{\delta L}{\delta W^i} = \frac{\delta L}{\delta s^{i}}\frac{\delta s^{i}}{\delta W^i} = z^i \frac{\delta f}{\delta s^i} W^{i+1} \frac{\delta L}{\delta s^{i+1}}$
Consider the linear regime of the activation function, thus:
$\frac{\delta f}{\delta s^i} \approx 1$
Let Var[ ] denote the variance. If $W_i$ and $x_i$ are independent and $\mu(x_i)=\mu(W_i)=0$, then:
$Var[W_i x_i] = Var[W_i]Var[x_i]$
Applying this layer by layer, with $d$ the number of layers:
$Var[z^i] = Var[x] \, \prod\limits_{i'=0}^{i-1} n_{i'} Var[W^{i'}]$
$Var[\frac{\delta L}{\delta s^i}] = Var[\frac{\delta L}{\delta s^d}] \, \prod\limits_{i'=i}^{d} n_{i'+1} Var[W^{i'}]$
$Var[\frac{\delta L}{\delta W^i}] = \prod\limits_{i'=0}^{i-1} n_{i'} Var[W^{i'}] \, \prod\limits_{i'=i}^{d} n_{i'+1} Var[W^{i'}] \times Var[x] \, Var[\frac{\delta L}{\delta s^d}]$
To keep information flowing forward:
$\forall(i,i'),\, Var[z^i]=Var[z^{i'}]$
To keep information flowing backward:
$\forall (i,i'), \, Var[\frac{\delta L}{\delta s^i}] = Var[\frac{\delta L}{\delta s^{i'}}]$
These two requirements translate into two conditions:
$\forall i,\, n_i Var[W^i]=1$
$\forall i,\, n_{i+1} Var[W^i]=1$
As a compromise between these two constraints:
$\forall i,\, Var[W^i]=\frac{2}{n_i + n_{i+1}}$
The variance of the uniform distribution is:
$W\sim\, U[a,b], \, Var[U[a,b]]=\frac{(b-a)^2}{12}$
Thus, the variance of the default initialization is:
$W^{i} \sim\, U[-\frac{1}{\sqrt{n_{i}}}, \frac{1}{\sqrt{n_{i}}}], \, Var[U[-\frac{1}{\sqrt{n_{i}}}, \frac{1}{\sqrt{n_{i}}}]]=\frac{1}{3n_{i}}$
$n_{i}Var[W^i] = \frac{1}{3}$
If all layers have the same size, $n Var[W] = \frac{1}{3} < 1$, so the variance shrinks at every layer and information is lost on the way. To prevent this:
$W\sim\, U[\frac{-\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}]$
This initialization is called normalized initialization or Xavier initialization. Confirm the variance:
$Var[U[\frac{-\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}]] = \frac{(\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}} + \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}})^2}{12} = \frac{2}{n_i + n_{i+1}}$
Thus, from the variance point of view, no information is lost during propagation (exactly so when consecutive layers have the same size).
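The effect of the two conditions can be checked with a few lines of arithmetic. The sketch below multiplies the per-layer factors $n_i Var[W^i]$ (forward) and $n_{i+1} Var[W^i]$ (backward) over a stack of layers, under the same linear-regime assumption as the derivation. The function names and the choice of six equal-width layers of 1000 units are assumptions made for illustration.

```python
import math


def weight_variance(n_in, n_out, scheme):
    """Var[W] for an n_in x n_out layer under the two uniform schemes."""
    if scheme == "default":        # U[-1/sqrt(n_in), 1/sqrt(n_in)]
        bound = 1.0 / math.sqrt(n_in)
    elif scheme == "normalized":   # U[-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))]
        bound = math.sqrt(6.0 / (n_in + n_out))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return (2.0 * bound) ** 2 / 12.0  # Var[U[a, b]] = (b - a)^2 / 12


def cumulative_gains(layer_sizes, scheme):
    """Products of n_i Var[W^i] (forward) and n_{i+1} Var[W^i] (backward)."""
    forward, backward = 1.0, 1.0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        var_w = weight_variance(n_in, n_out, scheme)
        forward *= n_in * var_w
        backward *= n_out * var_w
    return forward, backward


sizes = [1000] * 6  # five weight matrices between equal-width layers (assumption)
for scheme in ("default", "normalized"):
    forward, backward = cumulative_gains(sizes, scheme)
    print(f"{scheme:>10}: forward gain = {forward:.4g}, backward gain = {backward:.4g}")
```

With equal widths, the default scheme shrinks the variance by a factor of 1/3 per layer, about 0.0041 after five weight matrices, while the normalized scheme keeps both gains at 1. With unequal widths the normalized scheme satisfies neither condition exactly, which is the compromise mentioned above.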
With normalized initialization, the activation values are distributed very differently. Without it, the variance of the back-propagated gradients gets smaller towards the input layer, which is exactly the problem normalized initialization tries to solve: gradients flow backward more easily. Normalized initialization works: the variances are also preserved during training, and compared with Figure 3 they change less. Pre-training works well too. I suppose that unsupervised pre-training tries to preserve as much information as possible up to the last layer, so what it does is intrinsically the same as normalized initialization. The way of initialization dominates the accuracy.
















