Deep Learning Basic Cheatsheet: Basics (Node, Layer, Activation Function), Feed Forward, Back Propagation, Optimization (Stochastic Gradient Descent), Regularization (Dropout), Batch Normalization, Gradient Vanishing and Exploding.
As we all know, deep neural networks are often regarded as black boxes because of their complexity, and tuning a model usually takes a long time. Here I’d like to briefly summarize some basics that are necessary for working with neural networks. I think Andrew Ng’s course on Coursera is a nice choice if you don’t have much experience but want to get started, so some of the material here comes from that course series. Here is the link.
1. Notations and Terminology
- Components
No matter what kind of neural network we use, it has an input layer, hidden layers and an output layer. Each layer contains a number of nodes, also called units. A general form for the relationship between the input and output of the \(i\)th layer is
\[
Z^{[i]} = W^{[i]}Z^{[i-1]} + b^{[i]},
\]
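where \(Z^{[i]}\) is the output (a vector) of the \(i\)th layer and also the input of the \((i+1)\)th layer. If we denote the input layer as the 0th layer, then \(Z^{[0]}\) is just \(X\), which holds \(m\) observations of \(p\) features. \(W^{[i]}\) holds the weights of the connections between the \((i-1)\)th and \(i\)th layers, \(n_{i}\) is the number of units in the \(i\)th layer, and \(b^{[i]}\) is the bias term. For the product \(W^{[i]}Z^{[i-1]}\) to be defined, \(W^{[i]}\) has shape \(n_{i}\times n_{i-1}\) and \(X\) is arranged as \(p\times m\), one column per observation; if \(X\) is stored as \(m\times p\) instead, the same relation reads \(Z^{[i]} = Z^{[i-1]}W^{[i]} + b^{[i]}\) with \(W^{[i]}\) of shape \(n_{i-1}\times n_{i}\).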
- Activation functions
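An activation function \(g\) is applied element-wise to the output of each layer, and the activated output becomes the input of the next layer. With this small change we have
\[
A^{[i]} = g\left(Z^{[i]}\right) = g\left(W^{[i]}A^{[i-1]} + b^{[i]}\right),
\]
where \(A^{[0]} = X\).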
Common activation functions include: sigmoid, tanh, ReLU (rectified linear unit), Leaky ReLU.
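As a quick illustration, the four activation functions listed above can be written for a single number as follows. This is a minimal sketch in plain Python; the leaky slope of 0.01 is an assumed default, not a value taken from the course.

```python
import math

def sigmoid(z):
    # Logistic sigmoid, written so that exp() never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    # Hyperbolic tangent, output in (-1, 1).
    return math.tanh(z)

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but negative inputs keep a small slope (assumed 0.01 here).
    return z if z > 0 else slope * z
```

To apply one of them to a whole layer, map it over the vector, for example `[relu(z) for z in Z]`.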
- Loss functions
A loss function \(L(\hat{y}, y)\) measures how good the prediction is; the lower the better. For regression, common loss functions are the MSE (mean squared error) loss and the Huber loss; for classification, common loss functions are the log loss (cross entropy) and the hinge loss. More loss functions can be found here.
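The four losses named above can be sketched in a few lines each. The conventions below are assumptions made for this sketch: every loss is averaged over the observations, the log loss takes labels in {0, 1} with predicted probabilities, the hinge loss takes labels in {-1, +1} with raw scores, and `delta` and `eps` are illustrative defaults.

```python
import math

def mse(y, y_hat):
    # Mean squared error.
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    total = 0.0
    for a, b in zip(y, y_hat):
        r = abs(a - b)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y)

def log_loss(y, p, eps=1e-12):
    # Binary cross entropy; probabilities are clipped to avoid log(0).
    total = 0.0
    for a, q in zip(y, p):
        q = min(max(q, eps), 1.0 - eps)
        total -= a * math.log(q) + (1 - a) * math.log(1.0 - q)
    return total / len(y)

def hinge(y, score):
    # Zero once the score is on the correct side with margin at least 1.
    return sum(max(0.0, 1.0 - a * s) for a, s in zip(y, score)) / len(y)
```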
- Forward-propagation
Forward propagation is the process of propagating the inputs forward through the network to get the outputs (predictions); the loss function can be evaluated once the outputs are generated. For example, for a network with a single layer of weights, forward propagation consists of
\[
\begin{aligned}
Z^{[1]} &= W^{[1]}X + b^{[1]}, \\
\hat{y} &= A^{[1]} = g\left(Z^{[1]}\right). \\
\end{aligned}
\]
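The two equations above can be sketched for a single observation as follows. The sketch assumes `x` is one input vector, `W` is stored as a list of rows (one row per output unit), and \(g\) is the sigmoid; the numbers are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, x, g=sigmoid):
    # Z = W x + b, computed row by row, then A = g(Z) element-wise.
    Z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    A = [g(z) for z in Z]
    return Z, A

W = [[0.5, -0.5]]   # one output unit, two input features
b = [0.0]
x = [1.0, 1.0]
Z, A = forward(W, b, x)
print(Z, A)  # [0.0] [0.5]
```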
- Back-propagation
Back-propagation is the process of propagating from the output back to the input; the gradients calculated along the way are used to update \(W\) and \(b\). For example, for a network with a single layer of weights (\(\hat{y} = A^{[1]}\)), the chain rule gives the gradients of \(W\) and \(b\) as
\[
\begin{aligned}
\frac{\partial L(\hat{y}, y)}{\partial W^{[1]}} = \frac{\partial L(\hat{y}, y)}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial W^{[1]}},\\
\frac{\partial L(\hat{y}, y)}{\partial b^{[1]}} = \frac{\partial L(\hat{y}, y)}{\partial A^{[1]}} \frac{\partial A^{[1]}}{\partial Z^{[1]}} \frac{\partial Z^{[1]}}{\partial b^{[1]}}.
\end{aligned}
\]
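To make the chain rule concrete, here is a small sketch for one observation and one output unit. It assumes a sigmoid activation and the log loss, which the text does not fix; the three factors in the code correspond to the three factors in the equations above, and a finite-difference estimate is computed as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Log loss of a single sigmoid unit on one observation.
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def gradients(w, b, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    dL_da = -(y / a) + (1 - y) / (1 - a)   # dL/dA
    da_dz = a * (1 - a)                    # dA/dZ for the sigmoid
    dL_dz = dL_da * da_dz                  # simplifies to a - y
    dW = [dL_dz * xi for xi in x]          # dZ/dW = x
    db = dL_dz                             # dZ/db = 1
    return dW, db

w, b, x, y = [0.2, -0.4], 0.1, [1.0, 2.0], 1
dW, db = gradients(w, b, x, y)

# Central finite difference on b, used only to check the analytic gradient.
h = 1e-6
num_db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
```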
2. Optimization
- Mini-batch gradient descent
In regular gradient descent, the update of the weights is given by
\[
\begin{aligned}
W\leftarrow~& W -\alpha\frac{\partial L(\hat{y}, y)}{\partial W},\\
b\leftarrow~& b -\alpha\frac{\partial L(\hat{y}, y)}{\partial b},\\
\end{aligned}
\]
where \(\alpha\) is the learning rate.
Training a neural network takes a long time when the training set is large (usually more than a hundred thousand observations), because each update requires back-propagation over the whole training set. One way to speed up training is to divide the data into equal-size subgroups (mini-batches) and update the model on those subgroups one by one until the whole training set has been used. The size of each subgroup is called the batch size. When the batch size is \(m\), the whole training set is used for every update, which is regular gradient descent; when it is 1, we get stochastic gradient descent. A single pass through the training set is called one epoch. Because each mini-batch is only a sample of the data, mini-batch training introduces randomness: if we plot the loss against the iteration number, the loss does not decrease at every step, but the overall trend is downward.
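A mini-batch training loop can be sketched as below. Everything here is illustrative: the toy data (\(y = 3x\)), the one-parameter model, the learning rate and the batch size are assumptions chosen so the loop is easy to follow, and the helper name `minibatches` is hypothetical.

```python
import random

def minibatches(X, y, batch_size, rng):
    # Shuffle the indices once per epoch, then cut them into consecutive chunks.
    idx = list(range(len(X)))
    rng.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        chunk = idx[start:start + batch_size]
        yield [X[i] for i in chunk], [y[i] for i in chunk]

rng = random.Random(0)
X = [i / 10 for i in range(20)]   # 20 observations, one feature
y = [3.0 * x for x in X]          # noise-free target y = 3x

w, alpha = 0.0, 0.1
for epoch in range(50):           # one epoch = one pass over all mini-batches
    for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
        # Gradient of the mean squared error on this mini-batch only.
        grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(xb, yb)) / len(xb)
        w -= alpha * grad
```

With 20 observations and a batch size of 4, each epoch performs 5 updates instead of 1.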
- Gradient descent with Momentum
The update of weights is given by
\[
\begin{aligned}
V_{W}\leftarrow~& \beta V_{W} + (1 - \beta)\frac{\partial L(\hat{y}, y)}{\partial W},\\
V_{b}\leftarrow~& \beta V_{b} + (1 - \beta)\frac{\partial L(\hat{y}, y)}{\partial b},\\
W\leftarrow~& W -\alpha V_{W},\\
b\leftarrow~& b -\alpha V_{b},\\
\end{aligned}
\]
where an exponentially weighted average is applied to the gradients; \(V_{W}\) and \(V_{b}\) are both initialized to 0.
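A scalar sketch of the momentum update, with the velocity accumulated as \(\beta V + (1-\beta)\) times the gradient so that \(W \leftarrow W - \alpha V\) moves downhill. The test function \(f(w) = w^2\) and the values of \(\alpha\) and \(\beta\) are assumptions for illustration.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    # Exponentially weighted average of past gradients.
    v = beta * v + (1 - beta) * grad
    # Step against the averaged gradient.
    w = w - alpha * v
    return w, v

# Minimise f(w) = w^2, whose gradient is 2w, starting from w = 5 and v = 0.
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, 2 * w)
```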
- RMSprop
Root mean squared prop (RMSprop) also revises the update equations,
\[
\begin{aligned}
S_{W}\leftarrow~& \beta S_{W} + (1 - \beta)\left(\frac{\partial L(\hat{y}, y)}{\partial W}\right)^2,\\
S_{b}\leftarrow~& \beta S_{b} + (1 - \beta)\left(\frac{\partial L(\hat{y}, y)}{\partial b}\right)^2,\\
W\leftarrow~& W -\alpha \frac{\frac{\partial L(\hat{y}, y)}{\partial W}}{\sqrt{S_{W}}},\\
b\leftarrow~& b -\alpha \frac{\frac{\partial L(\hat{y}, y)}{\partial b}}{\sqrt{S_{b}}},\\
\end{aligned}
\]
where \(S_{W}\) and \(S_{b}\) are both initialized to 0.
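A scalar sketch of the RMSprop update, with \(S\) accumulated as \(\beta S + (1-\beta)\) times the squared gradient. The small `eps` in the denominator is an addition made here for numerical safety; the values of \(\alpha\), \(\beta\) and `eps` are illustrative assumptions.

```python
import math

def rmsprop_step(w, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    # Running average of the squared gradient.
    s = beta * s + (1 - beta) * grad ** 2
    # Scale the step by the root of that average (eps guards against s == 0).
    w = w - alpha * grad / (math.sqrt(s) + eps)
    return w, s

# Minimise f(w) = w^2 (gradient 2w), starting from w = 1 and s = 0.
w, s = 1.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, 2 * w)
```

With a fixed learning rate the iterate ends up hovering in a small band around the minimum rather than converging exactly.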
- Adam
Adaptive moment estimation (Adam) combines momentum and RMSprop; the update equations are now
\[
\begin{aligned}
W\leftarrow~& W - \alpha\frac{V_{W}}{\sqrt{S_{W}} + \epsilon},\\
b\leftarrow~& b - \alpha\frac{V_{b}}{\sqrt{S_{b}} + \epsilon},
\end{aligned}
\]
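where \(V\) and \(S\) are the exponentially weighted averages from momentum and RMSprop, computed with weights \(\beta_{1}\) and \(\beta_{2}\) respectively, and \(\epsilon\) is a very small number that avoids division by zero.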
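A scalar sketch of the Adam update in the form shown above: a momentum-style average \(V\) in the numerator and an RMSprop-style average \(S\) in the denominator. It leaves out the bias-correction step that the standard Adam algorithm applies to \(V\) and \(S\), and the default values of \(\alpha\), \(\beta_{1}\), \(\beta_{2}\) and \(\epsilon\) are assumptions.

```python
import math

def adam_step(w, v, s, grad, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad        # momentum part
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop part
    w = w - alpha * v / (math.sqrt(s) + eps)  # combined update
    return w, v, s

# One step on f(w) = w^2 (gradient 2w) from w = 5 with V = S = 0.
w, v, s = adam_step(5.0, 0.0, 0.0, 2 * 5.0)
```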
3. Regularization
- L1-norm and L2-norm
The L1-norm and L2-norm penalties add an absolute-value term and a squared term, respectively, to the loss function:
\[
\begin{aligned}
\text{(L1-norm): }& + \lambda\sum^{L}_{l=1}\sum_{j,k} \left|W^{[l]}_{jk}\right| \\
\text{(L2-norm): }& + \lambda\sum^{L}_{l=1}\sum_{j,k} \left(W^{[l]}_{jk}\right)^2
\end{aligned}
\]
where \(L\) is the number of layers and the inner sum runs over all entries of \(W^{[l]}\). The L2-norm is the one I use more often in practice. Large weights are penalized heavily, so this regularization leads to smaller weights. Also note that the L1-norm term is not differentiable at zero, so a subgradient is used there.
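The two penalties and their gradients can be sketched as follows, with L1 as the sum of absolute weights and L2 as the sum of squared weights. The sketch assumes the weights are given as a list of matrices (one per layer, each a list of rows) and uses 0 as the L1 subgradient at a zero weight; the function names are hypothetical.

```python
def l1_penalty(weights, lam):
    # lam * sum of |w| over every entry of every layer's weight matrix.
    return lam * sum(abs(w) for W in weights for row in W for w in row)

def l2_penalty(weights, lam):
    # lam * sum of w^2 over every entry of every layer's weight matrix.
    return lam * sum(w * w for W in weights for row in W for w in row)

def sign(w):
    return (w > 0) - (w < 0)

def l1_grad(W, lam):
    # Subgradient of the L1 term for one matrix; 0 is used where w == 0.
    return [[lam * sign(w) for w in row] for row in W]

def l2_grad(W, lam):
    # Gradient of the L2 term for one matrix: proportional to the weight itself.
    return [[2 * lam * w for w in row] for row in W]
```

The L2 gradient shrinks each weight in proportion to its size, while the L1 subgradient pushes every non-zero weight toward zero by the same amount.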
- Dropout
Randomly drop some units in a layer by setting their outputs to zero, which removes all of their connections for that pass; intuitively, those connections are “dropped”. This is implemented by setting a drop rate \(p\), or equivalently a keep rate \((1-p)\). We usually update the weights by running through the whole training set multiple times (multiple epochs), and a different random set of units is dropped on each pass. Because some outputs are zeroed, the expected output of the layer is lower, so the remaining outputs are divided by the keep rate to keep the expectation unchanged. This is called inverted dropout. As a result, no dropout and no rescaling are needed on test data.
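Inverted dropout on the output vector of one layer can be sketched as below. The function name and the `training` flag are hypothetical; the random generator is passed in so the example is reproducible.

```python
import random

def inverted_dropout(a, drop_rate, rng, training=True):
    # At test time (or with drop_rate 0) the outputs pass through unchanged.
    if not training or drop_rate == 0.0:
        return list(a)
    keep = 1.0 - drop_rate
    # Keep each output with probability `keep` and divide it by `keep`,
    # so the expected value of every output stays the same.
    return [x / keep if rng.random() < keep else 0.0 for x in a]

rng = random.Random(0)
out = inverted_dropout([1.0] * 10000, drop_rate=0.5, rng=rng)
mean_out = sum(out) / len(out)   # close to 1.0, the mean of the input
```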
- Data Augmentation
This means adding more training data for the model without collecting new data. For image data, for example, we can apply simple operations such as flipping, rotating, scaling or adding noise to the images while keeping the labels unchanged; for text data, we can translate the text into another language and back, which may change the wording but not the label.
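As a tiny example of the image case, the sketch below doubles a data set by adding a horizontally flipped copy of every image with the same label. It assumes an image is a list of pixel rows; the function names and the toy image are hypothetical.

```python
def horizontal_flip(image):
    # Reverse every row of pixels; the image content is mirrored left to right.
    return [row[::-1] for row in image]

def augment(dataset):
    # Keep each original (image, label) pair and add a flipped copy with the same label.
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((horizontal_flip(image), label))
    return out

data = [([[1, 2], [3, 4]], "cat")]
augmented = augment(data)
```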
- Early Stopping
This requires splitting a development (validation) set out of the training data, training the model only on the rest, and evaluating the model on the hold-out set to monitor performance at each iteration. Training should be stopped if the loss on the hold-out set has not decreased for a certain number of rounds.
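The stopping rule can be sketched independently of any model. The function below takes the sequence of hold-out losses, one per round, and a `patience` value (a hypothetical name for the "certain number of rounds"), and reports where training would stop and which round was best.

```python
def early_stopping_epoch(val_losses, patience):
    # Returns (round at which training stops, round with the best hold-out loss).
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best loss: remember it and reset the waiting counter.
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # The loss kept improving often enough; training ran to the end.
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)  # 5 2
```

In practice the weights from the best round are usually the ones kept, which is why the best round is returned as well.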
- Batch Normalization
Strictly speaking, this is not a regularization method, but it can be viewed as having a regularization effect. Batch normalization (BN) normalizes the intermediate values of the network. Given the output of the \(i\)th layer, with \(Z^{[i]}_{j}\) the value for the \(j\)th of \(m\) observations, we calculate
\[
\mu = \frac{1}{m}\sum^{m}_{j=1}Z^{[i]}_{j},~ \sigma^{2} = \frac{1}{m}\sum^{m}_{j=1} \left(Z^{[i]}_{j} - \mu\right)^{2}
\]
then normalize \(Z^{[i]}\)
\[
Z^{[i]}_{\text{norm}} = \gamma\frac{Z^{[i]} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,
\]
where \(\epsilon\) is a very small number that avoids division by zero, and \(\gamma\) and \(\beta\) are learnable scale and shift parameters (this \(\beta\) is unrelated to the averaging weight used in momentum).
This normalization step is usually done before applying the activation function. It makes the weights of later layers more robust to changes in the weights of earlier layers. With mini-batch training it also adds some noise, since each mini-batch is normalized by its own “local mean” and “local standard deviation”; that is why it has some regularization effect.
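The normalization can be sketched for the values of a single unit across one mini-batch. This covers only the training-time computation from the formulas above; \(\gamma\), \(\beta\) and \(\epsilon\) are passed as plain numbers here, and their defaults are assumptions.

```python
import math

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # z holds the values of one unit for every observation in the mini-batch.
    m = len(z)
    mu = sum(z) / m                            # mini-batch mean
    var = sum((v - mu) ** 2 for v in z) / m    # mini-batch variance
    # Normalize, then scale by gamma and shift by beta.
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in z]

z_norm = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the defaults the output has mean close to 0 and variance close to 1; other values of `gamma` and `beta` rescale and shift it.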
4. Other Terminology
- Vanishing / Exploding gradients
This usually happens in deep neural networks. Intuitively, the gradient at an early layer is a product of many per-layer factors: a number larger than 1 becomes very large when raised to a high power, and a number smaller than 1 becomes very small. This makes training very difficult. A partial solution is to initialize the weights very carefully.
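The intuition is easy to check numerically. The sketch below treats the gradient as a product of one identical scalar factor per layer, which is a deliberate simplification; the depth of 50 and the factors 0.9 and 1.1 are arbitrary illustrative values.

```python
depth = 50

# A per-layer factor slightly below 1 shrinks the product toward zero ...
shrinking = 0.9 ** depth   # roughly 0.005

# ... while a factor slightly above 1 blows it up.
growing = 1.1 ** depth     # roughly 117
```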
- Hyper-parameters
Hyper-parameters are settings that are chosen before training and are not updated by the training process itself. In summary, the hyper-parameters covered in the sections above include:
- structure: number of layers, number of units in layers, activation functions;
- optimization algorithms: learning rate, learning rate decay type, batch size, number of training epochs, exponential weighted average weights in momentum/RMSprop/Adam;
- regularization: penalty term of L1-norm and L2-norm, dropout rate, adding batch normalization.
This is the general case; different types of networks bring more options.
5. References
- Coursera Notes