Optimisation Algorithms in Deep Learning
In this blog, I aim to explain different optimisation algorthims which are widely used in training deep learning models.
Training a deep learning model is a highly iterative process. There are millions of paramterers to learn using the infamous backpragation technique. If we do not choose the optimisation algorithm wisely, it ends up in taking lots and lots of time.
What is optimization algorithm?
In simple terms, it is a mathematical way of improving the predictions of our model by minimising the error (cost) iteratively untill we reach to a satisfactory state.
Revisiting Gradient Descent
In the process of training, we forward pass our training data (X) through the network. At the end of last layer, we get the computed class using the weights and biases, say Y_pred. Then we calculate the loss (J) based on some cost function using our Y_pred and Y_actual.
- Let’s say that loss is J(w).
- Then the next step is to calculate the gradient of J(w) with respect to weights and biases. This is done by backpropagation which essentially involved chain rule of differentiation.
- Once we have the gradients, we can update the weights and biases using the formula below.
Now let us see some famous optimization algorithms.
- Batch Gradient Descent
In eq. 1 above, we iterate on all the training points (j) from 0 to N to find the sum of gradients. Here, we process the entire training set all at same time. If we have 1000 training points then we process all 1000 training points at the same time to complete 1 iteration. So ideally the cost should reduce with each iteration. In other words, the cost vs iteration curve should be smooth.
2. Mini Batch Gradient Descent
In mini batch, we create mini batches out of training set and each batch has a corresponding target batch as well. For example, if we have 1000 training points then we can create 10 batches of 100 points each. Then in each iteration we pass just one batch, unlike the batch GD where we passed complete data in one iteration. Since we have 10 batches, we will be needing 10 iteration to complete a full coverage of training data. In this case, for one iteration, we only process 100 points and in next iteration a different set of 100 points. So the cost convergence will not be smooth as was in case of batch GD.
The reason for the noise is precisely the difference in difficulty level of each batch. The size of mini batch is a hyper parameter which can be tuned. If you put mini batch size = 1, then that gradient descent is known as Stochastic Gradient Descent.
Batch vs Mini Batch vs Stochastic GD
Let us say we have total n points and 1<m<n. Then as you can see below, the green one is convergence trend of mini batch and blue one is for stochastic and red one is for batch.
Batch descent is really smooth whereas as stochastic has a lot of noise and deviation while converging. Hence the mini batch descent is the best option here.
The main disadvantage with stochastic GD is that it is not able to take speed advantage of vectorisation. Whereas the disadvantage with batch GD is that it takes too much time per iteration as it process all the data at one go.
To move further and look towards more efficient algorithms, we first need to understand a bit about Exponentially Weighted Average.
What is Exponentially Weighted Average?
Given a set of data X = {X1,X2,….Xt,….,Xn}, we find a value Vt corresponding to each data Xt by considering the average of 1/(1 — beta) points preceeding current point.
From the equation above we can see that if we use higher beta that means we are giving more importance to previous points rather than the current point. Higher beta leads to a smooth distribution and a right side shifted curve but less adaptability for newer data.
To calculate this exponentially moving average we just need O(1) space. Initialize Vx = 0 and iterate over all the X and keep updating Vx using Eq. 2.
2. Gradient Descent with Momentum
This is a faster method than simple gradient descent. Here the main idea is to find the exponential moving average of gradients while calculating the derivative dL/dw. As we saw earlier, convergence can be very noisy and hence we want to smooth out the convergence. For that purpose we introduce momentum in our gradients. So our Eq. 1 changes now and we use:
Note: I have not written the equation for bias term but it also follows the same form.
3. RMS Prop
RMS Prop is also a similar optimization technique but with a little improvement. Consider a contour where we are having a very noisy convergence in vertical direction:
We do not want the red convergence but we want a green one. In other words, we want to reduce the vertical oscillation and speed up the horizontal one. So now our equation for descent will be transformed to the following.
Here also we calculate the moving average but this time we perform element wise square of dW. Then we do not simply multiply the computed momentum with learning rate but we divide the gradient with momentum. The reason is to penalize the high oscillation in given direction.
Consider in Fig 5, “Wi” direction has less movement but “Wj” has high movement. Hence, SdWj > SdWi. So, while updating the W, change in Wj is smaller than the change in Wi. This is because we use large SdWj in denominator which makes the overall gradient for Wj smaller. Hence change is small and less deviation in vertical direction. We add epsilon in denominator to avoid zero division.
4. Adam
Adam is a combination of momentum and RMS prop.
In most of the cases, this works better than all the algorithms explained above. But this can be problem specific too. So worth experimenting.
Experiment
I trained a multi layered perceptron on MNIST data. In my case SGD was the clear winner but it came at a cost of increased time for computation.
Note: SGD took only 30 iterations to reach at an excellent result whereas the others have 300 epochs (they are scaled down to 30 for the sake of plotting).
Code can be found on my github: https://github.com/Saurabhraj5162/INM702_CW_MNIST