Mathematical Analysis of Gradient Descent Optimization Algorithms

(1) Abstract:
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to give the reader intuitions about the behaviour of the different algorithms so that they can put them to use. In the course of this blog, we look at the different variants of gradient descent, summarize their challenges, and introduce the most common adaptive optimization algorithms.
(2) Introduction :
Gradient Descent (GD) is one of the most common optimization algorithms and is implemented in every state-of-the-art Deep Learning library (e.g. lasagne's, caffe's, and keras' documentation).
GD is a technique to minimise an objective function J(θ) by iteratively applying the update:
θ = θ - η · ∇_θ J(θ)
The objective function is a function of the model's parameters {θ : θ ∈ R^d}. GD minimizes the objective function by updating the parameters in the direction opposite to the gradient of the objective function ∇_θ J(θ), with the learning rate η determining the size of each step we take.

In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
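To make this concrete, here is a minimal sketch of vanilla GD on a toy one-dimensional objective J(θ) = θ^2, whose gradient is 2θ; the objective, starting point, learning rate, and number of steps are chosen purely for illustration:
import numpy as np

# Toy objective J(theta) = theta^2 with gradient 2 * theta (illustration only)
theta = np.array([5.0])
learning_rate = 0.1

for step in range(100):
    grad = 2 * theta                      # gradient of J w.r.t. theta
    theta = theta - learning_rate * grad  # step opposite to the gradient

print(theta)  # approaches the minimum at theta = 0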
(3) Gradient Descent Variants:
There are three different variants of the Gradient Descent algorithm, depending on how much data is used to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time taken to perform an update.
3.1) Batch Gradient Descent (BGD):
Vanilla GD, a.k.a. Batch Gradient Descent, uses the whole training dataset to evaluate the objective function for an epoch and computes the gradient of the cost function w.r.t. the parameters over the entire dataset.

In code, BGD looks like this:
for i in range(nb_epochs):
    # one gradient of the cost over the entire training set per update
    params_grad = evaluate_gradient(cost_function, params, data)
    params = params - learning_rate * params_grad
Disadvantage(s) of BGD:
(a) As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
(b) BGD also does not allow us to update the model online, i.e. with new training examples arriving on-the-fly during training.
3.2) Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD), in contrast, performs a parameter update for each training example (x^(i), y^(i)):
θ = θ - η · ∇_θ J(θ; x^(i), y^(i))

In code, it looks like:
for i in range(nb_epochs):
    np.random.shuffle(data)   # shuffles in place and returns None, so do not assign
    for example in data:
        params_grad = evaluate_gradient(loss_function, params, example)
        params = params - learning_rate * params_grad
Disadvantage(s) of SGD:
(a) As an update is performed for each (randomly shuffled) training example, the updates themselves are noisy.

Because of these frequent per-example updates, the parameter values fluctuate with high variance, which makes convergence unstable.
3.3) Mini-Batch Gradient Descent (MBGD):
MBGD is an intermediate approach between BGD and SGD. It performs an update for every mini-batch of n training examples:
θ = θ - η · ∇_θ J(θ; x^(i:i+n), y^(i:i+n))

In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(cost_function, params, batch)
        params = params - learning_rate * params_grad
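The snippet above relies on a get_batches helper that is not defined in this post; a minimal sketch of such a helper (assuming data is an indexable list or array of training examples) could be:
def get_batches(data, batch_size=50):
    # Yield consecutive mini-batches of batch_size examples from data.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]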
Thus, MBGD is usually the best approach among these GD variants: (a) it reduces the variance of the parameter updates compared with SGD, which can lead to more stable convergence; (b) it does not need to pass over the whole dataset to compute a single update, and each mini-batch can be processed with efficient vectorized operations, which makes it more efficient than the other two variants.
(4) Adaptive Optimization Techniques (AOT):
'Adaptive' => adjusting the magnitude of the update value according to the characteristics of the current dataset and the gradients observed during training.
Motivation for AOT:
One of the prominent shortcomings of vanilla gradient descent is the following:
A key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down.

These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
That means ∇_θ J(θ) ≈ 0 around a saddle point.
That implies that the update η · ∇_θ J(θ) ≈ 0 as well.
So, there will be essentially no update to the parameters around the saddle points.
=> The parameters stall and merely oscillate around the saddle point.
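As a concrete illustration (the function f(x, y) = x^2 - y^2 is chosen here only as a textbook example of a saddle), the gradient, and therefore the vanilla GD update, all but vanishes near the saddle point at the origin:
import numpy as np

def grad_f(theta):
    # f(x, y) = x**2 - y**2 slopes up along x and down along y;
    # its gradient (2x, -2y) vanishes at the saddle point (0, 0).
    x, y = theta
    return np.array([2 * x, -2 * y])

theta = np.array([1e-6, 1e-6])          # a point on the plateau near the saddle
learning_rate = 0.01
update = learning_rate * grad_f(theta)
print(update)                           # ~[2e-8, -2e-8]: essentially no progress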
To get out of that saddle point, there must exist some way of keeping the parameters moving.
'way' => increasing the magnitude of the update even around the saddle points, where the raw gradient is close to zero.

SGD with momentum
SGD with momentum is a modified SGD in which we add an offset to the update value of vanilla SGD.
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the update vector of the past time step (the offset) to the current update vector.
Current update value at the t-th iteration: v_t = γ · v_{t-1} + η · ∇_θ J(θ)
So, the updated parameter value is: θ = θ - v_t
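A minimal sketch of this update rule, reusing the toy objective J(θ) = θ^2 from above (γ and the learning rate are the usual illustrative defaults, not tuned values):
import numpy as np

theta = np.array([5.0])
velocity = np.zeros_like(theta)           # v_t, the accumulated update
learning_rate = 0.1                       # eta
gamma = 0.9                               # fraction of the past update kept

for step in range(100):
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    velocity = gamma * velocity + learning_rate * grad
    theta = theta - velocity              # theta = theta - v_t

print(theta)  # converges toward the minimum at 0, with some damped oscillation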
Advantage(s):
(a) If the gradient of the current loss function points in the same direction as the update of the previous time step, then the magnitude of the current update increases in that direction.

For the same sequence of gradients, the 'with momentum' technique therefore covers more ground per step than the 'without momentum' technique.
This implies that there will be less oscillation in the parameter space.
=> Faster convergence due to less variance in the update values.

(b) At the saddle points, although the gradients are close to zero, the momentum term γ · v_{t-1} is not, which means there is still a considerable update value even at the saddle points; this eventually takes us out of the dead trap of the ravine at the saddle points.
RMSprop:
The abbreviation stands for Root Mean Square Propagation.
Unlike 'SGD with momentum', there is no offset value; instead, the learning rate itself is adapted according to the gradients observed on the dataset.
RMSprop is an unpublished adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.
Algorithm:
For brevity, let g_{t,i} denote the gradient of the objective function w.r.t. the parameter θ_i at time step t:
g_{t,i} = ∇_θ J(θ_{t,i})
RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients:
E[g^2]_t = 0.9 · E[g^2]_{t-1} + 0.1 · g_t^2
So, the updated parameter value will be:
θ_{t+1} = θ_t - (η / √(E[g^2]_t + ε)) · g_t
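A minimal sketch of the RMSprop update on the same toy objective J(θ) = θ^2 (the decay rate 0.9 and ε are the commonly used defaults; the learning rate here is purely illustrative):
import numpy as np

theta = np.array([5.0])
avg_sq_grad = np.zeros_like(theta)        # E[g^2]_t, running average of g^2
learning_rate = 0.01
decay = 0.9
eps = 1e-8

for step in range(1000):
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    theta = theta - learning_rate * grad / np.sqrt(avg_sq_grad + eps)

print(theta)  # hovers near the minimum at 0, within roughly the step size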

Advantage(s):
If the magnitude of the gradient at the t-th iteration is smaller than at the (t-1)-th iteration, i.e. |g_t| < |g_{t-1}|, then the running average E[g^2]_t shrinks, so the effective learning rate η / √(E[g^2]_t + ε) increases.
=> This keeps the update value from vanishing along with the raw gradient.
So, the updated parameters will come out of the saddle point dead trap.
Adam (Adaptive Moment Estimation):
This is a hybrid method between SGD with momentum and RMSprop.
In addition to storing an exponentially decaying average of past squared gradients v_t like RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum:
m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2
m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradient g_t, which is where the name comes from.
These estimates are then used to update the parameters just as we have seen in RMSprop, which yields the Adam update rule:
θ_{t+1} = θ_t - (η / (√(v̂_t) + ε)) · m̂_t
where the bias-corrected moment estimates are
m̂_t = m_t / (1 - β_1^t)
v̂_t = v_t / (1 - β_2^t)
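A minimal sketch of the Adam update on the same toy objective J(θ) = θ^2; β_1, β_2, and ε are the defaults proposed in the original Adam paper, while the learning rate and step count are illustrative:
import numpy as np

theta = np.array([5.0])
m = np.zeros_like(theta)                  # first moment estimate m_t
v = np.zeros_like(theta)                  # second moment estimate v_t
learning_rate = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 1001):                  # t starts at 1 for the bias correction
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # moves toward the minimum at theta = 0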

Adam is a particularly suitable optimization technique for non-convex objective functions.
Conclusion:
Besides these optimization techniques, there are Nadam, AdaDelta, AdaMax, etc.
Which one to use depends entirely on the dataset. However, in recent times Adam has become a go-to optimization technique for Deep Learning.