Mathematical Analysis of Gradient Descent Optimization Algorithms

(1) Abstract:
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to give the reader intuitions about the behaviour of the different algorithms so that they can put them to use. In the course of this blog, we look at the different variants of gradient descent, summarize their challenges, and introduce the most common adaptive optimization algorithms.
(2) Introduction :
Gradient Descent (GD) is one of the most common optimization algorithms and is implemented in every state-of-the-art Deep Learning library (e.g. lasagne's, caffe's, and keras' documentation).
GD is a technique to minimise an objective function J(θ) by iteratively applying the update:
θ = θ - η · ∇_θ J(θ)
The objective function is a function of the model's parameters {θ : θ ∈ R^d}. GD minimizes the objective function by updating the parameters in the direction opposite to the gradient of the objective function ∇_θ J(θ), with the learning rate η determining the size of each step we take.

In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
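To make this concrete, here is a minimal sketch of vanilla GD on a toy one-dimensional objective J(θ) = θ^2, whose gradient is 2θ; the objective, starting point, learning rate, and number of steps are chosen purely for illustration:
import numpy as np

# Toy objective J(theta) = theta^2 with gradient 2 * theta (illustration only)
theta = np.array([5.0])
learning_rate = 0.1

for step in range(100):
    grad = 2 * theta                      # gradient of J w.r.t. theta
    theta = theta - learning_rate * grad  # step opposite to the gradient

print(theta)  # approaches the minimum at theta = 0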
(3) Gradient Descent Variants:
There are three different variants of the Gradient Descent algorithm, depending on how much data is used to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time taken to perform an update.
3.1) Batch Gradient Descent (BGD):
Vanilla GD, a.k.a. Batch Gradient Descent, uses the whole training dataset to evaluate the objective function for an epoch and computes the gradient of the cost function w.r.t. the parameters over the entire dataset.

In code, BGD looks like this:
for i in range(nb_epochs):
    # one gradient of the cost over the entire training set per update
    params_grad = evaluate_gradient(cost_function, params, data)
    params = params - learning_rate * params_grad
Disadvantage(s) of BGD:
(a) As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
(b) BGD also does not allow us to update the model online, i.e. with new training examples arriving on-the-fly during training.
3.2) Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD), in contrast, performs a parameter update for each training example (x^(i), y^(i)):
θ = θ - η · ∇_θ J(θ; x^(i), y^(i))

In code, it looks like:
for i in range(nb_epochs):
    np.random.shuffle(data)   # shuffles in place and returns None, so do not assign
    for example in data:
        params_grad = evaluate_gradient(loss_function, params, example)
        params = params - learning_rate * params_grad
Disadvantage(s) of SGD:
(a) As an update is performed for each (randomly shuffled) training example, the updates themselves are noisy.

Because of these frequent per-example updates, the parameter values fluctuate with high variance, which makes convergence unstable.
3.3) Mini-Batch Gradient Descent (MBGD):
MBGD is an intermediate approach between BGD and SGD. It performs an update for every mini-batch of n training examples:
θ = θ - η · ∇_θ J(θ; x^(i:i+n), y^(i:i+n))

In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(cost_function, params, batch)
        params = params - learning_rate * params_grad
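The snippet above relies on a get_batches helper that is not defined in this post; a minimal sketch of such a helper (assuming data is an indexable list or array of training examples) could be:
def get_batches(data, batch_size=50):
    # Yield consecutive mini-batches of batch_size examples from data.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]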
Thus, MBGD is usually the best approach among these GD variants: (a) it reduces the variance of the parameter updates compared with SGD, which can lead to more stable convergence; (b) it does not need to pass over the whole dataset to compute a single update, and each mini-batch can be processed with efficient vectorized operations, which makes it more efficient than the other two variants.
(4) Adaptive Optimization Techniques (AOT):
'Adaptive' => adjusting the magnitude of the update value according to the characteristics of the current dataset and the gradients observed during training.
Motivation for AOT:
One of the prominent shortcomings of vanilla gradient descent is the following:
A key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down.

These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
That means ∇_θ J(θ) ≈ 0 around a saddle point.
That implies that the update η · ∇_θ J(θ) ≈ 0 as well.
So, there will be essentially no update to the parameters around the saddle points.
=> The parameters stall and merely oscillate around the saddle point.
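As a concrete illustration (the function f(x, y) = x^2 - y^2 is chosen here only as a textbook example of a saddle), the gradient, and therefore the vanilla GD update, all but vanishes near the saddle point at the origin:
import numpy as np

def grad_f(theta):
    # f(x, y) = x**2 - y**2 slopes up along x and down along y;
    # its gradient (2x, -2y) vanishes at the saddle point (0, 0).
    x, y = theta
    return np.array([2 * x, -2 * y])

theta = np.array([1e-6, 1e-6])          # a point on the plateau near the saddle
learning_rate = 0.01
update = learning_rate * grad_f(theta)
print(update)                           # ~[2e-8, -2e-8]: essentially no progress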
To get out of that saddle point, there must exist some way of keeping the parameters moving.
'way' => increasing the magnitude of the update even around the saddle points, where the raw gradient is close to zero.

SGD with momentum
SGD with momentum is a modified SGD in which we add an offset to the update value of vanilla SGD.
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the update vector of the past time step (the offset) to the current update vector.
Current update value at the t-th iteration: v_t = γ · v_{t-1} + η · ∇_θ J(θ)
So, the updated parameter value is: θ = θ - v_t
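A minimal sketch of this update rule, reusing the toy objective J(θ) = θ^2 from above (γ and the learning rate are the usual illustrative defaults, not tuned values):
import numpy as np

theta = np.array([5.0])
velocity = np.zeros_like(theta)           # v_t, the accumulated update
learning_rate = 0.1                       # eta
gamma = 0.9                               # fraction of the past update kept

for step in range(100):
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    velocity = gamma * velocity + learning_rate * grad
    theta = theta - velocity              # theta = theta - v_t

print(theta)  # converges toward the minimum at 0, with some damped oscillation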
Advantage(s):
(a) If the gradient of the current loss function points in the same direction as the update of the previous time step, then the magnitude of the current update increases in that direction.

For the same sequence of gradients, the 'with momentum' technique therefore covers more ground per step than the 'without momentum' technique.
This implies that there will be less oscillation in the parameter space.
=> Faster convergence due to less variance in the update values.

(b) At the saddle points, although the gradients are close to zero, the momentum term γ · v_{t-1} is not, which means there is still a considerable update value even at the saddle points; this eventually takes us out of the dead trap of the ravine at the saddle points.
RMSprop:
The abbreviation stands for Root Mean Square Propagation.
Unlike 'SGD with momentum', there is no offset value; instead, the learning rate itself is adapted according to the gradients observed on the dataset.
RMSprop is an unpublished adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.
Algorithm:
For brevity, let g_{t,i} denote the gradient of the objective function w.r.t. the parameter θ_i at time step t:
g_{t,i} = ∇_θ J(θ_{t,i})
RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients:
E[g^2]_t = 0.9 · E[g^2]_{t-1} + 0.1 · g_t^2
So, the updated parameter value will be:
θ_{t+1} = θ_t - (η / √(E[g^2]_t + ε)) · g_t
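A minimal sketch of the RMSprop update on the same toy objective J(θ) = θ^2 (the decay rate 0.9 and ε are the commonly used defaults; the learning rate here is purely illustrative):
import numpy as np

theta = np.array([5.0])
avg_sq_grad = np.zeros_like(theta)        # E[g^2]_t, running average of g^2
learning_rate = 0.01
decay = 0.9
eps = 1e-8

for step in range(1000):
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    theta = theta - learning_rate * grad / np.sqrt(avg_sq_grad + eps)

print(theta)  # hovers near the minimum at 0, within roughly the step size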

Advantage(s):
If the magnitude of the gradient at the t-th iteration is smaller than at the (t-1)-th iteration, i.e. |g_t| < |g_{t-1}|, then the running average E[g^2]_t shrinks, so the effective learning rate η / √(E[g^2]_t + ε) increases.
=> This keeps the update value from vanishing along with the raw gradient.
So, the updated parameters will come out of the saddle point dead trap.
Adam (Adaptive Moment Estimation):
This is a hybrid method between SGD with momentum and RMSprop.
In addition to storing an exponentially decaying average of past squared gradients v_t like RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum:
m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2
m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradient g_t, which is where the name comes from.
These estimates are then used to update the parameters just as we have seen in RMSprop, which yields the Adam update rule:
θ_{t+1} = θ_t - (η / (√(v̂_t) + ε)) · m̂_t
where the bias-corrected moment estimates are
m̂_t = m_t / (1 - β_1^t)
v̂_t = v_t / (1 - β_2^t)
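A minimal sketch of the Adam update on the same toy objective J(θ) = θ^2; β_1, β_2, and ε are the defaults proposed in the original Adam paper, while the learning rate and step count are illustrative:
import numpy as np

theta = np.array([5.0])
m = np.zeros_like(theta)                  # first moment estimate m_t
v = np.zeros_like(theta)                  # second moment estimate v_t
learning_rate = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 1001):                  # t starts at 1 for the bias correction
    grad = 2 * theta                      # gradient of J(theta) = theta^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # moves toward the minimum at theta = 0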

Adam is a particularly suitable optimization technique for non-convex objective functions.
Conclusion:
Besides these optimization techniques, there are Nadam, AdaDelta, AdaMax, etc.
Which one to use depends entirely on the dataset. However, in recent times Adam has become a go-to optimization technique for Deep Learning.