PyTorch Implementations of Optimizers from Scratch
Note: All implementations are based on the article “An Overview of Gradient Descent Optimization Algorithms” by Sebastian Ruder.
Gradient descent is a way to minimize an objective function J(θ), parameterized by a model’s parameters θ, by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. (Taken from the article cited below.)
A simple GIF demonstrating gradient descent on a convex parabolic loss surface, where iterative updates guide the parameters toward the global minimum, is shown below:

Gradient descent on a convex loss surface. The ball rolls downhill and converges to the global minimum.
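To make the update rule concrete, here is a minimal sketch of gradient descent on the same kind of convex parabola, J(θ) = θ². The starting point and learning rate are arbitrary illustrative choices, not values from this repo:

```python
import torch

# Minimal sketch: gradient descent on the convex parabola J(theta) = theta^2.
theta = torch.tensor([5.0], requires_grad=True)  # arbitrary starting point
lr = 0.1                                         # learning rate (eta)

for step in range(50):
    loss = (theta ** 2).sum()      # J(theta) = theta^2
    loss.backward()                # computes dJ/dtheta = 2 * theta
    with torch.no_grad():
        theta -= lr * theta.grad   # step in the direction opposite the gradient
    theta.grad.zero_()             # reset the accumulated gradient

print(theta)  # close to 0, the global minimum of the parabola
```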
There are three main variants of gradient descent (batch, stochastic, and mini-batch), and they differ in the amount of data used to compute the gradient of the objective function.
Batch gradient descent (BGD) computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset. We perform an update in the direction opposite to the gradient of the loss function, and the learning rate η determines how large of an update we perform.
Update rule: θ = θ − η · ∇θJ(θ)
Batch gradient descent will converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
It can be very slow and infeasible for datasets that do not fit into the system memory (as the entire dataset must be available to compute gradients).
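A from-scratch sketch of what this looks like in PyTorch, using a toy linear-regression problem (the data, model, and learning rate are illustrative assumptions, not code from this repo):

```python
import torch

# Toy data: y = 3x + noise. Batch GD uses the *entire* dataset per update.
torch.manual_seed(0)
X = torch.randn(100, 1)
y = 3 * X + 0.1 * torch.randn(100, 1)

w = torch.zeros(1, requires_grad=True)  # single learnable parameter
lr = 0.1

for epoch in range(200):
    loss = ((X * w - y) ** 2).mean()   # J(w) computed over the full training set
    loss.backward()                    # gradient of J w.r.t. w over all N samples
    with torch.no_grad():
        w -= lr * w.grad               # w = w - lr * grad J(w)
    w.grad.zero_()

print(w)  # close to the true slope, 3
```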
Batch gradient descent also has two notable problems. First, while it has strong convergence guarantees for convex objectives, in non-convex settings it can get stuck at saddle points or poor local minima due to the complex geometry of the loss landscape. Second, it performs redundant computations on large datasets: gradients for similar data points are recomputed for every parameter update.
These limitations motivate stochastic gradient descent (SGD), which replaces the full-batch gradient with a stochastic estimate. Parameters are updated using individual training samples, θ = θ − η · ∇θJᵢ(θ), where Jᵢ is the loss on the i-th sample. This results in faster updates, improved scalability, and support for online learning.
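A sketch of the same toy regression problem trained with per-sample SGD updates (again an illustrative setup, not code from this repo):

```python
import torch

# Same toy data as before, but SGD updates w after *each* individual sample.
torch.manual_seed(0)
X = torch.randn(100, 1)
y = 3 * X + 0.1 * torch.randn(100, 1)

w = torch.zeros(1, requires_grad=True)
lr = 0.05

for epoch in range(20):
    for i in torch.randperm(len(X)):            # shuffle, then one sample per update
        loss = (X[i] * w - y[i]).pow(2).sum()   # J_i(w): loss on sample i only
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad                    # w = w - lr * grad J_i(w)
        w.grad.zero_()

print(w)  # the path is noisy, but w still ends up close to 3
```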
Does updating with the gradient of a single randomly chosen training sample still move us toward convergence? Mostly, yes. This is because the SGD gradient is unbiased.
Why is the SGD gradient unbiased?
Let the stochastic gradient for a randomly sampled data point i be

g(θ) = ∇θJᵢ(θ),

where Jᵢ is the loss on the i-th sample and the index i is sampled uniformly from {1, ..., N}. Taking the expectation over the random choice of i:

E[g(θ)] = (1/N) Σᵢ₌₁ᴺ ∇θJᵢ(θ) = ∇θJ(θ)
So, on average, SGD points in the exact same direction as batch gradient descent, even though each individual update uses only a single data point.
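This claim is easy to verify numerically. The sketch below (a toy setup assumed for illustration) compares the full-batch gradient with the average of the N per-sample gradients:

```python
import torch

# Numerical check of unbiasedness: the mean of the N per-sample gradients
# equals the full-batch gradient at the same parameter value.
torch.manual_seed(0)
X = torch.randn(8, 1)
y = 3 * X + 0.1 * torch.randn(8, 1)
w0 = torch.tensor([1.0])  # arbitrary parameter value at which to compare gradients

def grad_on(xb, yb, w):
    """Gradient of the mean squared error on batch (xb, yb) w.r.t. w."""
    w = w.clone().requires_grad_(True)
    loss = ((xb * w - yb) ** 2).mean()
    loss.backward()
    return w.grad

full_grad = grad_on(X, y, w0)
per_sample = torch.stack([grad_on(X[i:i+1], y[i:i+1], w0) for i in range(len(X))])

print(full_grad)           # grad of J(w) on the whole batch
print(per_sample.mean(0))  # E_i[grad J_i(w)], matches full_grad
```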
Citation:
@article{ruder2016overview,
  title={An overview of gradient descent optimization algorithms},
  author={Ruder, Sebastian},
  journal={arXiv preprint arXiv:1609.04747},
  year={2016},
  url={https://arxiv.org/abs/1609.04747}
}