Batch Variants — Gradient Descent

How to work with large data!

Anonymousket
5 min read · May 14, 2024

In the previous articles, we discussed gradient descent and momentum-based gradient descent quantum algorithms.

In this article, we discuss improved versions of Gradient Descent that can handle huge datasets.

Introduction:

Gradient Descent serves as a fundamental approach for optimizing the parameters of a model to minimize a cost function. The basic idea is to iteratively update the model parameters in the opposite direction of the gradient of the cost function. But in real-world scenarios with ever-growing datasets, it has two major limitations:

  • Calculating derivatives for the entire dataset is time-consuming.
  • The memory required is proportional to the size of the dataset.

These computations are typically done in memory, so the larger the dataset, the greater the memory requirement.
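For contrast, here is a minimal sketch of plain (full-batch) Gradient Descent on a toy one-parameter problem; the data, loss, and names below are illustrative assumptions, not part of the examples later in this article. Note that every single update touches the whole dataset.

import numpy as np

# Toy dataset: fit one parameter x to be close to all data points,
# minimizing J(x) = mean of 0.5 * (x - d_i)^2 over the dataset.
data = np.random.uniform(low=-5, high=5, size=10_000)

def full_batch_gradient(x, data):
    # One update reads every data point, so both the compute and the
    # memory needed to hold the data grow with the dataset size.
    return np.mean(x - data)

x = 0.0
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * full_batch_gradient(x, data)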

This is where Stochastic Gradient Descent comes in!

SGD:

Stochastic Gradient Descent introduces a key modification to the traditional Gradient Descent algorithm by updating the parameters using only a small, randomly selected subset of the training data at each iteration. This stochastic approach significantly reduces the computational requirements, allowing for faster convergence and more efficient model training.

Following are the steps:

Step 1: Randomly select a subset of the data of size 1 (a single training example).

Step 2: Select a learning rate α.

Step 3: Select initial parameter values x as the starting point.

The update is only a minor modification of the Gradient Descent update; SGD simply sets the batch size m to 1.
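In standard notation (parameters x_j, learning rate α, and per-example loss J(x; x⁽ⁱ⁾), all assumed here), Gradient Descent over a batch of m examples performs

x_j ← x_j − α · (1/m) · Σ_{i=1..m} ∂J(x; x⁽ⁱ⁾) / ∂x_j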

Step 4: Update all parameters using the gradient of a single, randomly chosen training example; with m = 1 we can rewrite the above expression as
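x_j ← x_j − α · ∂J(x; x⁽ⁱ⁾) / ∂x_j, with the example index i drawn at random at each step.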

Step 5: Repeat Step 4 until a local minimum is reached.

Below is Python code with a simple function.

Import the required modules:

import numpy as np
import matplotlib.pyplot as plt

Define the function (the loss function, in the machine-learning setting):

# Function to simulate a simple convex loss surface
def loss_function(x):
    return 0.5 * (x - 3) ** 2

Now, let us define the gradient (the vector of partial derivatives):

# Gradient of the loss function
def gradient(x):
    return x - 3

Below is the code for the Stochastic Gradient Descent algorithm, which simulates a random data point in the given range at each step and then applies the update rule.

# Stochastic Gradient Descent algorithm
def stochastic_gradient_descent(learning_rate, epochs):
    np.random.seed(42)
    x = 0  # Initial parameter value
    trajectory = [x]

    for epoch in range(epochs):
        random_data_point = np.random.uniform(low=-5, high=5)
        x = x - learning_rate * gradient(random_data_point)
        trajectory.append(x)

    return trajectory

Initialize and generate the required data:

# Set parameters
learning_rate = 0.1
epochs = 50

# Run Stochastic Gradient Descent
trajectory = stochastic_gradient_descent(learning_rate, epochs)

x_values = np.linspace(-5, 5, 100)
y_values = loss_function(x_values)

Plot the loss surface and the trajectory of SGD:

plt.plot(x_values, y_values, label='Loss Surface')
plt.scatter(
    trajectory,
    [loss_function(x) for x in trajectory],
    color='red',
    label='SGD Trajectory'
)
plt.title('Stochastic Gradient Descent')
plt.xlabel('Parameter Value')
plt.ylabel('Loss')
plt.legend()
plt.show()

A three-dimensional plot for the above code is shown below.

An animated version of SGD is shown below.

Mini-Batch:

While Batch Gradient Descent (plain Gradient Descent over the full dataset) assures smooth convergence on convex functions and full dataset utilization, Stochastic Gradient Descent (SGD) shines in handling large datasets and often converges faster in practice. However, both methods come with trade-offs that can impact computational efficiency and convergence speed.

To strike a balance between these two extremes, a hybrid approach known as Mini-Batch Gradient Descent is useful. This technique leverages the advantages of both Batch Gradient Descent and Stochastic Gradient Descent while mitigating their drawbacks.

Mini-Batch Gradient Descent operates by dividing the dataset into smaller, equally sized subsets called mini-batches. These mini-batches provide enough data to perform stable parameter updates while remaining small enough to fit into memory efficiently.
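As a rough sketch (the array and function names here are illustrative assumptions, not part of the example below), shuffling a dataset and slicing it into mini-batches can look like this:

import numpy as np

def make_mini_batches(data, batch_size):
    # Shuffle once per epoch, then cut the shuffled data into
    # consecutive chunks of `batch_size` examples
    # (the last chunk may be smaller).
    shuffled = data[np.random.permutation(len(data))]
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

data = np.random.uniform(low=-10, high=10, size=1000)
for batch in make_mini_batches(data, batch_size=32):
    pass  # one parameter update per mini-batch goes here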

Following are the steps:

Step 1: Randomly shuffle the dataset and split it into mini-batches of size m (such as 16, 32, 64, 128, etc.).

Step 2: Select a learning rate α.

Step 3: Select initial parameter values x as the starting point.

The update is only a minor modification of the Gradient Descent update; here the gradient is averaged over a mini-batch instead of the full dataset.
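In the same assumed notation as before, plain Gradient Descent averages the gradient over the entire training set of n examples:

x_j ← x_j − α · (1/n) · Σ_{i=1..n} ∂J(x; x⁽ⁱ⁾) / ∂x_j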

Step 4: Update all parameters using the average gradient over a randomly selected mini-batch of m examples; we can rewrite the above expression as
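x_j ← x_j − α · (1/m) · Σ_{i∈B} ∂J(x; x⁽ⁱ⁾) / ∂x_j, where B is a randomly drawn mini-batch of m examples.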

Step 5: Repeat Step 4 until a local minimum is reached.

Below is Python code with a simple function.

Import the required modules:

import numpy as np
import matplotlib.pyplot as plt

Define the function (the loss function, in the machine-learning setting):

# Function to simulate a simple convex loss surface
def loss_function(x):
    return x ** 2

Now, let us define the gradient (the vector of partial derivatives):

# Gradient of the loss function
def gradient(x):
    return 2 * x

Below is the code for the Mini-Batch Gradient Descent algorithm, which simulates a mini-batch of random data points in the given range at each step and then applies the update rule.

# Mini-Batch Gradient Descent algorithm
def mini_batch_gradient_descent(learning_rate, epochs, batch_size):
    np.random.seed(42)
    x = 0  # Initial parameter value
    trajectory = [x]

    for epoch in range(epochs):
        random_data_points = np.random.uniform(low=-10, high=10, size=batch_size)
        gradients = np.mean([gradient(point) for point in random_data_points])
        x = x - learning_rate * gradients
        trajectory.append(x)

    return trajectory

Initialize and generate the required data:

# Set parameters
learning_rate = 0.1
epochs = 50
batch_size = 10

# Run Mini-Batch Gradient Descent
trajectory = mini_batch_gradient_descent(learning_rate, epochs, batch_size)

# Plotting the loss surface and the trajectory of Mini-Batch Gradient Descent
x_values = np.linspace(-10, 10, 100)
y_values = loss_function(x_values)

Plot the loss surface and the trajectory of Mini-Batch Gradient Descent:

plt.plot(x_values, y_values, label='Loss Surface')
plt.scatter(
    trajectory,
    [loss_function(x) for x in trajectory],
    color='red',
    label='Mini-Batch Gradient Descent Trajectory'
)
plt.title('Mini-Batch Gradient Descent')
plt.xlabel('Parameter Value')
plt.ylabel('Loss')
plt.legend()
plt.show()

Comparison between Batch GD, SGD, and Mini-Batch GD:
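As a rough summary (with n training examples and a mini-batch size of m):

  • Batch GD: uses all n examples per update; smooth, stable steps; highest compute and memory cost per update.
  • SGD: uses 1 example per update; cheapest per update, but the trajectory is noisy.
  • Mini-Batch GD: uses m examples per update; a middle ground that keeps updates cheap while averaging out much of the noise, and it vectorizes well on modern hardware.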

An animated comparison of the three variants is shown below.

References: Wikipedia, IIT-M AI/ML course, open-source materials.
