QAlog: Quantum Momentum Based Gradient Descent

How to escape local minima?

Anonymousket
11 min readApr 27, 2024

In the previous article, we have discussed about the gradient descent algorithm and it’s quantum implementation.

In this article, we discuss about the momentum based Gradient Descent algorithm, which overcomes some of the problems of gradient descent algorithm.

Introduction:

Gradient Descent is an optimization algorithm used to minimize the cost or loss function of a model during training. While Vanilla Gradient Descent is effective, it sometimes faces challenges like slow convergence, oscillations, and difficulties in escaping local minima.

In other ways, Gradient Descent takes a lot of time to navigate regions having a gentle slope, because the gradient in these regions is very small.

Slow Convergence:

Gradient descent can converge slowly, especially in regions with shallow gradients or plateaus. In such cases, the updates to the parameters can be very small, leading to slow progress towards the minimum of the loss function.

Sensitivity to Learning Rate:

The choice of learning rate (α) in gradient descent is critical. A learning rate that is too large can cause the optimization process to overshoot the minimum and diverge, while a learning rate that is too small can result in slow convergence.

Oscillations and Zigzagging:

In regions with high curvature or sharp turns, gradient descent may oscillate or zigzag around the minimum rather than converging smoothly. This can prolong the optimization process and make it less stable.

Local Minima and Saddle Points:

Gradient descent is prone to getting stuck in local minima or saddle points, especially in high-dimensional spaces. These stationary points can hinder progress towards the global minimum of the loss function.

Noisy or Sparse Gradients:

In scenarios where gradients exhibit noise or sparsity, the effectiveness of gradient descent in achieving convergence may be compromised. Noisy gradients introduce inconsistency, resulting in unpredictable updates. On the other hand, sparse gradients offer only partial information, making it difficult to accurately discern the descent direction. Consequently, gradient descent encounters challenges in efficiently navigating the parameter space, hindering its ability to converge towards the optimal solution.

Vanishing or Exploding Gradients:

In deep neural networks, gradient descent can suffer from vanishing or exploding gradients, especially in deep architectures. Vanishing gradients lead to slow training, while exploding gradients can cause the optimization process to diverge.

Momentum-based Gradient Descent, introduces a concept called momentum to address these issues.

As the gradient consistently advances in a singular direction through successive iterations, in accordance with the principles of Newtonian physics, the entity is poised to acquire increased inertia, facilitating swifter movement. By adhering to this principle, maintaining a consistent direction fosters confidence, encouraging the adoption of bigger steps over smaller ones.

we have seen the expression for gradient descent as shown below

Momentum in Optimization:

Momentum is inspired by physics and involves accumulating the past gradients to give the optimization process a sense of inertia. In other words, momentum helps the optimization algorithm build up speed in the directions where the gradient consistently points. This tends in faster convergence and smoother navigation through complex loss landscapes.

The inclusion of a momentum term to the Gradient Descent can be expressed as follows

where β is the momentum parameter 0 < β < 1, where v_t is the velocity vector at iteration t. The velocity vector is a weighted average of the past gradients, where the weight of the past gradients decays exponentially with the factor β. The weight vector is updated by adding the velocity vector to the previous weight vector.

here if we assume, β as mass and v as velocity, then we can consider β * v (mass * velocity) which is momentum(p), p=m⋅v.

when the value of β is zero, then the expression is same as gradient descent.

Mathematical interpretation:

Let’s check the how much previous velocity contributes.

if we replace with the previous velocity

by expanding we get

now further replace the velocity vector with the previous value

by expanding we get

we can get the general form as shown below

where v_0​ is the initial velocity vector, which is usually set to zero. Assuming that β<1, we can see that the first momentum term​ vanishes as t increases, and the velocity vector becomes a weighted sum of the past gradients. The weight of the past gradients decreases exponentially with the factor β, giving more importance to the recent gradients.

we can also say this as exponential decay.

Advantages:

Accelerated Convergence:

Traditional gradient descent often converges slowly, especially in regions of high curvature or near saddle points.

Momentum helps to accelerate convergence by accumulating gradients from past iterations, enabling faster progress towards the minimum.

Shallow Local Minima:

Gradient descent can get stuck in shallow local minima and struggle to escape due to the small step sizes it takes.

Momentum allows the optimizer to traverse through such regions more effectively, enabling it to escape shallow local minima and continue searching for better solutions.

Learning Rate Sensitivity:

Traditional gradient descent requires careful tuning of the learning rate, which can be time-consuming and challenging, especially for complex optimization landscapes.

Momentum reduces the sensitivity to the learning rate, allowing for more robust and stable optimization across a wide range of learning rates.

Escaping Plateaus:

Gradient descent struggles to escape plateaus, where the gradient magnitude is very small, leading to slow progress.

Momentum allows the optimizer to maintain velocity even on flat regions, facilitating faster traversal through plateaus.

python code for momentum based gradient descent

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

a quadratic function

def loss_function(x, y):
return x**2 + y**2

Gradient of the loss function or vector with partial derivatives

def gradient(x, y):
return np.array([2 * x, 2 * y])

Momentum-based Gradient Descent

def momentum_gradient_descent(learning_rate, beta, epochs):
x = np.array([2.0, 2.0]) # Initial parameters
v = np.zeros_like(x)

trajectory = [x]

for epoch in range(epochs):
grad = gradient(x[0], x[1])
# Update velocity using momentum
v = beta * v -learning_rate * grad
x = x + v
trajectory.append(x)

return np.array(trajectory)

Get the required parameters

# Set parameters
learning_rate = 0.1
beta = 0.9
epochs = 50

# Run Momentum-based Gradient Descent
trajectory = momentum_gradient_descent(learning_rate, beta, epochs)

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = loss_function(X, Y)

Visualize the 3D loss landscape and optimization trajectory

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax.plot(
trajectory[:, 0],
trajectory[:, 1],
loss_function(
trajectory[:, 0],
trajectory[:, 1]
),
marker='o',
color='red',
label='Momentum-based GD Trajectory'
)
ax.set_xlabel('Parameter 1')
ax.set_ylabel('Parameter 2')
ax.set_zlabel('Loss')
ax.set_title('Momentum-based Gradient Descent in 3D Loss Landscape')
ax.legend()

plt.show()

plot the graph

below is the 2d plot for the momentum based gradient descent

below is the animated graph generated epoch wise.

Quantum Algorithm:

We use qiskit to apply momentum based gradient descent to the quantum data.

import required libraries

import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer

Define objective function and its gradient

# Objective function and its gradient
def objective_function(x):
return x**2

def gradient_function(x):
return 2*x

initialize the parameters, circuit parameters and quantum circuit

x = 2
learning_rate = 0.1
num_iterations = 10
beta = 0.9 # Momentum coefficient

# Initialize momentum
momentum = 0

# Quantum circuit parameters
num_qubits = 1
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

# Quantum gate for encoding parameter
def encode_parameter(qc, x):
# Scale the parameter to fit within the range [-1, 1]
scaled_x = (2 * np.clip(x, 0, 1)) - 1
qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)

measurement and simulator block

# Quantum gate for measuring objective function
def measure_objective(qc):
qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

# Run gradient descent iterations
for i in range(num_iterations):
# Update momentum
momentum = beta * momentum - learning_rate * gradient

# Update parameter using momentum-based gradient descent
x += momentum

# Reset quantum circuit for each iteration
qc.data = []

# Encode parameter into quantum state
encode_parameter(qc, x)

# Measure objective function
measure_objective(qc)

# Execute the circuit and obtain measurement results
job = simulator.run(qc, shots=shots)
result = job.result()
counts = result.get_counts(qc)

# Estimate objective function value
estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

print(f"""Iteration {i+1}: Parameter = {x},
Objective function value = {estimated_value}""")

# Print final result
print(f"Optimized parameter value: {x}")

Results:

Iteration 1: Parameter = 1.6,
Objective function value = -1.0
Iteration 2: Parameter = 0.9199999999999999,
Objective function value = -0.848
Iteration 3: Parameter = 0.12399999999999967,
Objective function value = 0.738
Iteration 4: Parameter = -0.6172000000000005,
Objective function value = 1.0
Iteration 5: Parameter = -1.1608400000000008,
Objective function value = 1.0
Iteration 6: Parameter = -1.4179480000000007,
Objective function value = 1.0
Iteration 7: Parameter = -1.3657556000000004,
Objective function value = 1.0
Iteration 8: Parameter = -1.04563132,
Objective function value = 1.0
Iteration 9: Parameter = -0.5483932039999998,
Objective function value = 1.0
Iteration 10: Parameter = 0.008799741200000377,
Objective function value = 0.978


Optimized parameter value: 0.008799741200000377

Now let’s apply the quantum circuit using qiskit with iris dataset.

import required modules

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from qiskit import QuantumCircuit
from qiskit_aer import Aer

load the iris data using sklearn

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

Normalize the data using Min Max

# Normalize feature values
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

Define Objective and gradient function

# Define objective function and its gradient
def objective_function(w):
# Simple objective function:
# sum of squared differences between actual and predicted class labels
predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)
return np.sum((predicted_labels - y)**2)

def gradient_function(w):
# Gradient of the objective function
predicted_labels = np.dot(X_normalized, w)
error = predicted_labels - np.eye(3)[y]
gradient = 2 * np.dot(X_normalized.T, error)
return gradient

Initialize parameters, quantum parameters and quantum circuit

# Initialize parameters
num_features = X_normalized.shape[1]
num_classes = len(np.unique(y))
w = np.random.rand(num_features, num_classes)
learning_rate = 0.01
num_iterations = 10
beta = 0.9 # Momentum coefficient

# Initialize momentum
momentum = np.zeros_like(w)

# Quantum circuit parameters
num_qubits = num_features
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)

define encode function and measurement

# Quantum gate for encoding parameters
def encode_parameters(qc, w):
for i in range(num_qubits):
for j in range(num_classes):
# Clip values to ensure they fall within [0, 1] range
clipped_value = np.clip(w[i, j], 0, 1)
# Apply arcsin gate
qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function
def measure_objective(qc):
qc.measure_all()

gradient iterations

# Run gradient descent iterations
for i in range(num_iterations):
# Compute gradient
gradient = gradient_function(w)

# Update momentum
momentum = beta * momentum - learning_rate * gradient

# Update parameters using momentum-based gradient descent
w += momentum

# Reset quantum circuit for each iteration
qc.data = []

# Encode parameters into quantum state
encode_parameters(qc, w)

# Measure objective function
measure_objective(qc)

# Execute the circuit and obtain measurement results
simulator = Aer.get_backend('qasm_simulator')
job = simulator.run(qc, shots=shots)
result = job.result()
counts = result.get_counts(qc)

# Estimate objective function value
estimated_value = (
counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)
) / shots

print(f'''Iteration {i+1}: Parameters = {w},
Objective function value = {estimated_value}''')

# Print final result
print(f"Optimized parameters: {w}")

Results

Iteration 1: Parameters = [[-0.29964495 -0.31073152 -0.44592472]
[ 0.27494478 -0.36202519 -0.42434317]
[-0.39818639 0.12698513 -1.25460153]
[-0.22939201 -0.37804032 -0.41751743]],
Objective function value = 0.0
Iteration 2: Parameters = [[ 0.09679955 0.25386389 0.84328198]
[ 1.31812401 0.13279514 0.28474462]
[-0.19391962 0.78700958 0.36008895]
[-0.06217427 0.20745267 1.24764907]],
Objective function value = 0.0
Iteration 3: Parameters = [[ 0.05246948 0.19013628 0.6460111 ]
[ 2.04158037 0.12429927 -0.20478765]
[-0.49444896 0.74702635 0.29716278]
[-0.4048776 0.06536049 1.2575407 ]],
Objective function value = 0.0
Iteration 4: Parameters = [[-0.25912711 -0.2490117 -0.43751992]
[ 2.36013153 -0.19721495 -1.30189119]
[-1.02072521 0.29829862 -0.80427938]
[-0.97123206 -0.50932289 0.24296033]],
Objective function value = 0.0
Iteration 5: Parameters = [[ 0.07882118 0.25368218 0.69261025]
[ 2.86532938 0.2186069 -0.44663841]
[-0.67821041 0.94029062 0.55071991]
[-0.66433912 -0.01695024 1.68420032]],
Objective function value = 0.0
Iteration 6: Parameters = [[-0.0197069 0.13960546 0.37909869]
[ 2.6538278 0.1183774 -0.56507425]
[-0.69118038 0.90111653 0.19743774]
[-0.70367342 -0.2208152 1.42445917]],
Objective function value = 0.0
Iteration 7: Parameters = [[-0.28416977 -0.20640208 -0.47120535]
[ 2.02407043 -0.18304538 -0.9715226 ]
[-0.78328385 0.61551574 -0.82029383]
[-0.81584144 -0.68395919 0.5034323 ]],
Objective function value = 0.0
Iteration 8: Parameters = [[-0.00496432 0.22863605 0.52302355]
[ 1.70380373 0.13845403 0.166565 ]
[-0.2073544 1.23053489 0.23156122]
[-0.26045133 -0.25942147 1.6479053 ]],
Objective function value = 0.0
Iteration 9: Parameters = [[-0.13294644 0.07529964 0.14575957]
[ 1.10877931 -0.00594372 0.17369808]
[-0.11171671 1.18474846 -0.27100427]
[-0.18824884 -0.50171798 1.24432998]],
Objective function value = 0.0
Iteration 10: Parameters = [[-0.3382088 -0.19755416 -0.50068498]
[ 0.65044262 -0.23019265 -0.13908908]
[-0.17645434 1.00697057 -1.04938551]
[-0.28461234 -0.88095847 0.56827026]],
Objective function value = 0.0


Optimized parameters: [[-0.3382088 -0.19755416 -0.50068498]
[ 0.65044262 -0.23019265 -0.13908908]
[-0.17645434 1.00697057 -1.04938551]
[-0.28461234 -0.88095847 0.56827026]]

Conclusion:

Momentum-based Gradient Descent is a powerful optimization algorithm that addresses some of the challenges faced by basic Gradient Descent. Its ability to build up speed and reduce oscillations contributes to faster convergence and more stable training of machine learning models.

update: 26/04/2024 — added plots to highlight the comparison between gradient descent and momentum based gradient descent in the advantages section.

reference: IIT-M course, Wikipedia, google colab, stack exchange, open sources.

--

--