# QAlog: Quantum Momentum Based Gradient Descent

## How to escape local minima?

In the previous article, we have discussed about the gradient descent algorithm and it’s quantum implementation.

In this article, we discuss about the momentum based Gradient Descent algorithm, which overcomes some of the problems of gradient descent algorithm.

**Introduction:**

Gradient Descent is an optimization algorithm used to minimize the cost or loss function of a model during training. While Vanilla Gradient Descent is effective, it sometimes faces challenges like slow convergence, oscillations, and difficulties in escaping local minima.

In other ways, Gradient Descent takes a lot of time to navigate regions having a gentle slope, because the gradient in these regions is very small.

*Slow Convergence:*

Gradient descent can converge slowly, especially in regions with shallow gradients or plateaus. In such cases, the updates to the parameters can be very small, leading to slow progress towards the minimum of the loss function.

*Sensitivity to Learning Rate:*

The choice of learning rate (*α*) in gradient descent is critical. A learning rate that is too large can cause the optimization process to overshoot the minimum and diverge, while a learning rate that is too small can result in slow convergence.

*Oscillations and Zigzagging:*

In regions with high curvature or sharp turns, gradient descent may oscillate or zigzag around the minimum rather than converging smoothly. This can prolong the optimization process and make it less stable.

*Local Minima and Saddle Points:*

Gradient descent is prone to getting stuck in local minima or saddle points, especially in high-dimensional spaces. These stationary points can hinder progress towards the global minimum of the loss function.

*Noisy or Sparse Gradients:*

In scenarios where gradients exhibit noise or sparsity, the effectiveness of gradient descent in achieving convergence may be compromised. Noisy gradients introduce inconsistency, resulting in unpredictable updates. On the other hand, sparse gradients offer only partial information, making it difficult to accurately discern the descent direction. Consequently, gradient descent encounters challenges in efficiently navigating the parameter space, hindering its ability to converge towards the optimal solution.

*Vanishing or Exploding Gradients:*

In deep neural networks, gradient descent can suffer from vanishing or exploding gradients, especially in deep architectures. Vanishing gradients lead to slow training, while exploding gradients can cause the optimization process to diverge.

Momentum-based Gradient Descent, introduces a concept called *momentum* to address these issues.

As the gradient consistently advances in a singular direction through successive iterations, in accordance with the principles of Newtonian physics, the entity is poised to acquire increased inertia, facilitating swifter movement. By adhering to this principle, maintaining a consistent direction fosters confidence, encouraging the adoption of bigger steps over smaller ones.

we have seen the expression for gradient descent as shown below

**Momentum in Optimization:**

Momentum is inspired by physics and involves accumulating the past gradients to give the optimization process a sense of *inertia*. In other words, momentum helps the optimization algorithm build up speed in the directions where the gradient consistently points. This tends in faster convergence and smoother navigation through complex loss landscapes.

The inclusion of a momentum term to the Gradient Descent can be expressed as follows

where β is the momentum parameter 0 < *β* < 1, where *v_t* is the velocity vector at iteration *t*. The velocity vector is a weighted average of the past gradients, where the weight of the past gradients decays exponentially with the factor *β*. The weight vector is updated by adding the velocity vector to the previous weight vector.

here if we assume, *β* as mass and *v* as velocity, then we can consider *β ** *v* (mass * velocity) which is momentum(p), p=m⋅v.

when the value of *β* is zero, then the expression is same as gradient descent.

**Mathematical interpretation:**

Let’s check the how much previous velocity contributes.

if we replace with the previous velocity

by expanding we get

now further replace the velocity vector with the previous value

by expanding we get

we can get the general form as shown below

where *v*_*0* is the initial velocity vector, which is usually set to zero. Assuming that *β*<1, we can see that the first momentum term vanishes as t increases, and the velocity vector becomes a weighted sum of the past gradients. The weight of the past gradients decreases exponentially with the factor β, giving more importance to the recent gradients.

we can also say this as ** exponential decay**.

**Advantages:**

*Accelerated Convergence:*

Traditional gradient descent often converges slowly, especially in regions of high curvature or near saddle points.

Momentum helps to accelerate convergence by accumulating gradients from past iterations, enabling faster progress towards the minimum.

*Shallow Local Minima:*

Gradient descent can get stuck in shallow local minima and struggle to escape due to the small step sizes it takes.

Momentum allows the optimizer to traverse through such regions more effectively, enabling it to escape shallow local minima and continue searching for better solutions.

*Learning Rate Sensitivity:*

Traditional gradient descent requires careful tuning of the learning rate, which can be time-consuming and challenging, especially for complex optimization landscapes.

Momentum reduces the sensitivity to the learning rate, allowing for more robust and stable optimization across a wide range of learning rates.

*Escaping Plateaus:*

Gradient descent struggles to escape plateaus, where the gradient magnitude is very small, leading to slow progress.

Momentum allows the optimizer to maintain velocity even on flat regions, facilitating faster traversal through plateaus.

python code for momentum based gradient descent

`import numpy as np`

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

a quadratic function

`def loss_function(x, y):`

return x**2 + y**2

Gradient of the loss function or vector with partial derivatives

`def gradient(x, y):`

return np.array([2 * x, 2 * y])

Momentum-based Gradient Descent

`def momentum_gradient_descent(learning_rate, beta, epochs):`

x = np.array([2.0, 2.0]) # Initial parameters

v = np.zeros_like(x)

trajectory = [x]

for epoch in range(epochs):

grad = gradient(x[0], x[1])

# Update velocity using momentum

v = beta * v -learning_rate * grad

x = x + v

trajectory.append(x)

return np.array(trajectory)

Get the required parameters

`# Set parameters`

learning_rate = 0.1

beta = 0.9

epochs = 50

# Run Momentum-based Gradient Descent

trajectory = momentum_gradient_descent(learning_rate, beta, epochs)

x = np.linspace(-3, 3, 100)

y = np.linspace(-3, 3, 100)

X, Y = np.meshgrid(x, y)

Z = loss_function(X, Y)

Visualize the 3D loss landscape and optimization trajectory

`fig = plt.figure(figsize=(10, 8))`

ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)

ax.plot(

trajectory[:, 0],

trajectory[:, 1],

loss_function(

trajectory[:, 0],

trajectory[:, 1]

),

marker='o',

color='red',

label='Momentum-based GD Trajectory'

)

ax.set_xlabel('Parameter 1')

ax.set_ylabel('Parameter 2')

ax.set_zlabel('Loss')

ax.set_title('Momentum-based Gradient Descent in 3D Loss Landscape')

ax.legend()

plt.show()

plot the graph

below is the 2d plot for the momentum based gradient descent

below is the animated graph generated epoch wise.

**Quantum Algorithm:**

We use qiskit to apply momentum based gradient descent to the quantum data.

import required libraries

`import numpy as np`

from qiskit import QuantumCircuit

from qiskit_aer import Aer

Define objective function and its gradient

`# Objective function and its gradient`

def objective_function(x):

return x**2

def gradient_function(x):

return 2*x

initialize the parameters, circuit parameters and quantum circuit

`x = 2`

learning_rate = 0.1

num_iterations = 10

beta = 0.9 # Momentum coefficient

# Initialize momentum

momentum = 0

# Quantum circuit parameters

num_qubits = 1

shots = 1000

# Initialize quantum circuit

qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

`# Quantum gate for encoding parameter`

def encode_parameter(qc, x):

# Scale the parameter to fit within the range [-1, 1]

scaled_x = (2 * np.clip(x, 0, 1)) - 1

qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)

measurement and simulator block

`# Quantum gate for measuring objective function`

def measure_objective(qc):

qc.measure(0, 0)

# Get simulator backend

simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

`# Run gradient descent iterations`

for i in range(num_iterations):

# Update momentum

momentum = beta * momentum - learning_rate * gradient

# Update parameter using momentum-based gradient descent

x += momentum

# Reset quantum circuit for each iteration

qc.data = []

# Encode parameter into quantum state

encode_parameter(qc, x)

# Measure objective function

measure_objective(qc)

# Execute the circuit and obtain measurement results

job = simulator.run(qc, shots=shots)

result = job.result()

counts = result.get_counts(qc)

# Estimate objective function value

estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

print(f"""Iteration {i+1}: Parameter = {x},

Objective function value = {estimated_value}""")

# Print final result

print(f"Optimized parameter value: {x}")

Results:

`Iteration 1: Parameter = 1.6,`

Objective function value = -1.0

Iteration 2: Parameter = 0.9199999999999999,

Objective function value = -0.848

Iteration 3: Parameter = 0.12399999999999967,

Objective function value = 0.738

Iteration 4: Parameter = -0.6172000000000005,

Objective function value = 1.0

Iteration 5: Parameter = -1.1608400000000008,

Objective function value = 1.0

Iteration 6: Parameter = -1.4179480000000007,

Objective function value = 1.0

Iteration 7: Parameter = -1.3657556000000004,

Objective function value = 1.0

Iteration 8: Parameter = -1.04563132,

Objective function value = 1.0

Iteration 9: Parameter = -0.5483932039999998,

Objective function value = 1.0

Iteration 10: Parameter = 0.008799741200000377,

Objective function value = 0.978

Optimized parameter value: 0.008799741200000377

Now let’s apply the quantum circuit using qiskit with iris dataset.

import required modules

`import numpy as np`

from sklearn.datasets import load_iris

from sklearn.preprocessing import MinMaxScaler

from qiskit import QuantumCircuit

from qiskit_aer import Aer

load the iris data using sklearn

`# Load Iris dataset`

iris = load_iris()

X = iris.data

y = iris.target

Normalize the data using Min Max

`# Normalize feature values`

scaler = MinMaxScaler()

X_normalized = scaler.fit_transform(X)

Define Objective and gradient function

`# Define objective function and its gradient`

def objective_function(w):

# Simple objective function:

# sum of squared differences between actual and predicted class labels

predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)

return np.sum((predicted_labels - y)**2)

def gradient_function(w):

# Gradient of the objective function

predicted_labels = np.dot(X_normalized, w)

error = predicted_labels - np.eye(3)[y]

gradient = 2 * np.dot(X_normalized.T, error)

return gradient

Initialize parameters, quantum parameters and quantum circuit

`# Initialize parameters`

num_features = X_normalized.shape[1]

num_classes = len(np.unique(y))

w = np.random.rand(num_features, num_classes)

learning_rate = 0.01

num_iterations = 10

beta = 0.9 # Momentum coefficient

# Initialize momentum

momentum = np.zeros_like(w)

# Quantum circuit parameters

num_qubits = num_features

shots = 1000

# Initialize quantum circuit

qc = QuantumCircuit(num_qubits, num_qubits)

define encode function and measurement

`# Quantum gate for encoding parameters`

def encode_parameters(qc, w):

for i in range(num_qubits):

for j in range(num_classes):

# Clip values to ensure they fall within [0, 1] range

clipped_value = np.clip(w[i, j], 0, 1)

# Apply arcsin gate

qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function

def measure_objective(qc):

qc.measure_all()

gradient iterations

`# Run gradient descent iterations`

for i in range(num_iterations):

# Compute gradient

gradient = gradient_function(w)

# Update momentum

momentum = beta * momentum - learning_rate * gradient

# Update parameters using momentum-based gradient descent

w += momentum

# Reset quantum circuit for each iteration

qc.data = []

# Encode parameters into quantum state

encode_parameters(qc, w)

# Measure objective function

measure_objective(qc)

# Execute the circuit and obtain measurement results

simulator = Aer.get_backend('qasm_simulator')

job = simulator.run(qc, shots=shots)

result = job.result()

counts = result.get_counts(qc)

# Estimate objective function value

estimated_value = (

counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)

) / shots

print(f'''Iteration {i+1}: Parameters = {w},

Objective function value = {estimated_value}''')

# Print final result

print(f"Optimized parameters: {w}")

Results

`Iteration 1: Parameters = [[-0.29964495 -0.31073152 -0.44592472]`

[ 0.27494478 -0.36202519 -0.42434317]

[-0.39818639 0.12698513 -1.25460153]

[-0.22939201 -0.37804032 -0.41751743]],

Objective function value = 0.0

Iteration 2: Parameters = [[ 0.09679955 0.25386389 0.84328198]

[ 1.31812401 0.13279514 0.28474462]

[-0.19391962 0.78700958 0.36008895]

[-0.06217427 0.20745267 1.24764907]],

Objective function value = 0.0

Iteration 3: Parameters = [[ 0.05246948 0.19013628 0.6460111 ]

[ 2.04158037 0.12429927 -0.20478765]

[-0.49444896 0.74702635 0.29716278]

[-0.4048776 0.06536049 1.2575407 ]],

Objective function value = 0.0

Iteration 4: Parameters = [[-0.25912711 -0.2490117 -0.43751992]

[ 2.36013153 -0.19721495 -1.30189119]

[-1.02072521 0.29829862 -0.80427938]

[-0.97123206 -0.50932289 0.24296033]],

Objective function value = 0.0

Iteration 5: Parameters = [[ 0.07882118 0.25368218 0.69261025]

[ 2.86532938 0.2186069 -0.44663841]

[-0.67821041 0.94029062 0.55071991]

[-0.66433912 -0.01695024 1.68420032]],

Objective function value = 0.0

Iteration 6: Parameters = [[-0.0197069 0.13960546 0.37909869]

[ 2.6538278 0.1183774 -0.56507425]

[-0.69118038 0.90111653 0.19743774]

[-0.70367342 -0.2208152 1.42445917]],

Objective function value = 0.0

Iteration 7: Parameters = [[-0.28416977 -0.20640208 -0.47120535]

[ 2.02407043 -0.18304538 -0.9715226 ]

[-0.78328385 0.61551574 -0.82029383]

[-0.81584144 -0.68395919 0.5034323 ]],

Objective function value = 0.0

Iteration 8: Parameters = [[-0.00496432 0.22863605 0.52302355]

[ 1.70380373 0.13845403 0.166565 ]

[-0.2073544 1.23053489 0.23156122]

[-0.26045133 -0.25942147 1.6479053 ]],

Objective function value = 0.0

Iteration 9: Parameters = [[-0.13294644 0.07529964 0.14575957]

[ 1.10877931 -0.00594372 0.17369808]

[-0.11171671 1.18474846 -0.27100427]

[-0.18824884 -0.50171798 1.24432998]],

Objective function value = 0.0

Iteration 10: Parameters = [[-0.3382088 -0.19755416 -0.50068498]

[ 0.65044262 -0.23019265 -0.13908908]

[-0.17645434 1.00697057 -1.04938551]

[-0.28461234 -0.88095847 0.56827026]],

Objective function value = 0.0

Optimized parameters: [[-0.3382088 -0.19755416 -0.50068498]

[ 0.65044262 -0.23019265 -0.13908908]

[-0.17645434 1.00697057 -1.04938551]

[-0.28461234 -0.88095847 0.56827026]]

**Conclusion:**

Momentum-based Gradient Descent is a powerful optimization algorithm that addresses some of the challenges faced by basic Gradient Descent. Its ability to build up speed and reduce oscillations contributes to faster convergence and more stable training of machine learning models.

update: 26/04/2024 — added plots to highlight the comparison between gradient descent and momentum based gradient descent in the advantages section.

reference: IIT-M course, Wikipedia, google colab, stack exchange, open sources.