# QAlog: Quantum Momentum Based Gradient Descent

## How to escape local minima?

11 min readApr 27, 2024

In the previous article, we have discussed about the gradient descent algorithm and it’s quantum implementation.

In this article, we discuss about the momentum based Gradient Descent algorithm, which overcomes some of the problems of gradient descent algorithm.

Introduction:

Gradient Descent is an optimization algorithm used to minimize the cost or loss function of a model during training. While Vanilla Gradient Descent is effective, it sometimes faces challenges like slow convergence, oscillations, and difficulties in escaping local minima.

In other ways, Gradient Descent takes a lot of time to navigate regions having a gentle slope, because the gradient in these regions is very small.

Slow Convergence:

Gradient descent can converge slowly, especially in regions with shallow gradients or plateaus. In such cases, the updates to the parameters can be very small, leading to slow progress towards the minimum of the loss function.

Sensitivity to Learning Rate:

The choice of learning rate (α) in gradient descent is critical. A learning rate that is too large can cause the optimization process to overshoot the minimum and diverge, while a learning rate that is too small can result in slow convergence.

Oscillations and Zigzagging:

In regions with high curvature or sharp turns, gradient descent may oscillate or zigzag around the minimum rather than converging smoothly. This can prolong the optimization process and make it less stable.

Local Minima and Saddle Points:

Gradient descent is prone to getting stuck in local minima or saddle points, especially in high-dimensional spaces. These stationary points can hinder progress towards the global minimum of the loss function.

Noisy or Sparse Gradients:

In scenarios where gradients exhibit noise or sparsity, the effectiveness of gradient descent in achieving convergence may be compromised. Noisy gradients introduce inconsistency, resulting in unpredictable updates. On the other hand, sparse gradients offer only partial information, making it difficult to accurately discern the descent direction. Consequently, gradient descent encounters challenges in efficiently navigating the parameter space, hindering its ability to converge towards the optimal solution.

Vanishing or Exploding Gradients:

In deep neural networks, gradient descent can suffer from vanishing or exploding gradients, especially in deep architectures. Vanishing gradients lead to slow training, while exploding gradients can cause the optimization process to diverge.

Momentum-based Gradient Descent, introduces a concept called momentum to address these issues.

As the gradient consistently advances in a singular direction through successive iterations, in accordance with the principles of Newtonian physics, the entity is poised to acquire increased inertia, facilitating swifter movement. By adhering to this principle, maintaining a consistent direction fosters confidence, encouraging the adoption of bigger steps over smaller ones.

we have seen the expression for gradient descent as shown below

Momentum in Optimization:

Momentum is inspired by physics and involves accumulating the past gradients to give the optimization process a sense of inertia. In other words, momentum helps the optimization algorithm build up speed in the directions where the gradient consistently points. This tends in faster convergence and smoother navigation through complex loss landscapes.

The inclusion of a momentum term to the Gradient Descent can be expressed as follows

where β is the momentum parameter 0 < β < 1, where v_t is the velocity vector at iteration t. The velocity vector is a weighted average of the past gradients, where the weight of the past gradients decays exponentially with the factor β. The weight vector is updated by adding the velocity vector to the previous weight vector.

here if we assume, β as mass and v as velocity, then we can consider β * v (mass * velocity) which is momentum(p), p=m⋅v.

when the value of β is zero, then the expression is same as gradient descent.

Mathematical interpretation:

Let’s check the how much previous velocity contributes.

if we replace with the previous velocity

by expanding we get

now further replace the velocity vector with the previous value

by expanding we get

we can get the general form as shown below

where v_0​ is the initial velocity vector, which is usually set to zero. Assuming that β<1, we can see that the first momentum term​ vanishes as t increases, and the velocity vector becomes a weighted sum of the past gradients. The weight of the past gradients decreases exponentially with the factor β, giving more importance to the recent gradients.

we can also say this as exponential decay.

Accelerated Convergence:

Traditional gradient descent often converges slowly, especially in regions of high curvature or near saddle points.

Momentum helps to accelerate convergence by accumulating gradients from past iterations, enabling faster progress towards the minimum.

Shallow Local Minima:

Gradient descent can get stuck in shallow local minima and struggle to escape due to the small step sizes it takes.

Momentum allows the optimizer to traverse through such regions more effectively, enabling it to escape shallow local minima and continue searching for better solutions.

Learning Rate Sensitivity:

Traditional gradient descent requires careful tuning of the learning rate, which can be time-consuming and challenging, especially for complex optimization landscapes.

Momentum reduces the sensitivity to the learning rate, allowing for more robust and stable optimization across a wide range of learning rates.

Escaping Plateaus:

Gradient descent struggles to escape plateaus, where the gradient magnitude is very small, leading to slow progress.

Momentum allows the optimizer to maintain velocity even on flat regions, facilitating faster traversal through plateaus.

python code for momentum based gradient descent

`import numpy as npimport matplotlib.pyplot as pltfrom mpl_toolkits.mplot3d import Axes3D`

`def loss_function(x, y): return x**2 + y**2`

Gradient of the loss function or vector with partial derivatives

`def gradient(x, y): return np.array([2 * x, 2 * y])`

`def momentum_gradient_descent(learning_rate, beta, epochs):    x = np.array([2.0, 2.0])  # Initial parameters    v = np.zeros_like(x)    trajectory = [x]    for epoch in range(epochs):        grad = gradient(x[0], x[1])        # Update velocity using momentum        v = beta * v -learning_rate * grad        x = x + v        trajectory.append(x)    return np.array(trajectory)`

Get the required parameters

`# Set parameterslearning_rate = 0.1beta = 0.9epochs = 50# Run Momentum-based Gradient Descenttrajectory = momentum_gradient_descent(learning_rate, beta, epochs)x = np.linspace(-3, 3, 100)y = np.linspace(-3, 3, 100)X, Y = np.meshgrid(x, y)Z = loss_function(X, Y)`

Visualize the 3D loss landscape and optimization trajectory

`fig = plt.figure(figsize=(10, 8))ax = fig.add_subplot(111, projection='3d')ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)ax.plot(    trajectory[:, 0],     trajectory[:, 1],     loss_function(      trajectory[:, 0],       trajectory[:, 1]    ),    marker='o',     color='red',     label='Momentum-based GD Trajectory')ax.set_xlabel('Parameter 1')ax.set_ylabel('Parameter 2')ax.set_zlabel('Loss')ax.set_title('Momentum-based Gradient Descent in 3D Loss Landscape')ax.legend()plt.show()`

plot the graph

below is the 2d plot for the momentum based gradient descent

below is the animated graph generated epoch wise.

Quantum Algorithm:

We use qiskit to apply momentum based gradient descent to the quantum data.

import required libraries

`import numpy as npfrom qiskit import QuantumCircuitfrom qiskit_aer import Aer`

Define objective function and its gradient

`# Objective function and its gradientdef objective_function(x):    return x**2def gradient_function(x):    return 2*x`

initialize the parameters, circuit parameters and quantum circuit

`x = 2learning_rate = 0.1num_iterations = 10beta = 0.9  # Momentum coefficient# Initialize momentummomentum = 0# Quantum circuit parametersnum_qubits = 1shots = 1000# Initialize quantum circuitqc = QuantumCircuit(num_qubits, num_qubits)`

Define encoding block

`# Quantum gate for encoding parameterdef encode_parameter(qc, x):    # Scale the parameter to fit within the range [-1, 1]    scaled_x = (2 * np.clip(x, 0, 1)) - 1    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)`

measurement and simulator block

`# Quantum gate for measuring objective functiondef measure_objective(qc):    qc.measure(0, 0)# Get simulator backendsimulator = Aer.get_backend('qasm_simulator')`

`# Run gradient descent iterationsfor i in range(num_iterations):    # Update momentum    momentum = beta * momentum - learning_rate * gradient    # Update parameter using momentum-based gradient descent    x += momentum    # Reset quantum circuit for each iteration    qc.data = []    # Encode parameter into quantum state    encode_parameter(qc, x)    # Measure objective function    measure_objective(qc)    # Execute the circuit and obtain measurement results    job = simulator.run(qc, shots=shots)    result = job.result()    counts = result.get_counts(qc)    # Estimate objective function value    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots    print(f"""Iteration {i+1}: Parameter = {x},            Objective function value = {estimated_value}""")# Print final resultprint(f"Optimized parameter value: {x}")`

Results:

`Iteration 1: Parameter = 1.6,            Objective function value = -1.0Iteration 2: Parameter = 0.9199999999999999,            Objective function value = -0.848Iteration 3: Parameter = 0.12399999999999967,            Objective function value = 0.738Iteration 4: Parameter = -0.6172000000000005,            Objective function value = 1.0Iteration 5: Parameter = -1.1608400000000008,            Objective function value = 1.0Iteration 6: Parameter = -1.4179480000000007,            Objective function value = 1.0Iteration 7: Parameter = -1.3657556000000004,            Objective function value = 1.0Iteration 8: Parameter = -1.04563132,            Objective function value = 1.0Iteration 9: Parameter = -0.5483932039999998,            Objective function value = 1.0Iteration 10: Parameter = 0.008799741200000377,            Objective function value = 0.978Optimized parameter value: 0.008799741200000377`

Now let’s apply the quantum circuit using qiskit with iris dataset.

import required modules

`import numpy as npfrom sklearn.datasets import load_irisfrom sklearn.preprocessing import MinMaxScalerfrom qiskit import QuantumCircuitfrom qiskit_aer import Aer`

load the iris data using sklearn

`# Load Iris datasetiris = load_iris()X = iris.datay = iris.target`

Normalize the data using Min Max

`# Normalize feature valuesscaler = MinMaxScaler()X_normalized = scaler.fit_transform(X)`

Define Objective and gradient function

`# Define objective function and its gradientdef objective_function(w):    # Simple objective function:     # sum of squared differences between actual and predicted class labels    predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)    return np.sum((predicted_labels - y)**2)def gradient_function(w):    # Gradient of the objective function    predicted_labels = np.dot(X_normalized, w)    error = predicted_labels - np.eye(3)[y]    gradient = 2 * np.dot(X_normalized.T, error)    return gradient`

Initialize parameters, quantum parameters and quantum circuit

`# Initialize parametersnum_features = X_normalized.shape[1]num_classes = len(np.unique(y))w = np.random.rand(num_features, num_classes)  learning_rate = 0.01num_iterations = 10beta = 0.9  # Momentum coefficient# Initialize momentummomentum = np.zeros_like(w)# Quantum circuit parametersnum_qubits = num_featuresshots = 1000# Initialize quantum circuitqc = QuantumCircuit(num_qubits, num_qubits)`

define encode function and measurement

`# Quantum gate for encoding parametersdef encode_parameters(qc, w):    for i in range(num_qubits):        for j in range(num_classes):            # Clip values to ensure they fall within [0, 1] range            clipped_value = np.clip(w[i, j], 0, 1)            # Apply arcsin gate            qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)# Quantum gate for measuring objective functiondef measure_objective(qc):    qc.measure_all()`

`# Run gradient descent iterationsfor i in range(num_iterations):    # Compute gradient    gradient = gradient_function(w)        # Update momentum    momentum = beta * momentum - learning_rate * gradient    # Update parameters using momentum-based gradient descent    w += momentum    # Reset quantum circuit for each iteration    qc.data = []    # Encode parameters into quantum state    encode_parameters(qc, w)    # Measure objective function    measure_objective(qc)    # Execute the circuit and obtain measurement results    simulator = Aer.get_backend('qasm_simulator')    job = simulator.run(qc, shots=shots)    result = job.result()    counts = result.get_counts(qc)    # Estimate objective function value    estimated_value = (        counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)        ) / shots    print(f'''Iteration {i+1}: Parameters = {w},            Objective function value = {estimated_value}''')# Print final resultprint(f"Optimized parameters: {w}")`

Results

`Iteration 1: Parameters = [[-0.29964495 -0.31073152 -0.44592472] [ 0.27494478 -0.36202519 -0.42434317] [-0.39818639  0.12698513 -1.25460153] [-0.22939201 -0.37804032 -0.41751743]],            Objective function value = 0.0Iteration 2: Parameters = [[ 0.09679955  0.25386389  0.84328198] [ 1.31812401  0.13279514  0.28474462] [-0.19391962  0.78700958  0.36008895] [-0.06217427  0.20745267  1.24764907]],            Objective function value = 0.0Iteration 3: Parameters = [[ 0.05246948  0.19013628  0.6460111 ] [ 2.04158037  0.12429927 -0.20478765] [-0.49444896  0.74702635  0.29716278] [-0.4048776   0.06536049  1.2575407 ]],            Objective function value = 0.0Iteration 4: Parameters = [[-0.25912711 -0.2490117  -0.43751992] [ 2.36013153 -0.19721495 -1.30189119] [-1.02072521  0.29829862 -0.80427938] [-0.97123206 -0.50932289  0.24296033]],            Objective function value = 0.0Iteration 5: Parameters = [[ 0.07882118  0.25368218  0.69261025] [ 2.86532938  0.2186069  -0.44663841] [-0.67821041  0.94029062  0.55071991] [-0.66433912 -0.01695024  1.68420032]],            Objective function value = 0.0Iteration 6: Parameters = [[-0.0197069   0.13960546  0.37909869] [ 2.6538278   0.1183774  -0.56507425] [-0.69118038  0.90111653  0.19743774] [-0.70367342 -0.2208152   1.42445917]],            Objective function value = 0.0Iteration 7: Parameters = [[-0.28416977 -0.20640208 -0.47120535] [ 2.02407043 -0.18304538 -0.9715226 ] [-0.78328385  0.61551574 -0.82029383] [-0.81584144 -0.68395919  0.5034323 ]],            Objective function value = 0.0Iteration 8: Parameters = [[-0.00496432  0.22863605  0.52302355] [ 1.70380373  0.13845403  0.166565  ] [-0.2073544   1.23053489  0.23156122] [-0.26045133 -0.25942147  1.6479053 ]],            Objective function value = 0.0Iteration 9: Parameters = [[-0.13294644  0.07529964  0.14575957] [ 1.10877931 -0.00594372  0.17369808] [-0.11171671  1.18474846 -0.27100427] [-0.18824884 -0.50171798  1.24432998]],            Objective function value = 0.0Iteration 10: Parameters = [[-0.3382088  -0.19755416 -0.50068498] [ 0.65044262 -0.23019265 -0.13908908] [-0.17645434  1.00697057 -1.04938551] [-0.28461234 -0.88095847  0.56827026]],            Objective function value = 0.0Optimized parameters: [[-0.3382088  -0.19755416 -0.50068498] [ 0.65044262 -0.23019265 -0.13908908] [-0.17645434  1.00697057 -1.04938551] [-0.28461234 -0.88095847  0.56827026]]`

Conclusion:

Momentum-based Gradient Descent is a powerful optimization algorithm that addresses some of the challenges faced by basic Gradient Descent. Its ability to build up speed and reduce oscillations contributes to faster convergence and more stable training of machine learning models.

update: 26/04/2024 — added plots to highlight the comparison between gradient descent and momentum based gradient descent in the advantages section.

reference: IIT-M course, Wikipedia, google colab, stack exchange, open sources.