QAlog: Quantum NAG

How to think effectively before a step?

Anonymousket
10 min readMay 3, 2024

In the previous article, we explored the limitations of gradient descent and how conventional momentum-based gradient descent addresses them.

In this article, we delve into limitations of vanilla momentum gradient descent and examine how Nesterov gradient descent tackles some of these obstacles along with a quantum algorithm.

Introduction:

Before diving into NAG, let’s revisit the fundamental concept of momentum based gradient descent. At its core, it involves accumulating the past gradients to give the optimization process a sense of inertia.

In other words, momentum helps the optimization algorithm build up speed in the directions where the gradient consistently points. This tends in faster convergence and smoother navigation through complex loss landscapes.

However, despite its effectiveness, vanilla momentum gradient descent is not without its limitations, following are the limitations which we walk through one by one

Inertia Effect:

Vanilla momentum-based gradient descent can suffer from the inertia effect, where the accumulated momentum causes the optimizer to continue moving in the same direction, even when it’s close to the minimum of the loss function. This can result in slower convergence or oscillations around the minimum.

this is an animated image.
  1. Initial Movement: At the beginning of the animation, you’ll see the optimizer (depicted as a red point) starting from an initial position and moving towards the minimum of the loss function.
  2. Momentum Effect: As the optimizer progresses, you’ll notice that it gains momentum, moving faster and overshooting the minimum due to the accumulated momentum. This overshooting is a manifestation of the inertia effect.
  3. Oscillations: After overshooting, the optimizer might start oscillating around the minimum. This oscillation occurs because the momentum carries it past the minimum, and then the gradient pulls it back. This back-and-forth movement illustrates the inertia effect causing slower convergence or oscillations around the minimum.
  4. Convergence: Eventually, the oscillations dampen, and the optimizer converges to the minimum of the loss function.

Increased Sensitivity to Learning Rate:

Vanilla momentum-based gradient descent is more sensitive to the learning rate. Setting an inappropriate learning rate can lead to overshooting, oscillations, or divergence.

this is an animated image.

Potential for Suboptimal Convergence:

Vanilla momentum may exhibit suboptimal convergence behavior in certain scenarios, particularly when dealing with highly non-convex or ill-conditioned optimization landscapes.

this is an animated image.

Observe how the trajectory moves through the landscape. In some regions, especially around steep valleys or plateaus, the trajectory may display irregular patterns, such as oscillations or overshooting. This behavior indicates suboptimal convergence, where the algorithm struggles to efficiently find the global minimum.

Nesterov introduces a concept called look ahead to address these issues.

Nesterov Accelerated Gradient Descent (NAG):

Nesterov Accelerated Gradient Descent is an optimization algorithm that builds upon the concept of momentum in gradient descent.

In the momentum-based gradient descent article, we updated the parameters using the accumulated gradient velocity.

However, Nesterov’s insight was to compute the gradient not at the current position but at a position slightly ahead in the direction of momentum.

So, by combining the current position with the momentum and computing the gradient, we realize Nesterov’s vision for optimization.

image created using paint.

The look-ahead mechanism allows the optimizer to “peek” ahead and evaluate the gradient at a position that accounts for the momentum.

By doing so, NAG can adjust the model parameters more effectively, potentially avoiding overshooting and improving convergence.

NAG harnesses momentum information more efficiently, resulting in faster convergence, greater stability, and better performance in navigating optimization landscapes.

Mathematical understanding:

The inclusion of a look a head term to the Gradient Descent can be expressed as follows

where β is the momentum parameter 0 < β < 1, where v_t is the velocity vector at iteration t and x_t is the step at iteration t.

v_lookahead represents the “lookahead” position where we evaluate the gradient.

the modified gradient can be expressed as

This lookahead allows the algorithm to anticipate where it will be in the next step and adjust the update accordingly, leading to faster convergence.

Now, let’s update to the final step with velocity vector as shown below

Advantages:

Nesterov Accelerated Gradient Descent (NAG) addresses some of the drawbacks of vanilla momentum-based gradient descent by incorporating a “look ahead” mechanism, which helps to alleviate the issues associated with momentum. Here’s how NAG addresses these drawbacks:

Overshooting:

NAG reduces overshooting by adjusting the gradient calculation based on the current momentum’s direction. Instead of blindly following the momentum, NAG calculates the gradient not at the current parameter values but at the “look ahead” position, which is ahead in the direction of the momentum. This adjustment allows NAG to anticipate the effect of momentum and adjust the gradient accordingly, reducing overshooting.

Increased Sensitivity to Learning Rate:

NAG is less sensitive to the learning rate compared to vanilla momentum-based gradient descent. The “look ahead” mechanism helps in stabilizing convergence by effectively adjusting the step size based on the momentum’s direction. This allows for more robust convergence across a wider range of learning rates.

Lack of Extra “Look Ahead”:

NAG incorporates the “look ahead” term, which provides additional information about the gradient’s direction at a future position. By considering this information, NAG can make more informed decisions about the gradient descent step, leading to more stable and efficient convergence, especially in scenarios with high curvature or noisy gradients.

the dotted lines represent the look a head feature of NAG which corrects the path.

python code for momentum based gradient descent

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

a quadratic function

def objective_function(x):
return x**2

Gradient of the loss function or vector with partial derivatives

def gradient(x):
return 2*x

define Nesterov Accelerated Gradient Descent optimization function

def nesterov_gradient_descent(initial_position, learning_rate, 
momentum, num_iterations):
# Initialize parameters
current_position = initial_position
velocity = 0

# Lists to store the trajectory for plotting
positions = [initial_position]

# Optimization loop
for i in range(num_iterations):
# Calculate the momentum update
v_lookahead = current_position + momentum * velocity
velocity = momentum * velocity - learning_rate * gradient(v_lookahead)

# Update the parameters using the momentum
current_position += velocity

# Store the current position
positions.append(current_position)

return np.array(positions)

Get the required parameters

# Set parameters
initial_position = 3.0
learning_rate = 0.1
momentum = 0.9
num_iterations = 10

trajectory = nesterov_gradient_descent(initial_position, learning_rate, momentum, num_iterations)

Visualize the 3D loss landscape and optimization trajectory

# Plot the objective function in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

x = np.linspace(-5, 5, 100)
y = objective_function(x)
ax.plot(x, y, zs=0, zdir='y', label='Objective Function')

# Plot the trajectory of the optimization algorithm
z_trajectory = objective_function(trajectory)
ax.scatter(trajectory, z_trajectory, zs=0, zdir='y', color='red', label='Optimization Trajectory')

# Annotate the starting point
ax.scatter(initial_position, objective_function(initial_position), zs=0, zdir='y', color='green', label='Initial Position')

# Add labels and legend
ax.set_xlabel('X')
ax.set_ylabel('Objective Function Value')
ax.set_zlabel('f(X)')
ax.set_title('Nesterov Accelerated Gradient Descent')
ax.legend()

# Show plot
plt.show()

plot the graph

below is the 2d plot for the momentum based gradient descent

below is the animated graph.

this is an animated image.

Quantum Algorithm:

We use qiskit to apply NAG to the quantum data.

import required libraries

import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer

Define objective function and its gradient

# Objective function and its gradient
def objective_function(x):
return x**2

def gradient_function(x):
return 2*x

initialize the parameters, circuit parameters and quantum circuit

x = 2
learning_rate = 0.1
num_iterations = 10
beta = 0.9 # Momentum coefficient

# Initialize momentum
momentum = 0
# Quantum circuit parameters
num_qubits = 1
shots = 1000
# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

# Quantum gate for encoding parameter
def encode_parameter(qc, x):
# Scale the parameter to fit within the range [-1, 1]
scaled_x = (2 * np.clip(x, 0, 1)) - 1
qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)

measurement and simulator block

# Quantum gate for measuring objective function
def measure_objective(qc):
qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

# Run NAG iterations
for i in range(num_iterations):
# Compute gradient at the lookahead position
lookahead_position = x + beta * momentum
lookahead_gradient = gradient_function(lookahead_position)

# Update momentum
momentum = beta * momentum - learning_rate * lookahead_gradient

# Update parameter using NAG
x += momentum

# Reset quantum circuit for each iteration
qc.data = []

# Encode parameter into quantum state
encode_parameter(qc, x)

# Measure objective function
measure_objective(qc)

# Execute the circuit and obtain measurement results
job = simulator.run(qc, shots=shots)
result = job.result()
counts = result.get_counts(qc)

# Estimate objective function value
estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

print(f"""Iteration {i+1}: Parameter = {x},
Objective function value = {estimated_value}""")

# Print final result
print(f"Optimized parameter value: {x}")

Results:

Iteration 1: Parameter = 1.6, 
Objective function value = -1.0
Iteration 2: Parameter = 0.992,
Objective function value = -0.98
Iteration 3: Parameter = 0.35583999999999993,
Objective function value = 0.33
Iteration 4: Parameter = -0.17336320000000005,
Objective function value = 1.0
Iteration 5: Parameter = -0.5197168640000001,
Objective function value = 1.0
Iteration 6: Parameter = -0.6651481292800001,
Objective function value = 1.0
Iteration 7: Parameter = -0.6368290144256001,
Objective function value = 1.0
Iteration 8: Parameter = -0.4890734488453121,
Objective function value = 1.0
Iteration 9: Parameter = -0.2848747518584423,
Objective function value = 1.0
Iteration 10: Parameter = -0.08087673965620756,
Objective function value = 1.0

Optimized parameter value: -0.08087673965620756

Now let’s apply the quantum circuit using qiskit with iris dataset.

import required modules

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from qiskit import QuantumCircuit
from qiskit_aer import Aer

load the iris data using sklearn

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

Normalize the data using Min Max

# Normalize feature values
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

Define Objective and gradient function

# Define objective function and its gradient
def objective_function(w):
# Simple objective function:
# sum of squared differences between actual and predicted class labels
predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)
return np.sum((predicted_labels - y)**2)

def gradient_function(w):
# Gradient of the objective function
predicted_labels = np.dot(X_normalized, w)
error = predicted_labels - np.eye(3)[y]
gradient = 2 * np.dot(X_normalized.T, error)
return gradient

Initialize parameters, quantum parameters and quantum circuit

# Initialize parameters
num_features = X_normalized.shape[1]
num_classes = len(np.unique(y))
w = np.random.rand(num_features, num_classes)
learning_rate = 0.01
num_iterations = 10
beta = 0.9 # Momentum coefficient

# Initialize momentum
momentum = np.zeros_like(w)
# Quantum circuit parameters
num_qubits = num_features
shots = 1000
# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)

define encode function and measurement

# Quantum gate for encoding parameters
def encode_parameters(qc, w):
for i in range(num_qubits):
for j in range(num_classes):
# Clip values to ensure they fall within [0, 1] range
clipped_value = np.clip(w[i, j], 0, 1)
# Apply arcsin gate
qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function
def measure_objective(qc):
qc.measure_all()

NAG iterations

# Run NAG iterations
for i in range(num_iterations):
# Compute lookahead gradient
lookahead_position = w + beta * momentum
lookahead_gradient = gradient_function(lookahead_position)

# Update momentum
momentum = beta * momentum - learning_rate * lookahead_gradient

# Update parameters using NAG with lookahead
w += momentum

# Reset quantum circuit for each iteration
qc.data = []

# Encode parameters into quantum state
encode_parameters(qc, w)

# Measure objective function
measure_objective(qc)

# Execute the circuit and obtain measurement results
simulator = Aer.get_backend('qasm_simulator')
job = simulator.run(qc, shots=shots)
result = job.result()
counts = result.get_counts(qc)

# Estimate objective function value
estimated_value = (
counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)
) / shots

print(f"""Iteration {i+1}: Parameter = {w},
Objective function value = {estimated_value}""")

# Print final result
print(f"Optimized parameters: {w}")

Results

Iteration 1: Parameter = [[-0.24374827 -0.9061288   0.24130184]
[-0.11497702 -0.35900118 -0.35002827]
[-0.78395872 -1.07266658 -0.4334284 ]
[-0.82933315 -0.67215263 0.16625303]],
Objective function value = 0.0
Iteration 2: Parameter = [[2.94848702 4.12393329 1.89087207]
[3.09152589 3.47562979 0.90089741]
[2.66152912 4.74246409 1.51012043]
[2.58849168 5.04187833 2.15980021]],
Objective function value = 0.0
Iteration 3: Parameter = [[-10.24967873 -16.4462836 -5.07660408]
[ -6.94205584 -13.23759166 -4.74236007]
[-12.47630043 -18.53836232 -6.33772821]
[-12.52430114 -18.29937927 -5.59224451]],
Objective function value = 0.0
Iteration 4: Parameter = [[42.97477125 66.8492697 22.85002257]
[36.19368516 53.56477598 17.69974329]
[47.97646051 76.2295617 25.46897623]
[47.75598556 76.05693244 26.2183139 ]],
Objective function value = 0.0
Iteration 5: Parameter = [[-172.88437268 -270.51550942 -90.59222732]
[-137.09066725 -217.55014439 -73.39482852]
[-197.33484109 -307.16645076 -103.42592698]
[-196.89830376 -306.49077917 -102.22883087]],
Objective function value = 0.0
Iteration 6: Parameter = [[ 701.22066949 1096.15045193 368.60201619]
[ 565.21094197 880.52569039 295.64155648]
[ 796.34835613 1246.34199744 418.5635158 ]
[ 794.10833193 1242.62700928 418.46329661]],
Objective function value = 0.0
Iteration 7: Parameter = [[-2839.89396057 -4439.81386737 -1491.83002666]
[-2280.06302665 -3567.41908943 -1199.0537926 ]
[-3228.55836735 -5046.12584959 -1696.09330263]
[-3219.95817656 -5033.11621468 -1690.38413142]],
Objective function value = 0.0
Iteration 8: Parameter = [[11504.0220842 17985.18479166 6044.01496836]
[ 9244.72696626 14450.41176348 4855.86793812]
[13075.82777369 20443.66619801 6869.66441944]
[13040.51144571 20387.78874118 6852.41998805]],
Objective function value = 0.0
Iteration 9: Parameter = [[-46600.21585552 -72853.28319439 -24482.28965385]
[-37440.29088386 -58535.4615944 -19670.98433709]
[-52969.08643013 -82809.14630454 -27828.49543544]
[-52826.53011265 -82587.1706104 -27752.16841873]],
Objective function value = 0.0
Iteration 10: Parameter = [[188766.41148909 295112.54149541 99172.46452635]
[151669.52466075 237113.72083839 79681.86069589]
[214564.04204116 335444.8535549 112725.55869321]
[213986.0269348 334540.08579815 112423.44267571]],
Objective function value = 0.0

Optimized parameters: [[188766.41148909 295112.54149541 99172.46452635]
[151669.52466075 237113.72083839 79681.86069589]
[214564.04204116 335444.8535549 112725.55869321]
[213986.0269348 334540.08579815 112423.44267571]]

Conclusion:
Nesterov Accelerated Gradient Descent (NAG) improves upon the drawbacks of traditional momentum-based gradient descent methods by incorporating a correction term that reduces overshooting, improves convergence near saddle points, and reduces sensitivity to the choice of hyperparameters like the momentum parameter. This makes NAG a preferred choice in many optimization scenarios, especially in training deep neural networks.

references: IIT-M course, Wikipedia, google colab, stack exchange, open sources.

--

--