# QAlog: Quantum NAG

## How to think effectively before a step?

In the previous article, we explored the limitations of gradient descent and how conventional momentum-based gradient descent addresses them.

In this article, we delve into limitations of vanilla momentum gradient descent and examine how Nesterov gradient descent tackles some of these obstacles along with a quantum algorithm.

**Introduction:**

Before diving into NAG, let’s revisit the fundamental concept of momentum based gradient descent. At its core, it involves accumulating the past gradients to give the optimization process a sense of *inertia*.

In other words, momentum helps the optimization algorithm build up speed in the directions where the gradient consistently points. This tends in faster convergence and smoother navigation through complex loss landscapes.

However, despite its effectiveness, vanilla momentum gradient descent is not without its limitations, following are the limitations which we walk through one by one

*Inertia Effect:*

Vanilla momentum-based gradient descent can suffer from the inertia effect, where the accumulated momentum causes the optimizer to continue moving in the same direction, even when it’s close to the minimum of the loss function. This can result in slower convergence or oscillations around the minimum.

*Initial Movement:*At the beginning of the animation, you’ll see the optimizer (depicted as a red point) starting from an initial position and moving towards the minimum of the loss function.*Momentum Effect:*As the optimizer progresses, you’ll notice that it gains momentum, moving faster and overshooting the minimum due to the accumulated momentum. This overshooting is a manifestation of the inertia effect.*Oscillations:*After overshooting, the optimizer might start oscillating around the minimum. This oscillation occurs because the momentum carries it past the minimum, and then the gradient pulls it back. This back-and-forth movement illustrates the inertia effect causing slower convergence or oscillations around the minimum.*Convergence:*Eventually, the oscillations dampen, and the optimizer converges to the minimum of the loss function.

*Increased Sensitivity to Learning Rate:*

Vanilla momentum-based gradient descent is more sensitive to the learning rate. Setting an inappropriate learning rate can lead to overshooting, oscillations, or divergence.

*Potential for Suboptimal Convergence:*

Vanilla momentum may exhibit suboptimal convergence behavior in certain scenarios, particularly when dealing with highly non-convex or ill-conditioned optimization landscapes.

Observe how the trajectory moves through the landscape. In some regions, especially around steep valleys or plateaus, the trajectory may display irregular patterns, such as oscillations or overshooting. This behavior indicates suboptimal convergence, where the algorithm struggles to efficiently find the global minimum.

Nesterov introduces a concept called *look ahead *to address these issues.

**Nesterov Accelerated Gradient Descent (NAG):**

Nesterov Accelerated Gradient Descent is an optimization algorithm that builds upon the concept of momentum in gradient descent.

In the momentum-based gradient descent article, we updated the parameters using the accumulated gradient velocity.

However, Nesterov’s insight was to compute the gradient not at the current position but at a position slightly ahead in the direction of momentum.

So, by combining the current position with the momentum and computing the gradient, we realize Nesterov’s vision for optimization.

The look-ahead mechanism allows the optimizer to “peek” ahead and evaluate the gradient at a position that accounts for the momentum.

By doing so, NAG can adjust the model parameters more effectively, potentially avoiding overshooting and improving convergence.

NAG harnesses momentum information more efficiently, resulting in faster convergence, greater stability, and better performance in navigating optimization landscapes.

*Mathematical understanding:*

The inclusion of a look a head term to the Gradient Descent can be expressed as follows

where *β* is the momentum parameter 0 < *β* < 1, where *v_t* is the velocity vector at iteration *t *and* x_t *is the step at iteration* t.*

v_*lookahead* represents the “lookahead” position where we evaluate the gradient.

the modified gradient can be expressed as

This lookahead allows the algorithm to anticipate where it will be in the next step and adjust the update accordingly, leading to faster convergence.

Now, let’s update to the final step with velocity vector as shown below

**Advantages:**

Nesterov Accelerated Gradient Descent (NAG) addresses some of the drawbacks of vanilla momentum-based gradient descent by incorporating a “look ahead” mechanism, which helps to alleviate the issues associated with momentum. Here’s how NAG addresses these drawbacks:

*Overshooting:*

NAG reduces overshooting by adjusting the gradient calculation based on the current momentum’s direction. Instead of blindly following the momentum, NAG calculates the gradient not at the current parameter values but at the “look ahead” position, which is ahead in the direction of the momentum. This adjustment allows NAG to anticipate the effect of momentum and adjust the gradient accordingly, reducing overshooting.

*Increased Sensitivity to Learning Rate:*

NAG is less sensitive to the learning rate compared to vanilla momentum-based gradient descent. The “look ahead” mechanism helps in stabilizing convergence by effectively adjusting the step size based on the momentum’s direction. This allows for more robust convergence across a wider range of learning rates.

*Lack of Extra “Look Ahead”:*

NAG incorporates the “look ahead” term, which provides additional information about the gradient’s direction at a future position. By considering this information, NAG can make more informed decisions about the gradient descent step, leading to more stable and efficient convergence, especially in scenarios with high curvature or noisy gradients.

the dotted lines represent the look a head feature of NAG which corrects the path.

python code for momentum based gradient descent

`import numpy as np`

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

a quadratic function

`def objective_function(x):`

return x**2

Gradient of the loss function or vector with partial derivatives

`def gradient(x):`

return 2*x

define Nesterov Accelerated Gradient Descent optimization function

`def nesterov_gradient_descent(initial_position, learning_rate, `

momentum, num_iterations):

# Initialize parameters

current_position = initial_position

velocity = 0

# Lists to store the trajectory for plotting

positions = [initial_position]

# Optimization loop

for i in range(num_iterations):

# Calculate the momentum update

v_lookahead = current_position + momentum * velocity

velocity = momentum * velocity - learning_rate * gradient(v_lookahead)

# Update the parameters using the momentum

current_position += velocity

# Store the current position

positions.append(current_position)

return np.array(positions)

Get the required parameters

`# Set parameters`

initial_position = 3.0

learning_rate = 0.1

momentum = 0.9

num_iterations = 10

trajectory = nesterov_gradient_descent(initial_position, learning_rate, momentum, num_iterations)

Visualize the 3D loss landscape and optimization trajectory

`# Plot the objective function in 3D`

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

x = np.linspace(-5, 5, 100)

y = objective_function(x)

ax.plot(x, y, zs=0, zdir='y', label='Objective Function')

# Plot the trajectory of the optimization algorithm

z_trajectory = objective_function(trajectory)

ax.scatter(trajectory, z_trajectory, zs=0, zdir='y', color='red', label='Optimization Trajectory')

# Annotate the starting point

ax.scatter(initial_position, objective_function(initial_position), zs=0, zdir='y', color='green', label='Initial Position')

# Add labels and legend

ax.set_xlabel('X')

ax.set_ylabel('Objective Function Value')

ax.set_zlabel('f(X)')

ax.set_title('Nesterov Accelerated Gradient Descent')

ax.legend()

# Show plot

plt.show()

plot the graph

below is the 2d plot for the momentum based gradient descent

below is the animated graph.

**Quantum Algorithm:**

We use qiskit to apply NAG to the quantum data.

import required libraries

`import numpy as np`

from qiskit import QuantumCircuit

from qiskit_aer import Aer

Define objective function and its gradient

`# Objective function and its gradient`

def objective_function(x):

return x**2

def gradient_function(x):

return 2*x

initialize the parameters, circuit parameters and quantum circuit

`x = 2`

learning_rate = 0.1

num_iterations = 10

beta = 0.9 # Momentum coefficient

# Initialize momentum

momentum = 0

# Quantum circuit parameters

num_qubits = 1

shots = 1000

# Initialize quantum circuit

qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

`# Quantum gate for encoding parameter`

def encode_parameter(qc, x):

# Scale the parameter to fit within the range [-1, 1]

scaled_x = (2 * np.clip(x, 0, 1)) - 1

qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)

measurement and simulator block

`# Quantum gate for measuring objective function`

def measure_objective(qc):

qc.measure(0, 0)

# Get simulator backend

simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

`# Run NAG iterations`

for i in range(num_iterations):

# Compute gradient at the lookahead position

lookahead_position = x + beta * momentum

lookahead_gradient = gradient_function(lookahead_position)

# Update momentum

momentum = beta * momentum - learning_rate * lookahead_gradient

# Update parameter using NAG

x += momentum

# Reset quantum circuit for each iteration

qc.data = []

# Encode parameter into quantum state

encode_parameter(qc, x)

# Measure objective function

measure_objective(qc)

# Execute the circuit and obtain measurement results

job = simulator.run(qc, shots=shots)

result = job.result()

counts = result.get_counts(qc)

# Estimate objective function value

estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

print(f"""Iteration {i+1}: Parameter = {x},

Objective function value = {estimated_value}""")

# Print final result

print(f"Optimized parameter value: {x}")

Results:

`Iteration 1: Parameter = 1.6, `

Objective function value = -1.0

Iteration 2: Parameter = 0.992,

Objective function value = -0.98

Iteration 3: Parameter = 0.35583999999999993,

Objective function value = 0.33

Iteration 4: Parameter = -0.17336320000000005,

Objective function value = 1.0

Iteration 5: Parameter = -0.5197168640000001,

Objective function value = 1.0

Iteration 6: Parameter = -0.6651481292800001,

Objective function value = 1.0

Iteration 7: Parameter = -0.6368290144256001,

Objective function value = 1.0

Iteration 8: Parameter = -0.4890734488453121,

Objective function value = 1.0

Iteration 9: Parameter = -0.2848747518584423,

Objective function value = 1.0

Iteration 10: Parameter = -0.08087673965620756,

Objective function value = 1.0

Optimized parameter value: -0.08087673965620756

Now let’s apply the quantum circuit using qiskit with iris dataset.

import required modules

`import numpy as np`

from sklearn.datasets import load_iris

from sklearn.preprocessing import MinMaxScaler

from qiskit import QuantumCircuit

from qiskit_aer import Aer

load the iris data using sklearn

`# Load Iris dataset`

iris = load_iris()

X = iris.data

y = iris.target

Normalize the data using Min Max

`# Normalize feature values`

scaler = MinMaxScaler()

X_normalized = scaler.fit_transform(X)

Define Objective and gradient function

`# Define objective function and its gradient`

def objective_function(w):

# Simple objective function:

# sum of squared differences between actual and predicted class labels

predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)

return np.sum((predicted_labels - y)**2)

def gradient_function(w):

# Gradient of the objective function

predicted_labels = np.dot(X_normalized, w)

error = predicted_labels - np.eye(3)[y]

gradient = 2 * np.dot(X_normalized.T, error)

return gradient

Initialize parameters, quantum parameters and quantum circuit

`# Initialize parameters`

num_features = X_normalized.shape[1]

num_classes = len(np.unique(y))

w = np.random.rand(num_features, num_classes)

learning_rate = 0.01

num_iterations = 10

beta = 0.9 # Momentum coefficient

# Initialize momentum

momentum = np.zeros_like(w)

# Quantum circuit parameters

num_qubits = num_features

shots = 1000

# Initialize quantum circuit

qc = QuantumCircuit(num_qubits, num_qubits)

define encode function and measurement

`# Quantum gate for encoding parameters`

def encode_parameters(qc, w):

for i in range(num_qubits):

for j in range(num_classes):

# Clip values to ensure they fall within [0, 1] range

clipped_value = np.clip(w[i, j], 0, 1)

# Apply arcsin gate

qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function

def measure_objective(qc):

qc.measure_all()

NAG iterations

`# Run NAG iterations`

for i in range(num_iterations):

# Compute lookahead gradient

lookahead_position = w + beta * momentum

lookahead_gradient = gradient_function(lookahead_position)

# Update momentum

momentum = beta * momentum - learning_rate * lookahead_gradient

# Update parameters using NAG with lookahead

w += momentum

# Reset quantum circuit for each iteration

qc.data = []

# Encode parameters into quantum state

encode_parameters(qc, w)

# Measure objective function

measure_objective(qc)

# Execute the circuit and obtain measurement results

simulator = Aer.get_backend('qasm_simulator')

job = simulator.run(qc, shots=shots)

result = job.result()

counts = result.get_counts(qc)

# Estimate objective function value

estimated_value = (

counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)

) / shots

print(f"""Iteration {i+1}: Parameter = {w},

Objective function value = {estimated_value}""")

# Print final result

print(f"Optimized parameters: {w}")

Results

`Iteration 1: Parameter = [[-0.24374827 -0.9061288 0.24130184]`

[-0.11497702 -0.35900118 -0.35002827]

[-0.78395872 -1.07266658 -0.4334284 ]

[-0.82933315 -0.67215263 0.16625303]],

Objective function value = 0.0

Iteration 2: Parameter = [[2.94848702 4.12393329 1.89087207]

[3.09152589 3.47562979 0.90089741]

[2.66152912 4.74246409 1.51012043]

[2.58849168 5.04187833 2.15980021]],

Objective function value = 0.0

Iteration 3: Parameter = [[-10.24967873 -16.4462836 -5.07660408]

[ -6.94205584 -13.23759166 -4.74236007]

[-12.47630043 -18.53836232 -6.33772821]

[-12.52430114 -18.29937927 -5.59224451]],

Objective function value = 0.0

Iteration 4: Parameter = [[42.97477125 66.8492697 22.85002257]

[36.19368516 53.56477598 17.69974329]

[47.97646051 76.2295617 25.46897623]

[47.75598556 76.05693244 26.2183139 ]],

Objective function value = 0.0

Iteration 5: Parameter = [[-172.88437268 -270.51550942 -90.59222732]

[-137.09066725 -217.55014439 -73.39482852]

[-197.33484109 -307.16645076 -103.42592698]

[-196.89830376 -306.49077917 -102.22883087]],

Objective function value = 0.0

Iteration 6: Parameter = [[ 701.22066949 1096.15045193 368.60201619]

[ 565.21094197 880.52569039 295.64155648]

[ 796.34835613 1246.34199744 418.5635158 ]

[ 794.10833193 1242.62700928 418.46329661]],

Objective function value = 0.0

Iteration 7: Parameter = [[-2839.89396057 -4439.81386737 -1491.83002666]

[-2280.06302665 -3567.41908943 -1199.0537926 ]

[-3228.55836735 -5046.12584959 -1696.09330263]

[-3219.95817656 -5033.11621468 -1690.38413142]],

Objective function value = 0.0

Iteration 8: Parameter = [[11504.0220842 17985.18479166 6044.01496836]

[ 9244.72696626 14450.41176348 4855.86793812]

[13075.82777369 20443.66619801 6869.66441944]

[13040.51144571 20387.78874118 6852.41998805]],

Objective function value = 0.0

Iteration 9: Parameter = [[-46600.21585552 -72853.28319439 -24482.28965385]

[-37440.29088386 -58535.4615944 -19670.98433709]

[-52969.08643013 -82809.14630454 -27828.49543544]

[-52826.53011265 -82587.1706104 -27752.16841873]],

Objective function value = 0.0

Iteration 10: Parameter = [[188766.41148909 295112.54149541 99172.46452635]

[151669.52466075 237113.72083839 79681.86069589]

[214564.04204116 335444.8535549 112725.55869321]

[213986.0269348 334540.08579815 112423.44267571]],

Objective function value = 0.0

Optimized parameters: [[188766.41148909 295112.54149541 99172.46452635]

[151669.52466075 237113.72083839 79681.86069589]

[214564.04204116 335444.8535549 112725.55869321]

[213986.0269348 334540.08579815 112423.44267571]]

**Conclusion:**Nesterov Accelerated Gradient Descent (NAG) improves upon the drawbacks of traditional momentum-based gradient descent methods by incorporating a correction term that reduces overshooting, improves convergence near saddle points, and reduces sensitivity to the choice of hyperparameters like the momentum parameter. This makes NAG a preferred choice in many optimization scenarios, especially in training deep neural networks.

references: IIT-M course, Wikipedia, google colab, stack exchange, open sources.