# QAlog: Quantum Adaptive Gradient

## How to empower GD with Adaptive Learning Rates?

In the previous articles, we have discussed about Gradient Descent, momentum based gradient descent, Nesterov Accelerated Gradient and respective quantum algorithms.

In this article, we discuss about how to dynamically adjust the learning rate using adaptive gradient.

**Introduction:**

In the previous article, we discussed about SGD. Although SGD overcomes some significant drawbacks of GD, it also exhibits many problems.

One problem is that stochastic gradients usually have a high variance. As a consequence, the opposing directions of stochastic gradients oscillate widely and are not descent directions.

This contrasts with GD, which continually moves towards the optimal value in the direction opposite to the gradients. This article also discusses mini-batch stochastic gradient descent as a potential solution.

Another problem with SGD is that it is incredibly challenging to set up a proper learning rate due to its stochastic nature. In Gradient Descent, we can adaptively set up learning rates using line search methods because the opposing directions of gradients are always descent directions. However, the line search method is not applicable for stochastic gradients. In addition, it is inappropriate to use the same learning rate for all parameters.

In scenarios involving sparse data, where features occur at varying frequencies, it’s unwise to update corresponding variables with a same learning rate. Less frequently occurring features typically necessitate a higher learning rate.

As we discussed and derived the learning rate(α) in the Gradient descent article as shown below,

Upon close observation, the learning rate α shows a direct correlation with the variables *x* and Δ*x*, both of which fluctuate with each iteration. This implies that the learning rate must adjust accordingly with each iteration or with every change in the value of *x*.

Let’s modify the expression for a dynamic learning rate were the gradient changes at each iteration.

If we observe carefully, this approach normalizes by the square of the gradient norm at each iteration as shown below.

*Problem:*

The problem with the square of the gradient norm is “Abrupt shifts in the gradient can lead to significant variations in the learning rate.”

*Solution:*

Accumulating gradient information(means involve previous gradients in the current calculation) helps in stabilizing the learning process. If certain parameters consistently have large gradients, directly using them in the updates can lead to instability.

*How it works:*

Adagrad normalizes by the accumulated sum of the squares of the gradients. Due to which learning rate typically decreases overtime due to accumulation of gradient squares, which helps in more stable and controlled updates.

**Adaptive Gradient(Adagrad):**

Adagrad** **adapts the learning rate dynamically based on information of gradients from previous iterations. which can be expressed as shown below

the update rule can be expressed as

But there are few challenges with the above approach

When using the accumulated sum of squared gradients directly to adjust the learning rate, the learning rate can shrink too rapidly. This is because the sum of squared gradients grows quickly over time, especially with large or consistent gradients. As a result, the learning rate might become excessively small, leading to very slow learning and making it hard for the model to converge efficiently.*Damping Effect:*Squared gradients can become very large, particularly in high-dimensional data or when gradients are frequently large. If we directly use the sum of squared gradients to adjust the learning rate, the learning rate might diminish to a very small value. This would lead to overly cautious updates and can prevent the algorithm from learning effectively.*Normalization:*

** Solution:** Adagrad uses the square root of the accumulated squared gradients to moderate the learning rate adjustment. This slows the growth of the denominator, leading to a more gradual decrease in the learning rate. By growing more slowly than the sum itself, the square root ensures stable and consistent updates, preventing the learning rate from becoming excessively small and maintaining an effective learning process.

To prevent division by zero, the above expression can be modified as shown below

where* ϵ* is a small constant added for numerical stability.

A plot to visualize adagrad with sample data

Let’s take an example understand how adagrad works:

Suppose

xis a single parameter, initialized atx_0= 0, with a learning rate α = 0.1, and the following gradients over time, and let’s denote gradient asgand accumulated gradient asGfor simplicity,

g1= 0.6,g2= 0.4,g3= 0.5

Time step 1:

- Gradient:
*g1*= 0.6 - Accumulated gradient:
*G1*= 0.6² = 0.36 - Adjusted learning rate:

- Parameter update:
*x1*=*x0*− 0.1667 * 0.6 = 0 − 0.1667 * 0.6 = −0.1

Time step 2:

- Gradient:
*g*2 = 0.4 - Accumulated gradient:
*G*2 = 0.36 + 0.4² = 0.52 - Adjusted learning rate:

- Parameter update:
*x2*=*x1*− 0.1387 * 0.4 = — 0.1 − 0.0548 = — 0.1548

Time step 3:

- Gradient:
*g*3 = 0.5 - Accumulated gradient:
*G*3 = 0.52 + 0.5² = 0.77 - Adjusted learning rate:

- Parameter update:
*x*3 =*x2*− 0.1139 * 0.5 = — 0.1548 − 0.0569 = — 0.2117

This process continues iteratively, adjusting the learning rates and updating the parameters based on the accumulated gradients.

*Adagrad Momentum:*

Combining Adagrad with momentum can help to smooth out the updates while adapting the learning rate.

Momentum based gradient descent expression

AdagradMomentum

A plot to visualize adagrad-momentum with sample data

*Nesterov-Adagrad*

similarly, Combining Adagrad with Nesterov accelerated gradient can provide a look-ahead mechanism while adapting the learning rate.

NAG expression

Nesterov-Adagrad expression can be shown as below

A plot to visualize adagrad-NAG with sample data

Let’s see the comparison of Adagrad, Adagrad-momentum, Adagrad-NAG, Adagrad-SGD, Adagrad-minibatch

**Quantum Algorithm:**

We use qiskit to apply gradient descent to the quantum data.

import required libraries

`import numpy as np`

from qiskit import QuantumCircuit

from qiskit_aer import Aer

Define objective function and its gradient

`# Objective function and its gradient`

def objective_function(x):

return x**2

def gradient_function(x):

return 2*x

initialize the parameters, circuit parameters and quantum circuit

`# Initialize parameters`

x = 2

lr = 0.1 # learning rate

epsilon = 1e-8 # Small constant to prevent division by zero

num_iterations = 10

# Adagrad parameters

sum_gradients_sq = 0

# Quantum circuit parameters

num_qubits = 1

shots = 1000

# Initialize quantum circuit

qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

`# Quantum gate for encoding parameter`

def encode_parameter(qc, x):

# Scale the parameter to fit within the range [-1, 1]

scaled_x = (2 * np.clip(x, 0, 1)) - 1

qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)

measurement and simulator block

`# Quantum gate for measuring objective function`

def measure_objective(qc):

qc.measure(0, 0)

# Get simulator backend

simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

`# Run gradient descent iterations`

for i in range(num_iterations):

# Compute gradient

gradient = gradient_function(x)

# Update sum of squared gradients

sum_gradients_sq += gradient ** 2

# Compute adaptive learning rate

adaptive_learning_rate = lr / np.sqrt(sum_gradients_sq + epsilon)

# Update parameter using Adagrad

x -= adaptive_learning_rate * gradient

# Reset quantum circuit for each iteration

qc.data = []

# Encode parameter into quantum state

encode_parameter(qc, x)

# Measure objective function

measure_objective(qc)

# Execute the circuit and obtain measurement results

job = simulator.run(qc, shots=shots)

result = job.result()

counts = result.get_counts(qc)

# Estimate objective function value

estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

print(f"""Iteration {i+1}: Parameter = {x},

Objective function value = {estimated_value}""")

# Print final result

print(f"Optimized parameter value: {x}")

Results:

`Iteration 1: Parameter = 1.90000000003125, `

Objective function value = -1.0

Iteration 2: Parameter = 1.8311250538504986,

Objective function value = -1.0

Iteration 3: Parameter = 1.7758215150562742,

Objective function value = -1.0

Iteration 4: Parameter = 1.7285570378465538,

Objective function value = -1.0

Iteration 5: Parameter = 1.6867615981817918,

Objective function value = -1.0

Iteration 6: Parameter = 1.6489968818891936,

Objective function value = -1.0

Iteration 7: Parameter = 1.6143626690650583,

Objective function value = -1.0

Iteration 8: Parameter = 1.5822515391978071,

Objective function value = -1.0

Iteration 9: Parameter = 1.5522308241118843,

Objective function value = -1.0

Iteration 10: Parameter = 1.5239794475687682,

Objective function value = -1.0

Optimized parameter value: 1.5239794475687682

Now let’s do a comparison between Adagrad, GD, mini-batch GD, Momentum based GD, NAG, SGD as shown below

Animated Image with iterations:

Below is the data generated and use for plots

` Optimizer Iteration Parameter Objective_Function_Value`

0 Gradient Descent 1 1.600000 -1.000

1 Momentum-Based GD 1 0.800000 -0.602

2 Nesterov Accelerated GD 1 0.310000 0.360

3 SGD 1 -0.090000 1.000

4 Mini-batch GD 1 -0.490000 1.000

5 Adagrad 1 -0.590000 1.000

6 Gradient Descent 2 -0.472000 1.000

7 Momentum-Based GD 2 -0.920000 1.000

8 Nesterov Accelerated GD 2 -0.892000 1.000

9 SGD 2 -0.774000 1.000

10 Mini-batch GD 2 -1.174000 1.000

11 Adagrad 2 -1.145705 1.000

12 Gradient Descent 3 -0.916564 1.000

13 Momentum-Based GD 3 -0.810542 1.000

14 Nesterov Accelerated GD 3 -0.671401 1.000

15 SGD 3 -0.442260 1.000

16 Mini-batch GD 3 -0.324260 1.000

17 Adagrad 3 -0.276106 1.000

18 Gradient Descent 4 -0.220884 1.000

19 Momentum-Based GD 4 -0.003942 1.000

20 Nesterov Accelerated GD 4 -0.038720 1.000

21 SGD 4 0.016501 0.976

22 Mini-batch GD 4 0.245642 0.514

23 Adagrad 4 0.257169 0.436

24 Gradient Descent 5 0.205735 0.612

25 Momentum-Based GD 5 0.283561 0.432

26 Nesterov Accelerated GD 5 0.142127 0.672

27 SGD 5 0.090694 0.822

28 Mini-batch GD 5 0.145915 0.730

29 Adagrad 5 0.135239 0.708

30 Gradient Descent 6 0.108191 0.794

31 Momentum-Based GD 6 0.112506 0.764

32 Nesterov Accelerated GD 6 -0.004542 1.000

33 SGD 6 -0.031590 1.000

34 Mini-batch GD 6 -0.083024 1.000

35 Adagrad 6 -0.088629 1.000

36 Gradient Descent 7 -0.070903 1.000

37 Momentum-Based GD 7 -0.034391 1.000

38 Nesterov Accelerated GD 7 -0.106666 1.000

39 SGD 7 -0.088940 1.000

40 Mini-batch GD 7 -0.115988 1.000

41 Adagrad 7 -0.112317 1.000

42 Gradient Descent 8 -0.089853 1.000

43 Momentum-Based GD 8 -0.013757 1.000

44 Nesterov Accelerated GD 8 -0.081293 1.000

45 SGD 8 -0.058830 1.000

46 Mini-batch GD 8 -0.041104 1.000

47 Adagrad 8 -0.036457 1.000

48 Gradient Descent 9 -0.029166 1.000

49 Momentum-Based GD 9 0.049077 0.886

50 Nesterov Accelerated GD 9 -0.033631 1.000

51 SGD 9 -0.026340 1.000

52 Mini-batch GD 9 -0.003877 1.000

53 Adagrad 9 -0.002368 1.000

54 Gradient Descent 10 -0.001895 1.000

55 Momentum-Based GD 10 0.063086 0.864

56 Nesterov Accelerated GD 10 -0.026441 1.000

57 SGD 10 -0.025967 1.000

58 Mini-batch GD 10 -0.018676 1.000

59 Adagrad 10 -0.018578 1.000

**references:** qiskit textbook, open source materials, deep learning course(IIT M).