QAlog: Quantum Adaptive Gradient

How to empower GD with Adaptive Learning Rates?

Anonymousket
9 min read · May 20, 2024

In the previous articles, we discussed Gradient Descent, momentum-based gradient descent, Nesterov Accelerated Gradient, and their respective quantum algorithms.

In this article, we discuss how to dynamically adjust the learning rate using the adaptive gradient method.

Introduction:

In the previous article, we discussed SGD. Although SGD overcomes some significant drawbacks of GD, it also exhibits several problems.

One problem is that stochastic gradients usually have high variance. As a consequence, the directions opposite to the stochastic gradients oscillate widely and are not always descent directions.

This contrasts with GD, which moves steadily towards the optimum in the direction opposite to the gradient. That article also discussed mini-batch stochastic gradient descent as a potential remedy.

Another problem with SGD is that it is incredibly challenging to set a proper learning rate due to its stochastic nature. In Gradient Descent, we can adaptively set the learning rate using line search methods, because the negative gradient direction is always a descent direction. However, line search is not applicable to stochastic gradients. In addition, it is often inappropriate to use the same learning rate for all parameters.

In scenarios involving sparse data, where features occur at varying frequencies, it’s unwise to update the corresponding variables with the same learning rate. Less frequently occurring features typically necessitate a larger learning rate.

Recall the expression for the learning rate (α) that we discussed and derived in the Gradient Descent article.

Upon close observation, the learning rate α depends directly on the variables x and Δx, both of which fluctuate with each iteration. This implies that the learning rate must adjust accordingly with each iteration, i.e., with every change in the value of x.

Let’s modify the expression for a dynamic learning rate, where the gradient changes at each iteration.

If we observe carefully, this approach normalizes by the square of the gradient norm at each iteration as shown below.
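In symbols, one consistent reading of this scheme (using the gradient ∇f(x_t) at iteration t) is:

x_{t+1} = x_t − (α / ‖∇f(x_t)‖²) · ∇f(x_t)

so the effective step size α_t = α / ‖∇f(x_t)‖² is renormalized by the squared gradient norm at every iteration.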

Problem:

The problem with normalizing by the square of the gradient norm is that abrupt shifts in the gradient can lead to significant variations in the learning rate.

Solution:

Accumulating gradient information (i.e., involving previous gradients in the current calculation) helps stabilize the learning process. If certain parameters consistently have large gradients, using those gradients directly in the updates can lead to instability.

How it works:

Adagrad normalizes by the accumulated sum of the squares of the gradients. As a result, the learning rate typically decreases over time as squared gradients accumulate, which helps produce more stable and controlled updates.

Adaptive Gradient (Adagrad):

Adagrad adapts the learning rate dynamically based on gradient information from previous iterations, which can be expressed as shown below.
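With g_t denoting the gradient at iteration t, the accumulated gradient information is the running sum of squared gradients:

G_t = G_{t−1} + g_t², with G_0 = 0 (i.e., G_t is the sum of g_i² over i = 1…t)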

The update rule can then be expressed as
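x_{t+1} = x_t − (α / √G_t) · g_t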

But there are a few challenges with the above approach:

  1. Damping Effect: When using the accumulated sum of squared gradients directly to adjust the learning rate, the learning rate can shrink too rapidly. This is because the sum of squared gradients grows quickly over time, especially with large or consistent gradients. As a result, the learning rate might become excessively small, leading to very slow learning and making it hard for the model to converge efficiently.
  2. Normalization: Squared gradients can become very large, particularly in high-dimensional data or when gradients are frequently large. If we directly use the sum of squared gradients to adjust the learning rate, the learning rate might diminish to a very small value. This would lead to overly cautious updates and can prevent the algorithm from learning effectively.

Solution: Adagrad uses the square root of the accumulated squared gradients to moderate the learning rate adjustment. This slows the growth of the denominator, leading to a more gradual decrease in the learning rate. By growing more slowly than the sum itself, the square root ensures stable and consistent updates, preventing the learning rate from becoming excessively small and maintaining an effective learning process.
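To see the difference, here is a minimal sketch (assuming, purely for illustration, a constant gradient of 1 at every step) comparing division by the accumulated sum with division by its square root:

import numpy as np

lr = 0.1
grads = np.ones(100)            # constant gradient of 1 at every step (illustrative assumption)
G = np.cumsum(grads ** 2)       # accumulated sum of squared gradients

raw_rate = lr / G               # dividing by the sum directly
sqrt_rate = lr / np.sqrt(G)     # Adagrad: dividing by the square root of the sum

print(raw_rate[[0, 9, 99]])     # approx. 0.1, 0.01, 0.001 -> collapses quickly
print(sqrt_rate[[0, 9, 99]])    # approx. 0.1, 0.0316, 0.01 -> decays much more gently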

To prevent division by zero, the above expression can be modified as shown below
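With the stability constant placed inside the square root (matching the code later in this article):

x_{t+1} = x_t − (α / √(G_t + ϵ)) · g_t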

where ϵ is a small constant added for numerical stability.

A plot to visualize adagrad with sample data

Animated Adagrad plot

Let’s take an example to understand how Adagrad works:

Suppose x is a single parameter, initialized at x_0 = 0, with a learning rate α = 0.1, and the following gradients over time. For simplicity, let’s denote the gradient as g and the accumulated gradient as G:

g1 = 0.6, g2 = 0.4, g3 = 0.5

Time step 1:

  • Gradient: g1 ​= 0.6
  • Accumulated gradient: G1 = 0.6² = 0.36
  • Adjusted learning rate: 0.1 / √0.36 ≈ 0.1667
  • Parameter update: x1 = x0 − 0.1667 * 0.6 = 0 − 0.1 = −0.1

Time step 2:

  • Gradient: g2​ = 0.4
  • Accumulated gradient: G2 = 0.36 + 0.4² = 0.52
  • Adjusted learning rate: 0.1 / √0.52 ≈ 0.1387
  • Parameter update: x2 = x1 − 0.1387 * 0.4 = −0.1 − 0.0555 = −0.1555

Time step 3:

  • Gradient: g3​ = 0.5
  • Accumulated gradient: G3 = 0.52 + 0.5² = 0.77
  • Adjusted learning rate: 0.1 / √0.77 ≈ 0.1139
  • Parameter update: x3 = x2 − 0.1139 * 0.5 = −0.1555 − 0.0570 = −0.2125

This process continues iteratively, adjusting the learning rates and updating the parameters based on the accumulated gradients.
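The same three steps can be checked in a few lines of NumPy (a minimal sketch of the update loop above, without the ϵ term):

import numpy as np

x, lr, G = 0.0, 0.1, 0.0
for g in [0.6, 0.4, 0.5]:
    G += g ** 2                   # accumulate squared gradients
    x -= (lr / np.sqrt(G)) * g    # Adagrad update
    print(f"G = {G:.2f}, x = {x:.4f}")
# G = 0.36, x = -0.1000
# G = 0.52, x = -0.1555
# G = 0.77, x = -0.2125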

Adagrad Momentum:

Combining Adagrad with momentum can help to smooth out the updates while adapting the learning rate.

Momentum based gradient descent expression
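In standard form, with momentum coefficient γ:

v_t = γ · v_{t−1} + α · ∇f(x_t)
x_{t+1} = x_t − v_t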

Adagrad-momentum expression
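One natural way to write the combination (an illustrative form; the exact expression in the original figure may differ) is to apply the Adagrad scaling to the gradient term inside the momentum update:

G_t = G_{t−1} + g_t²
v_t = γ · v_{t−1} + (α / √(G_t + ϵ)) · g_t
x_{t+1} = x_t − v_t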

A plot to visualize adagrad-momentum with sample data

animated adagrad-momentum

Nesterov-Adagrad:

Similarly, combining Adagrad with Nesterov Accelerated Gradient can provide a look-ahead mechanism while adapting the learning rate.

NAG expression
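In standard form, NAG evaluates the gradient at the look-ahead point x_t − γ · v_{t−1}:

v_t = γ · v_{t−1} + α · ∇f(x_t − γ · v_{t−1})
x_{t+1} = x_t − v_t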

The Nesterov-Adagrad expression can then be written as shown below.
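Assuming the same combination style as above (again an illustrative form), the Adagrad scaling is applied to the look-ahead gradient:

g_t = ∇f(x_t − γ · v_{t−1})
G_t = G_{t−1} + g_t²
v_t = γ · v_{t−1} + (α / √(G_t + ϵ)) · g_t
x_{t+1} = x_t − v_t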

A plot to visualize adagrad-NAG with sample data

animated adagrad-nag

Let’s compare Adagrad, Adagrad-momentum, Adagrad-NAG, Adagrad-SGD, and Adagrad-minibatch.

animated comparison

Quantum Algorithm:

We use Qiskit to apply Adagrad-based gradient descent to the quantum data.

Import the required libraries

import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer

Define objective function and its gradient

# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient_function(x):
    return 2*x

Initialize the parameters, circuit parameters, and quantum circuit

# Initialize parameters
x = 2
lr = 0.1 # learning rate
epsilon = 1e-8 # Small constant to prevent division by zero
num_iterations = 10

# Adagrad parameters
sum_gradients_sq = 0

# Quantum circuit parameters
num_qubits = 1
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)

Define encoding block

# Quantum gate for encoding the parameter
def encode_parameter(qc, x):
    # Clip the parameter to [0, 1], then rescale to [-1, 1]
    scaled_x = (2 * np.clip(x, 0, 1)) - 1
    # RY rotation whose probability of measuring 1 equals the clipped parameter
    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)
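Since sin²(θ/2) = (scaled_x + 1)/2 = clip(x, 0, 1) for the rotation angle θ used here, this encoding prepares a state whose probability of measuring 1 equals the clipped parameter value.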

Measurement and simulator block

# Quantum gate for measuring the objective function
def measure_objective(qc):
    qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')

Gradient descent iteration

# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient
    gradient = gradient_function(x)

    # Update sum of squared gradients
    sum_gradients_sq += gradient ** 2

    # Compute adaptive learning rate
    adaptive_learning_rate = lr / np.sqrt(sum_gradients_sq + epsilon)

    # Update parameter using Adagrad
    x -= adaptive_learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameter into quantum state
    encode_parameter(qc, x)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit and obtain measurement results
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate <Z> from the counts as a proxy for the objective value
    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

    print(f"""Iteration {i+1}: Parameter = {x},
    Objective function value = {estimated_value}""")

# Print final result
print(f"Optimized parameter value: {x}")

Results:

Iteration 1: Parameter = 1.90000000003125, 
Objective function value = -1.0
Iteration 2: Parameter = 1.8311250538504986,
Objective function value = -1.0
Iteration 3: Parameter = 1.7758215150562742,
Objective function value = -1.0
Iteration 4: Parameter = 1.7285570378465538,
Objective function value = -1.0
Iteration 5: Parameter = 1.6867615981817918,
Objective function value = -1.0
Iteration 6: Parameter = 1.6489968818891936,
Objective function value = -1.0
Iteration 7: Parameter = 1.6143626690650583,
Objective function value = -1.0
Iteration 8: Parameter = 1.5822515391978071,
Objective function value = -1.0
Iteration 9: Parameter = 1.5522308241118843,
Objective function value = -1.0
Iteration 10: Parameter = 1.5239794475687682,
Objective function value = -1.0

Optimized parameter value: 1.5239794475687682
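Note that the objective readings stay pinned at −1.0: the parameter never drops below 1 during these ten iterations, so np.clip(x, 0, 1) saturates at 1, the RY angle is π, and every shot measures 1, giving (counts['0'] − counts['1'])/shots = −1.0. The estimate would only begin to move once x entered the range [0, 1].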

Now let’s do a comparison between Adagrad, GD, mini-batch GD, momentum-based GD, NAG, and SGD, as shown below.

Animated Image with iterations:

animated image

Below is the data generated and used for the plots.

    Optimizer                 Iteration  Parameter  Objective_Function_Value
0   Gradient Descent                  1   1.600000                  -1.000
1   Momentum-Based GD                 1   0.800000                  -0.602
2   Nesterov Accelerated GD           1   0.310000                   0.360
3   SGD                               1  -0.090000                   1.000
4   Mini-batch GD                     1  -0.490000                   1.000
5   Adagrad                           1  -0.590000                   1.000
6   Gradient Descent                  2  -0.472000                   1.000
7   Momentum-Based GD                 2  -0.920000                   1.000
8   Nesterov Accelerated GD           2  -0.892000                   1.000
9   SGD                               2  -0.774000                   1.000
10  Mini-batch GD                     2  -1.174000                   1.000
11  Adagrad                           2  -1.145705                   1.000
12  Gradient Descent                  3  -0.916564                   1.000
13  Momentum-Based GD                 3  -0.810542                   1.000
14  Nesterov Accelerated GD           3  -0.671401                   1.000
15  SGD                               3  -0.442260                   1.000
16  Mini-batch GD                     3  -0.324260                   1.000
17  Adagrad                           3  -0.276106                   1.000
18  Gradient Descent                  4  -0.220884                   1.000
19  Momentum-Based GD                 4  -0.003942                   1.000
20  Nesterov Accelerated GD           4  -0.038720                   1.000
21  SGD                               4   0.016501                   0.976
22  Mini-batch GD                     4   0.245642                   0.514
23  Adagrad                           4   0.257169                   0.436
24  Gradient Descent                  5   0.205735                   0.612
25  Momentum-Based GD                 5   0.283561                   0.432
26  Nesterov Accelerated GD           5   0.142127                   0.672
27  SGD                               5   0.090694                   0.822
28  Mini-batch GD                     5   0.145915                   0.730
29  Adagrad                           5   0.135239                   0.708
30  Gradient Descent                  6   0.108191                   0.794
31  Momentum-Based GD                 6   0.112506                   0.764
32  Nesterov Accelerated GD           6  -0.004542                   1.000
33  SGD                               6  -0.031590                   1.000
34  Mini-batch GD                     6  -0.083024                   1.000
35  Adagrad                           6  -0.088629                   1.000
36  Gradient Descent                  7  -0.070903                   1.000
37  Momentum-Based GD                 7  -0.034391                   1.000
38  Nesterov Accelerated GD           7  -0.106666                   1.000
39  SGD                               7  -0.088940                   1.000
40  Mini-batch GD                     7  -0.115988                   1.000
41  Adagrad                           7  -0.112317                   1.000
42  Gradient Descent                  8  -0.089853                   1.000
43  Momentum-Based GD                 8  -0.013757                   1.000
44  Nesterov Accelerated GD           8  -0.081293                   1.000
45  SGD                               8  -0.058830                   1.000
46  Mini-batch GD                     8  -0.041104                   1.000
47  Adagrad                           8  -0.036457                   1.000
48  Gradient Descent                  9  -0.029166                   1.000
49  Momentum-Based GD                 9   0.049077                   0.886
50  Nesterov Accelerated GD           9  -0.033631                   1.000
51  SGD                               9  -0.026340                   1.000
52  Mini-batch GD                     9  -0.003877                   1.000
53  Adagrad                           9  -0.002368                   1.000
54  Gradient Descent                 10  -0.001895                   1.000
55  Momentum-Based GD                10   0.063086                   0.864
56  Nesterov Accelerated GD          10  -0.026441                   1.000
57  SGD                              10  -0.025967                   1.000
58  Mini-batch GD                    10  -0.018676                   1.000
59  Adagrad                          10  -0.018578                   1.000

References: Qiskit textbook, open-source materials, deep learning course (IIT M).
