QAlog: Quantum Adaptive Gradient
How to empower GD with Adaptive Learning Rates?
In the previous articles, we discussed Gradient Descent, momentum-based gradient descent, Nesterov Accelerated Gradient, and their respective quantum algorithms.
In this article, we discuss how to dynamically adjust the learning rate using the adaptive gradient (Adagrad) method.
Introduction:
In the previous article, we discussed SGD. Although SGD overcomes some significant drawbacks of GD, it introduces problems of its own.
One problem is that stochastic gradients usually have high variance. As a consequence, the negative stochastic-gradient directions oscillate widely and are not always descent directions.
This contrasts with GD, which moves steadily toward the optimum in the direction opposite to the gradient. Mini-batch stochastic gradient descent, which this article also includes in its comparisons, is one way to reduce this variance.
Another problem with SGD is that it is challenging to choose a proper learning rate, owing to its stochastic nature. In Gradient Descent, we can set learning rates adaptively using line-search methods, because the negative gradient directions are always descent directions. Line search, however, is not applicable to stochastic gradients. In addition, it is often inappropriate to use the same learning rate for all parameters.
In scenarios involving sparse data, where features occur at varying frequencies, it is unwise to update the corresponding variables with the same learning rate: less frequently occurring features typically call for a larger learning rate.
Recall the learning rate (α) we derived in the Gradient Descent article.
Upon close observation, that learning rate depends directly on the variables x and Δx, both of which change at every iteration. This implies that the learning rate must be adjusted at each iteration, i.e., with every change in the value of x.
Let's modify the expression to obtain a dynamic learning rate in which the gradient changes at each iteration.
If we observe carefully, this approach normalizes by the square of the gradient norm at each iteration, as shown below.
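The original equation image is not reproduced here; reconstructed from the description, the update with this dynamic learning rate reads:

x_{t+1} = x_t - \frac{\alpha}{\|\nabla f(x_t)\|^2} \, \nabla f(x_t)

so the effective step length is α/‖∇f(x_t)‖: it shrinks when the gradient is large and grows when it is small.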
Problem:
The problem with normalizing by the square of the gradient norm: abrupt shifts in the gradient can lead to large variations in the learning rate.
Solution:
Accumulating gradient information (that is, involving previous gradients in the current calculation) helps stabilize the learning process. If certain parameters consistently have large gradients, using those gradients directly in the updates can lead to instability.
How it works:
Adagrad normalizes by the accumulated sum of the squares of the gradients. As a result, the learning rate typically decreases over time as squared gradients accumulate, which yields more stable and controlled updates.
Adaptive Gradient (Adagrad):
Adagrad adapts the learning rate dynamically based on gradient information from previous iterations, which can be expressed as shown below.
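Writing g_t = ∇f(x_t) and denoting the accumulator by G_t (as in the worked example below), a standard formulation is:

G_t = G_{t-1} + g_t^2 = \sum_{\tau=1}^{t} g_\tau^2

with the square taken element-wise, so each parameter keeps its own accumulator; this is what lets rarely occurring (sparse) features retain a larger effective learning rate.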
The update rule can then be expressed as:
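Reconstructed from context, substituting the accumulator G_t for the squared gradient norm gives:

x_{t+1} = x_t - \frac{\alpha}{G_t} \, g_t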
But there are a few challenges with the above approach:
- Damping Effect: When using the accumulated sum of squared gradients directly to adjust the learning rate, the learning rate can shrink too rapidly. This is because the sum of squared gradients grows quickly over time, especially with large or consistent gradients. As a result, the learning rate might become excessively small, leading to very slow learning and making it hard for the model to converge efficiently.
- Normalization: Squared gradients can become very large, particularly in high-dimensional data or when gradients are frequently large. If we directly use the sum of squared gradients to adjust the learning rate, the learning rate might diminish to a very small value. This would lead to overly cautious updates and can prevent the algorithm from learning effectively.
Solution: Adagrad uses the square root of the accumulated squared gradients to moderate the learning rate adjustment. This slows the growth of the denominator, leading to a more gradual decrease in the learning rate. By growing more slowly than the sum itself, the square root ensures stable and consistent updates, preventing the learning rate from becoming excessively small and maintaining an effective learning process.
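In symbols, the square-root form of the update is:

x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t}} \, g_t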
To prevent division by zero, the above expression can be modified as shown below
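x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \, g_t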
where ϵ is a small constant added for numerical stability. (Some texts place ϵ outside the square root; the code later in this article adds it inside, matching the expression above.)
A plot to visualize Adagrad with sample data
Let's take an example to understand how Adagrad works.
Suppose x is a single parameter, initialized at x_0 = 0, with learning rate α = 0.1 and the gradients below over three time steps. For simplicity, denote the gradient by g and the accumulated squared gradient by G:
g1 = 0.6, g2 = 0.4, g3 = 0.5
Time step 1:
- Gradient: g1 = 0.6
- Accumulated gradient: G1 = 0.6² = 0.36
- Adjusted learning rate: 0.1 / √0.36 ≈ 0.1667
- Parameter update: x1 = x0 − 0.1667 * 0.6 = 0 − 0.1 = −0.1
Time step 2:
- Gradient: g2 = 0.4
- Accumulated gradient: G2 = 0.36 + 0.4² = 0.52
- Adjusted learning rate: 0.1 / √0.52 ≈ 0.1387
- Parameter update: x2 = x1 − 0.1387 * 0.4 = −0.1 − 0.0555 = −0.1555
Time step 3:
- Gradient: g3 = 0.5
- Accumulated gradient: G3 = 0.52 + 0.5² = 0.77
- Adjusted learning rate: 0.1 / √0.77 ≈ 0.1140
- Parameter update: x3 = x2 − 0.1140 * 0.5 = −0.1555 − 0.0570 = −0.2125
This process continues iteratively, adjusting the learning rates and updating the parameters based on the accumulated gradients.
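To double-check the arithmetic, here is a minimal NumPy sketch (variable names are illustrative, not from any library) that reproduces the three steps:

import numpy as np

alpha, eps = 0.1, 1e-8    # base learning rate and stability constant
x, G = 0.0, 0.0           # parameter and accumulated squared gradients
for t, g in enumerate([0.6, 0.4, 0.5], start=1):
    G += g ** 2                        # accumulate squared gradient
    lr_t = alpha / np.sqrt(G + eps)    # adjusted learning rate
    x -= lr_t * g                      # Adagrad update
    print(f"step {t}: G = {G:.2f}, lr = {lr_t:.4f}, x = {x:.4f}")
# step 1: G = 0.36, lr = 0.1667, x = -0.1000
# step 2: G = 0.52, lr = 0.1387, x = -0.1555
# step 3: G = 0.77, lr = 0.1140, x = -0.2125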
Adagrad Momentum:
Combining Adagrad with momentum can help to smooth out the updates while adapting the learning rate.
Recall the momentum-based gradient descent expression:
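A standard form (the original image is not reproduced here), with momentum coefficient γ and velocity v_t:

v_t = \gamma v_{t-1} + \alpha \, g_t
x_{t+1} = x_t - v_t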
The Adagrad-momentum update:
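As a sketch of one natural combination (the exact expression from the original image may differ), the gradient term in the velocity is scaled by the Adagrad factor:

G_t = G_{t-1} + g_t^2
v_t = \gamma v_{t-1} + \frac{\alpha}{\sqrt{G_t + \epsilon}} \, g_t
x_{t+1} = x_t - v_t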
A plot to visualize Adagrad-momentum with sample data
Nesterov-Adagrad:
Similarly, combining Adagrad with Nesterov Accelerated Gradient provides a look-ahead mechanism while adapting the learning rate.
Recall the NAG expression:
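In its standard form, NAG evaluates the gradient at the look-ahead point x_t - γ v_{t-1}:

v_t = \gamma v_{t-1} + \alpha \, \nabla f(x_t - \gamma v_{t-1})
x_{t+1} = x_t - v_t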
The Nesterov-Adagrad expression can then be written as shown below.
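Again as a sketch (the original image may differ), the look-ahead gradient is accumulated and Adagrad-scaled:

\tilde{g}_t = \nabla f(x_t - \gamma v_{t-1})
G_t = G_{t-1} + \tilde{g}_t^2
v_t = \gamma v_{t-1} + \frac{\alpha}{\sqrt{G_t + \epsilon}} \, \tilde{g}_t
x_{t+1} = x_t - v_t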
A plot to visualize Adagrad-NAG with sample data
Let's see a comparison of Adagrad, Adagrad-momentum, Adagrad-NAG, Adagrad-SGD, and Adagrad-minibatch:
Quantum Algorithm:
We use Qiskit to apply Adagrad-based gradient descent to quantum data.
Import the required libraries:
import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer
Define the objective function and its gradient:
# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient_function(x):
    return 2*x
Initialize the parameters, circuit parameters, and quantum circuit:
# Initialize parameters
x = 2
lr = 0.1 # learning rate
epsilon = 1e-8 # Small constant to prevent division by zero
num_iterations = 10
# Adagrad parameters
sum_gradients_sq = 0
# Quantum circuit parameters
num_qubits = 1
shots = 1000
# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
Define the encoding block:
# Quantum gate for encoding the parameter
def encode_parameter(qc, x):
    # Clip x to [0, 1], then rescale to the range [-1, 1]
    scaled_x = (2 * np.clip(x, 0, 1)) - 1
    # Choose the rotation angle so that P(|1>) = (scaled_x + 1) / 2
    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)
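Why this encoding works: R_y(θ)|0⟩ = cos(θ/2)|0⟩ + sin(θ/2)|1⟩, so with θ = 2·arcsin(√((scaled_x + 1)/2)) the probability of measuring |1⟩ is exactly (scaled_x + 1)/2, and ⟨Z⟩ = P(0) − P(1) = −scaled_x. This also explains the results below: the parameter stays above 1 throughout, is clipped to 1 (scaled_x = 1), and the estimated value is therefore −1 at every iteration.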
Measurement and simulator block:
# Quantum gate for measuring the objective function
def measure_objective(qc):
    qc.measure(0, 0)

# Get the simulator backend
simulator = Aer.get_backend('qasm_simulator')
Run the gradient descent iterations:
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient
    gradient = gradient_function(x)
    # Update sum of squared gradients
    sum_gradients_sq += gradient ** 2
    # Compute adaptive learning rate
    adaptive_learning_rate = lr / np.sqrt(sum_gradients_sq + epsilon)
    # Update parameter using Adagrad
    x -= adaptive_learning_rate * gradient
    # Reset quantum circuit for each iteration
    qc.data = []
    # Encode parameter into quantum state
    encode_parameter(qc, x)
    # Measure objective function
    measure_objective(qc)
    # Execute the circuit and obtain measurement results
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)
    # Estimate <Z> = (N_0 - N_1) / shots from the counts
    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots
    print(f"""Iteration {i+1}: Parameter = {x},
Objective function value = {estimated_value}""")

# Print final result
print(f"Optimized parameter value: {x}")
Results:
Iteration 1: Parameter = 1.90000000003125,
Objective function value = -1.0
Iteration 2: Parameter = 1.8311250538504986,
Objective function value = -1.0
Iteration 3: Parameter = 1.7758215150562742,
Objective function value = -1.0
Iteration 4: Parameter = 1.7285570378465538,
Objective function value = -1.0
Iteration 5: Parameter = 1.6867615981817918,
Objective function value = -1.0
Iteration 6: Parameter = 1.6489968818891936,
Objective function value = -1.0
Iteration 7: Parameter = 1.6143626690650583,
Objective function value = -1.0
Iteration 8: Parameter = 1.5822515391978071,
Objective function value = -1.0
Iteration 9: Parameter = 1.5522308241118843,
Objective function value = -1.0
Iteration 10: Parameter = 1.5239794475687682,
Objective function value = -1.0
Optimized parameter value: 1.5239794475687682
Now let's compare Adagrad against GD, mini-batch GD, momentum-based GD, NAG, and SGD, as shown below:
Animated Image with iterations:
Below is the data generated and used for the plots:
Optimizer Iteration Parameter Objective_Function_Value
0 Gradient Descent 1 1.600000 -1.000
1 Momentum-Based GD 1 0.800000 -0.602
2 Nesterov Accelerated GD 1 0.310000 0.360
3 SGD 1 -0.090000 1.000
4 Mini-batch GD 1 -0.490000 1.000
5 Adagrad 1 -0.590000 1.000
6 Gradient Descent 2 -0.472000 1.000
7 Momentum-Based GD 2 -0.920000 1.000
8 Nesterov Accelerated GD 2 -0.892000 1.000
9 SGD 2 -0.774000 1.000
10 Mini-batch GD 2 -1.174000 1.000
11 Adagrad 2 -1.145705 1.000
12 Gradient Descent 3 -0.916564 1.000
13 Momentum-Based GD 3 -0.810542 1.000
14 Nesterov Accelerated GD 3 -0.671401 1.000
15 SGD 3 -0.442260 1.000
16 Mini-batch GD 3 -0.324260 1.000
17 Adagrad 3 -0.276106 1.000
18 Gradient Descent 4 -0.220884 1.000
19 Momentum-Based GD 4 -0.003942 1.000
20 Nesterov Accelerated GD 4 -0.038720 1.000
21 SGD 4 0.016501 0.976
22 Mini-batch GD 4 0.245642 0.514
23 Adagrad 4 0.257169 0.436
24 Gradient Descent 5 0.205735 0.612
25 Momentum-Based GD 5 0.283561 0.432
26 Nesterov Accelerated GD 5 0.142127 0.672
27 SGD 5 0.090694 0.822
28 Mini-batch GD 5 0.145915 0.730
29 Adagrad 5 0.135239 0.708
30 Gradient Descent 6 0.108191 0.794
31 Momentum-Based GD 6 0.112506 0.764
32 Nesterov Accelerated GD 6 -0.004542 1.000
33 SGD 6 -0.031590 1.000
34 Mini-batch GD 6 -0.083024 1.000
35 Adagrad 6 -0.088629 1.000
36 Gradient Descent 7 -0.070903 1.000
37 Momentum-Based GD 7 -0.034391 1.000
38 Nesterov Accelerated GD 7 -0.106666 1.000
39 SGD 7 -0.088940 1.000
40 Mini-batch GD 7 -0.115988 1.000
41 Adagrad 7 -0.112317 1.000
42 Gradient Descent 8 -0.089853 1.000
43 Momentum-Based GD 8 -0.013757 1.000
44 Nesterov Accelerated GD 8 -0.081293 1.000
45 SGD 8 -0.058830 1.000
46 Mini-batch GD 8 -0.041104 1.000
47 Adagrad 8 -0.036457 1.000
48 Gradient Descent 9 -0.029166 1.000
49 Momentum-Based GD 9 0.049077 0.886
50 Nesterov Accelerated GD 9 -0.033631 1.000
51 SGD 9 -0.026340 1.000
52 Mini-batch GD 9 -0.003877 1.000
53 Adagrad 9 -0.002368 1.000
54 Gradient Descent 10 -0.001895 1.000
55 Momentum-Based GD 10 0.063086 0.864
56 Nesterov Accelerated GD 10 -0.026441 1.000
57 SGD 10 -0.025967 1.000
58 Mini-batch GD 10 -0.018676 1.000
59 Adagrad 10 -0.018578 1.000
References: Qiskit textbook, open-source materials, Deep Learning course (IIT M).