## How to empower GD with Adaptive Learning Rates?

Introduction:

In the previous article, we discussed SGD. Although SGD overcomes some significant drawbacks of GD, it introduces problems of its own.

One problem is that stochastic gradients usually have high variance. As a consequence, the directions opposite to the stochastic gradients oscillate widely and are not always descent directions.

This contrasts with GD, which consistently moves toward the optimum along the direction opposite to the gradient. That article also discussed mini-batch stochastic gradient descent as a partial remedy.

Another problem with SGD is that it is challenging to set a proper learning rate due to its stochastic nature. In gradient descent, we can set learning rates adaptively using line search methods, because the directions opposite to the gradients are always descent directions. However, line search is not applicable to stochastic gradients. In addition, it is often inappropriate to use the same learning rate for all parameters.

In scenarios involving sparse data, where features occur at varying frequencies, it is unwise to update the corresponding parameters with the same learning rate. Features that occur less frequently typically call for a larger learning rate.
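As a quick preview of the per-parameter scheme developed below (all numbers here are purely illustrative): a parameter tied to a rarely occurring feature accumulates less squared gradient, so it keeps a larger effective learning rate.

```python
import numpy as np

# Two parameters: one driven by a frequent feature, one by a rare feature.
lr, eps = 0.1, 1e-8
G = np.zeros(2)                      # accumulated squared gradients, per parameter
grads = [np.array([1.0, 0.0]),       # only the frequent feature fires
         np.array([1.0, 0.0]),
         np.array([1.0, 1.0])]       # the rare feature finally fires
for g in grads:
    G += g ** 2
effective_lr = lr / np.sqrt(G + eps)
print(effective_lr)  # the rarely updated parameter keeps the larger rate
```

Here the frequent parameter's rate has decayed to 0.1/√3 ≈ 0.058, while the rare one still gets the full 0.1.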

Recall the learning rate (α) we discussed and derived in the Gradient Descent article:

Upon close observation, the learning rate α depends directly on the variables x and Δx, both of which change with each iteration. This implies that the learning rate must adjust accordingly at every iteration, i.e., whenever the value of x changes.

Let’s modify the expression for a dynamic learning rate, where the gradient changes at each iteration.

If we observe carefully, this approach normalizes by the square of the gradient norm at each iteration.
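Written out, with ∇f(x_t) denoting the gradient at iteration t, this normalization amounts to the following (a sketch of the idea, not necessarily the exact expression derived earlier):

```latex
x_{t+1} = x_t - \frac{\alpha}{\lVert \nabla f(x_t) \rVert^{2}} \, \nabla f(x_t)
```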

Problem:

The problem with normalizing by the square of the gradient norm is that “abrupt shifts in the gradient can lead to significant variations in the learning rate.”

Solution:

Accumulating gradient information (that is, involving previous gradients in the current calculation) helps stabilize the learning process. If certain parameters consistently have large gradients, using those gradients directly in the updates can lead to instability.

How it works:

Adagrad normalizes by the accumulated sum of the squares of the gradients. Because this sum only grows, the learning rate typically decreases over time, which yields more stable and controlled updates.

The update rule can be expressed as:
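In symbols, with g_t the gradient at step t and G_t the running sum of squared gradients (kept per parameter), the standard Adagrad rule is:

```latex
G_t = \sum_{i=1}^{t} g_i^{2}, \qquad
x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t}} \, g_t
```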

But there are a few challenges with this approach:

1. Damping Effect: When using the accumulated sum of squared gradients directly to adjust the learning rate, the learning rate can shrink too rapidly. This is because the sum of squared gradients grows quickly over time, especially with large or consistent gradients. As a result, the learning rate might become excessively small, leading to very slow learning and making it hard for the model to converge efficiently.
2. Normalization: Squared gradients can become very large, particularly in high-dimensional data or when gradients are frequently large. If we directly use the sum of squared gradients to adjust the learning rate, the learning rate might diminish to a very small value. This would lead to overly cautious updates and can prevent the algorithm from learning effectively.

Solution: Adagrad uses the square root of the accumulated squared gradients to moderate the learning rate adjustment. This slows the growth of the denominator, leading to a more gradual decrease in the learning rate. By growing more slowly than the sum itself, the square root ensures stable and consistent updates, preventing the learning rate from becoming excessively small and maintaining an effective learning process.
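A quick numeric check of this claim (illustrative numbers only): after ten steps with squared gradient 1.0 each, dividing by the raw sum shrinks the rate far faster than dividing by its square root.

```python
import numpy as np

alpha = 0.1
G = np.cumsum(np.ones(10))        # accumulated sum grows as 1, 2, ..., 10

raw_rate = alpha / G              # dividing by the raw sum: collapses quickly
rooted_rate = alpha / np.sqrt(G)  # Adagrad: divide by the square root instead

print(raw_rate[-1])               # 0.01
print(rooted_rate[-1])            # ~0.0316, a much gentler decay
```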

To prevent division by zero, the above expression can be modified as shown below
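With the stability constant added, the rule becomes:

```latex
x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \, g_t
```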

where ϵ is a small constant added for numerical stability.

Suppose x is a single parameter, initialized at x0 = 0, with learning rate α = 0.1 (ϵ is small enough to ignore here). For simplicity, denote the gradient at step t by g_t and the accumulated squared gradient by G_t. Suppose the gradients over the first three steps are:

g1 = 0.6, g2 = 0.4, g3 = 0.5

Time step 1:

• Accumulated squared gradient: G1 = 0.6² = 0.36
• Effective learning rate: α/√G1 = 0.1/0.6 ≈ 0.1667
• Parameter update: x1 = x0 − 0.1667 × 0.6 = 0 − 0.1 = −0.1

Time step 2:

• Accumulated squared gradient: G2 = 0.36 + 0.4² = 0.52
• Effective learning rate: α/√G2 = 0.1/0.7211 ≈ 0.1387
• Parameter update: x2 = x1 − 0.1387 × 0.4 = −0.1 − 0.0555 = −0.1555

Time step 3:

• Accumulated squared gradient: G3 = 0.52 + 0.5² = 0.77
• Effective learning rate: α/√G3 = 0.1/0.8775 ≈ 0.1140
• Parameter update: x3 = x2 − 0.1140 × 0.5 = −0.1555 − 0.0570 = −0.2125

This process continues iteratively, adjusting the learning rates and updating the parameters based on the accumulated gradients.
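The three hand-computed steps above can be reproduced with a few lines of NumPy (ϵ omitted, as in the hand calculation):

```python
import numpy as np

# Reproduce the worked Adagrad example: x0 = 0, alpha = 0.1,
# gradients 0.6, 0.4, 0.5 over three time steps.
alpha = 0.1
x, G = 0.0, 0.0
for g in [0.6, 0.4, 0.5]:
    G += g ** 2                   # accumulate squared gradients
    x -= alpha / np.sqrt(G) * g   # Adagrad update
    print(f"G = {G:.2f}, x = {x:.4f}")
```

This prints x = −0.1000, −0.1555, −0.2125, matching the hand calculation.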


Quantum Algorithm:

We use Qiskit to apply Adagrad-based gradient descent, encoding the updated parameter into a quantum state at each iteration.

Import the required libraries:

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer
```

Define the objective function and its gradient:

```python
# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient_function(x):
    return 2*x
```

Initialize the parameters, circuit settings, and quantum circuit:

```python
# Initialize parameters
x = 2
lr = 0.1  # learning rate
epsilon = 1e-8  # small constant to prevent division by zero
num_iterations = 10

# Adagrad parameters
sum_gradients_sq = 0

# Quantum circuit parameters
num_qubits = 1
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
```

Define the encoding block:

```python
# Quantum gate for encoding the parameter
def encode_parameter(qc, x):
    # Scale the parameter to fit within the range [-1, 1]
    scaled_x = (2 * np.clip(x, 0, 1)) - 1
    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)
```

Measurement and simulator block:

```python
# Quantum gate for measuring the objective function
def measure_objective(qc):
    qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')
```

```python
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient
    gradient = gradient_function(x)

    # Update sum of squared gradients
    sum_gradients_sq += gradient ** 2

    # Compute adaptive learning rate
    adaptive_learning_rate = lr / np.sqrt(sum_gradients_sq + epsilon)

    # Update parameter using Adagrad
    x -= adaptive_learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameter into quantum state
    encode_parameter(qc, x)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit and obtain measurement results
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate <Z> from the counts (used here as the objective value)
    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots
    print(f"Iteration {i+1}: Parameter = {x}, "
          f"Objective function value = {estimated_value}")

# Print final result
print(f"Optimized parameter value: {x}")
```

Results:

```
Iteration 1: Parameter = 1.90000000003125, Objective function value = -1.0
Iteration 2: Parameter = 1.8311250538504986, Objective function value = -1.0
Iteration 3: Parameter = 1.7758215150562742, Objective function value = -1.0
Iteration 4: Parameter = 1.7285570378465538, Objective function value = -1.0
Iteration 5: Parameter = 1.6867615981817918, Objective function value = -1.0
Iteration 6: Parameter = 1.6489968818891936, Objective function value = -1.0
Iteration 7: Parameter = 1.6143626690650583, Objective function value = -1.0
Iteration 8: Parameter = 1.5822515391978071, Objective function value = -1.0
Iteration 9: Parameter = 1.5522308241118843, Objective function value = -1.0
Iteration 10: Parameter = 1.5239794475687682, Objective function value = -1.0
Optimized parameter value: 1.5239794475687682
```

Now let’s compare Adagrad with GD, mini-batch GD, momentum-based GD, NAG, and SGD, as shown below.
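For the two simplest cases, here is a minimal NumPy sketch comparing plain GD and Adagrad on f(x) = x², starting from the same x0 = 2 and α = 0.1 used above (the other optimizers are omitted, and this is not the exact script that produced the comparison data):

```python
import numpy as np

def grad(x):
    return 2 * x  # gradient of f(x) = x**2

def run_gd(x0, lr, steps):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # fixed-rate gradient descent
    return x

def run_adagrad(x0, lr, steps, eps=1e-8):
    x, G = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        G += g ** 2                      # accumulate squared gradients
        x -= lr / np.sqrt(G + eps) * g   # Adagrad update
    return x

print(run_gd(2.0, 0.1, 10))       # ~0.2147: GD shrinks x by a factor 0.8 per step
print(run_adagrad(2.0, 0.1, 10))  # ~1.5240: matches the Adagrad run above
```

Note that on this smooth quadratic, plain GD converges faster: Adagrad's ever-growing denominator keeps shrinking its steps, which is exactly the damping effect discussed earlier.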

Animated Image with iterations:

Below is the data generated and used for the plots.

```
                  Optimizer  Iteration  Parameter  Objective_Function_Value
0          Gradient Descent          1   1.600000                    -1.000
1         Momentum-Based GD          1   0.800000                    -0.602
2   Nesterov Accelerated GD          1   0.310000                     0.360
3                       SGD          1  -0.090000                     1.000
4             Mini-batch GD          1  -0.490000                     1.000
5                   Adagrad          1  -0.590000                     1.000
6          Gradient Descent          2  -0.472000                     1.000
7         Momentum-Based GD          2  -0.920000                     1.000
8   Nesterov Accelerated GD          2  -0.892000                     1.000
9                       SGD          2  -0.774000                     1.000
10            Mini-batch GD          2  -1.174000                     1.000
11                  Adagrad          2  -1.145705                     1.000
12         Gradient Descent          3  -0.916564                     1.000
13        Momentum-Based GD          3  -0.810542                     1.000
14  Nesterov Accelerated GD          3  -0.671401                     1.000
15                      SGD          3  -0.442260                     1.000
16            Mini-batch GD          3  -0.324260                     1.000
17                  Adagrad          3  -0.276106                     1.000
18         Gradient Descent          4  -0.220884                     1.000
19        Momentum-Based GD          4  -0.003942                     1.000
20  Nesterov Accelerated GD          4  -0.038720                     1.000
21                      SGD          4   0.016501                     0.976
22            Mini-batch GD          4   0.245642                     0.514
23                  Adagrad          4   0.257169                     0.436
24         Gradient Descent          5   0.205735                     0.612
25        Momentum-Based GD          5   0.283561                     0.432
26  Nesterov Accelerated GD          5   0.142127                     0.672
27                      SGD          5   0.090694                     0.822
28            Mini-batch GD          5   0.145915                     0.730
29                  Adagrad          5   0.135239                     0.708
30         Gradient Descent          6   0.108191                     0.794
31        Momentum-Based GD          6   0.112506                     0.764
32  Nesterov Accelerated GD          6  -0.004542                     1.000
33                      SGD          6  -0.031590                     1.000
34            Mini-batch GD          6  -0.083024                     1.000
35                  Adagrad          6  -0.088629                     1.000
36         Gradient Descent          7  -0.070903                     1.000
37        Momentum-Based GD          7  -0.034391                     1.000
38  Nesterov Accelerated GD          7  -0.106666                     1.000
39                      SGD          7  -0.088940                     1.000
40            Mini-batch GD          7  -0.115988                     1.000
41                  Adagrad          7  -0.112317                     1.000
42         Gradient Descent          8  -0.089853                     1.000
43        Momentum-Based GD          8  -0.013757                     1.000
44  Nesterov Accelerated GD          8  -0.081293                     1.000
45                      SGD          8  -0.058830                     1.000
46            Mini-batch GD          8  -0.041104                     1.000
47                  Adagrad          8  -0.036457                     1.000
48         Gradient Descent          9  -0.029166                     1.000
49        Momentum-Based GD          9   0.049077                     0.886
50  Nesterov Accelerated GD          9  -0.033631                     1.000
51                      SGD          9  -0.026340                     1.000
52            Mini-batch GD          9  -0.003877                     1.000
53                  Adagrad          9  -0.002368                     1.000
54         Gradient Descent         10  -0.001895                     1.000
55        Momentum-Based GD         10   0.063086                     0.864
56  Nesterov Accelerated GD         10  -0.026441                     1.000
57                      SGD         10  -0.025967                     1.000
58            Mini-batch GD         10  -0.018676                     1.000
59                  Adagrad         10  -0.018578                     1.000
```

References: Qiskit Textbook, open-source materials, Deep Learning course (IIT M).