Minimize Functions and Enhance Machine Learning

In the previous article, we discussed functions, multivariable functions, convex and concave functions, derivatives, partial derivatives, gradients, and maxima and minima, with plots.

Introduction:

Numerous real-world challenges are modelled by mathematical functions, with specific constraints, to achieve optimization objectives.

Optimization problems involve minimizing or maximizing a given quantity under specific constraints. Real-life problems often require optimization techniques like Linear Programming Problems (LPP), Genetic Programming, Particle Swarm Optimization, Differential Evolution Algorithms, and Gradient Descent. However, each method has limitations and may not be suitable for all scenarios.

Gradient Descent, a method for finding the minimum of a differentiable and convex function, is limited to such functions but excels at minimizing loss functions in machine learning and deep learning. Its core principle is to move iteratively towards the function’s minimum from an initial point. This movement occurs in the direction of steepest descent, given by the negative gradient of the function at each point. In this process, the gradient of the function is calculated with respect to the parameters being updated.

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function. It works by taking steps proportional to the negative of the gradient of the function at a particular point. The goal is to reach the minimum of the function by iteratively updating the parameters in the direction of the steepest descent until convergence.
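As a minimal sketch of this idea (the quadratic test function and starting point below are illustrative):

```python
def gradient_descent(grad, x0, learning_rate=0.1, epochs=100):
    """Repeatedly step opposite the gradient, starting from x0."""
    x = x0
    for _ in range(epochs):
        x = x - learning_rate * grad(x)  # step in the direction of steepest descent
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2(x - 3), so the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each iteration multiplies the distance to the minimum by (1 − 2 × 0.1) = 0.8, so the iterates approach x = 3 geometrically.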

Let’s consider a function f and a set of variables x = (x₁, x₂, …, xₙ); then the objective function can be defined as

min f(x), where x ∈ ℝⁿ

The gradient (the slope of the tangent) of the function is

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)

We take partial derivatives here because x is a set of variables: each component is the rate of change of f with respect to an individual variable.
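A quick way to see the partial derivatives at work is to compare the analytic gradient with finite differences; the two-variable function here is illustrative:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3 * y**2  # illustrative function of two variables

def analytic_grad(v):
    x, y = v
    return np.array([2 * x, 6 * y])  # (∂f/∂x, ∂f/∂y)

def numeric_grad(func, v, h=1e-6):
    # Central differences: vary one variable at a time, hold the others fixed
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (func(v + e) - func(v - e)) / (2 * h)
    return g

point = np.array([1.0, -2.0])
```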

The gradient points in the direction of the steepest ascent of the function, and we want to minimize the function, so we move in the opposite direction.

Q: Why should the step be negative? How can we ascertain that moving against the gradient will indeed reduce the function?

Consider the multivariate Taylor’s series:

f(x + Δx) = f(x) + Δxᵀ∇f(x) + ½ Δxᵀ∇²f(x)Δx + …

where Δx is a small enough change.

Considering only the first-order terms gives the expression

f(x + Δx) ≈ f(x) + Δxᵀ∇f(x)

Subtracting f(x) from both sides: if the new loss is to be less than the previous loss, then

f(x + Δx) − f(x) ≈ Δxᵀ∇f(x) < 0

This is the intended condition.
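This condition can be checked numerically: for a small step against the gradient, the first-order term Δxᵀ∇f(x) is negative and the loss drops (the quadratic loss and step size below are illustrative):

```python
import numpy as np

def f(x):
    return np.sum(x**2)  # illustrative convex loss

def grad(x):
    return 2 * x

x = np.array([3.0, -4.0])
eta = 0.01
delta_x = -eta * grad(x)                # small step against the gradient
first_order = float(delta_x @ grad(x))  # Δxᵀ∇f(x), the predicted change
actual = float(f(x + delta_x) - f(x))   # the true change in the loss
```

Both the predicted and the actual change come out negative, and they agree up to the small second-order term.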

Let’s calculate the angle between the two vectors Δx and ∇f(x), so that we can identify the angle at which the decrease in the function is largest.

Let β be the angle between the two vectors Δx and ∇f(x); then the dot product can be written as

Δxᵀ∇f(x) = ‖Δx‖ ‖∇f(x)‖ cos β

Dividing both sides by ‖Δx‖ ‖∇f(x)‖ gives

cos β = Δxᵀ∇f(x) / (‖Δx‖ ‖∇f(x)‖)

From this, since −1 ≤ cos β ≤ 1, we can bound the dot product:

−‖Δx‖ ‖∇f(x)‖ ≤ Δxᵀ∇f(x) ≤ ‖Δx‖ ‖∇f(x)‖

From the first-order expression discussed above, we want this dot product to be less than zero, so that the new loss is smaller than the previous loss.

This is most negative when cos β = −1, which means β = 180⁰ gives the greatest reduction of the loss in the system.

The direction Δx that we intend to move in should therefore be at 180⁰ w.r.t. the gradient:

Δx ∝ −∇f(x)

That answers why the step should be negative: it must oppose the gradient.
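A small numeric check of the identity ‖Δx‖ ‖∇f(x)‖ cos β (with an illustrative gradient vector) confirms that, for a fixed step length, the predicted change is smallest at β = 180⁰:

```python
import numpy as np

g = np.array([3.0, 4.0])    # an illustrative gradient vector
norm_g = np.linalg.norm(g)  # ‖∇f(x)‖ = 5

def first_order_change(beta_deg, step=0.1):
    # ‖Δx‖ ‖∇f(x)‖ cos(β): the change predicted by the dot-product identity
    return step * norm_g * np.cos(np.radians(beta_deg))

changes = {b: first_order_change(b) for b in (0, 90, 180)}
```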

Now let’s simplify further; we want to know which step leads to the steepest descent.

To find Δx, take a step of length ‖Δx‖ along the unit vector in the direction opposite the gradient:

Δx = −‖Δx‖ ∇f(x) / ‖∇f(x)‖

Rewriting the above expression, let’s consider a variable α as shown below:

α = ‖Δx‖ / ‖∇f(x)‖

Now, replacing with α gives

Δx = −α ∇f(x)

Since we want to know the future step, we expand Δx, which varies with time t:

xₜ₊₁ = xₜ + Δxₜ

Further simplifying gives the Gradient Descent update rule:

xₜ₊₁ = xₜ − α ∇f(xₜ)

If we look closely at the value of α, it is a function of Δx, which is unknown. So another approach is trial and error.

In practice, α (or η) is known as the learning rate, which controls the convergence and efficiency of the optimization process. In deep learning or machine learning, the function is a loss or cost function.

Because the learning rate requires a trial-and-error approach, it has gone through a lot of research over time:

1. Learning Rate Schedules: Learning rate schedules decrease the learning rate over time. Common schedules include exponential decay, step decay, and polynomial decay.
2. Learning Rate Annealing: This technique gradually reduces the learning rate during training, often based on a predefined schedule or when certain conditions are met.
3. Learning Rate Warm-up: Initially starting with a small learning rate and gradually increasing it during the early stages of training. This helps stabilize the training process and avoid large initial updates.
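These techniques can be sketched as simple functions of the epoch number; the decay constants and warm-up length below are illustrative choices, not standard values:

```python
import numpy as np

def exponential_decay(initial_lr, epoch, k=0.05):
    # Schedule: lr_t = lr_0 * exp(-k * t), shrinking smoothly over time
    return initial_lr * np.exp(-k * epoch)

def step_decay(initial_lr, epoch, drop=0.5, every=10):
    # Schedule: multiply the rate by `drop` every `every` epochs
    return initial_lr * (drop ** (epoch // every))

def warmup(initial_lr, epoch, warmup_epochs=5):
    # Warm-up: ramp linearly from a small rate up to the target rate
    return initial_lr * min(1.0, (epoch + 1) / warmup_epochs)
```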

When the learning rate is too big, we take big steps. We may step over the minimum, then bounce back and forth several times. With big steps, Gradient Descent may never reach the minimum and converge.

One way to spot that α is too big is when the loss is not decreasing at every step.

Similarly, when the learning rate is too small, we are taking tiny steps at a time. The algorithm will eventually converge, but it might take a long time to do so.
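For f(x) = x² the update is x ← (1 − 2α)x, so the effect of the learning rate can be checked directly: a rate above 1 makes the iterates grow, while a tiny rate converges very slowly (the specific rates below are illustrative):

```python
def run(learning_rate, epochs=50, x0=10.0):
    """Gradient descent on f(x) = x^2; returns the final distance from the minimum."""
    x = x0
    for _ in range(epochs):
        x -= learning_rate * 2 * x  # gradient of x^2 is 2x
    return abs(x)

too_big   = run(1.1)    # |1 - 2*1.1| > 1: the iterates grow, no convergence
good      = run(0.1)    # |1 - 0.2| = 0.8: fast geometric convergence
too_small = run(0.001)  # 0.998 per step: converging, but still far after 50 steps
```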

below is sample Python code to plot gradient descent on f(x) = x²

import required modules

```python
import numpy as np
import matplotlib.pyplot as plt
```

defining the function y = x²

```python
# Define the function
def function(x):
    return x**2
```

```python
# Gradient Descent function
def gradient_descent(learning_rate, epochs):
    x = 10  # Initial value of x
    x_values = []
    y_values = []
    for epoch in range(epochs):
        gradient = 2 * x  # Gradient of the function f(x) = x^2
        x -= learning_rate * gradient  # Update x using Gradient Descent
        x_values.append(x)
        y_values.append(function(x))
    return x_values, y_values
```

initialize and generate the required data

```python
learning_rate = 0.1
epochs = 50
x_values, y_values = gradient_descent(learning_rate, epochs)
```

plot the graph

```python
# Plotting the function and Gradient Descent path
x = np.linspace(-10, 10, 100)
plt.plot(x, function(x), label='f(x) = x^2')
plt.scatter(x_values, y_values, color='red', label='Gradient Descent')
plt.title('Gradient Descent Optimization')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True)
plt.show()
```

The graph will display the function f(x) = x² and the path (in red dots) taken by Gradient Descent towards the minimum point.

The red dots indicate the iterative steps taken by Gradient Descent to approach the minimum of the function f(x) = x².

3D — plot

Animated image: we can see the iterations moving towards the global minimum

Quantum Algorithm:

We use Qiskit to apply gradient descent to quantum data.

import required libraries

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer
```

Define objective function and its gradient

```python
# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient_function(x):
    return 2*x
```

initialize the parameters, circuit parameters and quantum circuit

```python
# Initialize parameters
x = 2
learning_rate = 0.1
num_iterations = 10

# Quantum circuit parameters
num_qubits = 1
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
```

Define encoding block

```python
# Quantum gate for encoding parameter
def encode_parameter(qc, x):
    # Scale the parameter to fit within the range [-1, 1]
    scaled_x = (2 * np.clip(x, 0, 1)) - 1
    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)
```

measurement and simulator block

```python
# Quantum gate for measuring objective function
def measure_objective(qc):
    qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')
```

```python
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient and update parameter
    gradient = gradient_function(x)
    x -= learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameter into quantum state
    encode_parameter(qc, x)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit
    # and obtain measurement results using the run method
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate objective function value
    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

    print(f'''Iteration {i+1}: Parameter = {x},
            Objective function value = {estimated_value}''')
```

Results:

```
Iteration 1: Parameter = 1.6,
            Objective function value = -1.0
Iteration 2: Parameter = 1.28,
            Objective function value = -1.0
Iteration 3: Parameter = 1.024,
            Objective function value = -1.0
Iteration 4: Parameter = 0.8192,
            Objective function value = -0.654
Iteration 5: Parameter = 0.65536,
            Objective function value = -0.31
Iteration 6: Parameter = 0.5242880000000001,
            Objective function value = -0.034
Iteration 7: Parameter = 0.4194304000000001,
            Objective function value = 0.206
Iteration 8: Parameter = 0.33554432000000006,
            Objective function value = 0.274
Iteration 9: Parameter = 0.26843545600000007,
            Objective function value = 0.446
Iteration 10: Parameter = 0.21474836480000006,
            Objective function value = 0.568
Optimized parameter value: 0.21474836480000006
```

Now let’s apply the quantum circuit approach with Qiskit to the Iris dataset.

import required modules

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from qiskit import QuantumCircuit
from qiskit_aer import Aer
```

load the iris data using sklearn

```python
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
```

Normalize the data using Min Max

```python
# Normalize feature values
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
```

```python
# Define objective function and its gradient
def objective_function(w):
    # Simple objective function:
    # sum of squared differences between actual and predicted class labels
    predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)
    return np.sum((predicted_labels - y)**2)

def gradient_function(w):
    # Gradient of the objective function
    predicted_labels = np.dot(X_normalized, w)
    error = predicted_labels - np.eye(3)[y]
    gradient = 2 * np.dot(X_normalized.T, error)
    return gradient
```

Initialize parameters, quantum parameters and quantum circuit

```python
# Initialize parameters
num_features = X_normalized.shape[1]
num_classes = len(np.unique(y))

# Random initial weights for 4 features and 3 classes
w = np.random.rand(num_features, num_classes)
learning_rate = 0.01
num_iterations = 10

# Quantum circuit parameters
num_qubits = num_features
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
```

define encode function and measurement

```python
# Quantum gate for encoding parameters
def encode_parameters(qc, w):
    for i in range(num_qubits):
        for j in range(num_classes):
            # Clip values to ensure they fall within [0, 1] range
            clipped_value = np.clip(w[i, j], 0, 1)
            # Apply arcsin gate
            qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function
def measure_objective(qc):
    qc.measure_all()
```

```python
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient and update parameters
    gradient = gradient_function(w)
    w -= learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameters into quantum state
    encode_parameters(qc, w)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit and obtain measurement results
    simulator = Aer.get_backend('qasm_simulator')
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate objective function value
    estimated_value = (
        counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)
        ) / shots

    print(f'''Iteration {i+1}: Parameters = {w},
            Objective function value = {estimated_value}''')

# Print final result
print(f"Optimized parameters: {w}")
```

results:

```
Iteration 1: Parameters = [[-0.55944988 -1.20235758 -0.19715711]
 [ 0.0677421  -0.6549979  -0.69017965]
 [-0.80637202 -0.7717415  -0.66789503]
 [-0.40898948 -0.99027818 -0.26574509]],
            Objective function value = 0.0
Iteration 2: Parameters = [[0.93866476 1.82779808 1.68397019]
 [1.58955241 1.73571063 0.80464567]
 [0.78912134 2.69579012 1.4951194 ]
 [1.16252019 2.42544693 1.92070257]],
            Objective function value = 0.0
Iteration 3: Parameters = [[-1.91550965 -3.93520851 -1.95829006]
 [-0.47875638 -2.92780579 -2.12853125]
 [-2.52807409 -3.83502917 -2.62462701]
 [-2.16229693 -4.12999915 -2.15958773]],
            Objective function value = 0.0
Iteration 4: Parameters = [[3.53374516 7.08518456 4.94459588]
 [4.05884714 5.90135841 3.4177669 ]
 [3.61715622 8.70889964 5.23857928]
 [3.95204844 8.33826703 5.71001899]],
            Objective function value = 0.0
Iteration 5: Parameters = [[ -6.86928351 -13.9310997   -8.27883258]
 [ -4.18660174 -11.00438714  -7.20220331]
 [ -8.23932296 -15.16371601  -9.77760658]
 [ -7.88524848 -15.51153174  -9.23883976]],
            Objective function value = 0.0
Iteration 6: Parameters = [[12.98417604 26.20257591 16.91628457]
 [11.84553514 21.22601029 13.04901899]
 [14.30765938 30.46966183 18.87330136]
 [14.589388   29.95789545 19.3611545 ]],
            Objective function value = 0.0
Iteration 7: Parameters = [[-24.91667269 -50.38642912 -31.2193201 ]
 [-18.54907366 -40.32370056 -25.61671779]
 [-28.78425235 -56.57219556 -35.8297596 ]
 [-28.39726737 -56.8901145  -35.16890132]],
            Objective function value = 0.0
Iteration 8: Parameters = [[ 47.42168471  95.82127523  60.6188082 ]
 [ 39.61394634  77.13905186  48.18337601]
 [ 53.43478672 109.63079315  68.56922625]
 [ 53.59011449 108.82512053  68.97403096]],
            Objective function value = 0.0
Iteration 9: Parameters = [[ -90.6627867  -183.24148188 -114.72028405]
 [ -71.30266695 -147.08821885  -92.68442971]
 [-103.52230986 -207.55647755 -130.72377339]
 [-102.954024   -207.54749227 -129.75759053]],
            Objective function value = 0.0
Iteration 10: Parameters = [[172.90279562 349.44173097 219.92415309]
 [140.48609319 280.89857454 176.20414703]
 [196.06505421 397.93943323 249.66284424]
 [195.81699269 396.27846054 249.6286446 ]],
            Objective function value = 0.0
Optimized parameters: [[172.90279562 349.44173097 219.92415309]
 [140.48609319 280.89857454 176.20414703]
 [196.06505421 397.93943323 249.66284424]
 [195.81699269 396.27846054 249.6286446 ]]
```

Q: Why does Gradient Descent work with differentiable functions only?

As we discussed above, Gradient Descent relies on the derivative or gradient of a function to determine the direction and magnitude of the update for each parameter. It requires the function to be differentiable to compute these gradients.

Derivative for Direction:

Gradient Descent uses the derivative or gradient of a function to determine the slope or rate of change of the function at a given point. This derivative points in the direction of the steepest ascent (or descent) and helps determine the update direction for the parameters being optimized.

Update Rule:

The update rule in Gradient Descent involves subtracting a fraction of the gradient from the current parameter value. This update direction is perpendicular to the level curves or contours of the function and is crucial for reaching the minimum.

Q: Why does Gradient Descent work well with convex functions?

Gradient Descent is particularly effective with convex functions due to their fundamental properties, which allow the algorithm to converge to the global minimum efficiently.

Convex functions work well with Gradient Descent because they have a single global minimum, the gradient points toward that minimum, and the solution is unique.

Single Global Minimum:

Convex functions have a single global minimum. There are no local minima or maxima, meaning the function’s shape steadily inclines toward the minimum point. This property ensures that Gradient Descent, which aims to minimize the function, can reach the global minimum.

In convex functions, the gradient always points towards the minimum. The gradient is a vector that indicates the direction of the steepest ascent. When minimizing a convex function, the negative gradient points in the direction of the global minimum, allowing Gradient Descent to move efficiently toward it.

Convergence to the Global Minimum:

Gradient Descent iteratively updates parameters in the direction of the negative gradient. In convex functions, this iterative process converges to the global minimum, ensuring that Gradient Descent converges to the optimal solution without getting stuck in local minima or plateaus.

Uniqueness of the Solution:

Convex functions have a unique minimum, guaranteeing that Gradient Descent will find the same optimal solution regardless of the starting point or initialization.
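As a quick check of this property, running the same update on the convex function f(x) = (x − 2)² from several illustrative starting points lands on the same minimizer:

```python
def minimize(x0, learning_rate=0.1, epochs=200):
    x = x0
    for _ in range(epochs):
        x -= learning_rate * 2 * (x - 2)  # gradient of (x - 2)^2
    return x

# Different initializations all reach the unique minimum at x = 2
results = [minimize(x0) for x0 in (-50.0, 0.0, 7.5, 100.0)]
```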

updated: 22/04/2024 — code format in the cells updated.

references: Wikipedia, open source materials, deep learning course(IIT M).