Minimize Functions and Enhance Machine Learning

In the previous article, we discussed functions, multivariable functions, convex and concave functions, derivatives, partial derivatives, gradients, and maxima and minima, with plots.

Introduction:

Numerous real-world challenges are modelled by mathematical functions, with specific constraints, to achieve optimization objectives.

Optimization problems involve minimizing or maximizing a given quantity under specific constraints. Real-life problems often require optimization techniques like Linear Programming Problems (LPP), Genetic Programming, Particle Swarm Optimization, Differential Evolution Algorithms, and Gradient Descent. However, each method has limitations and may not be suitable for all scenarios.

Gradient Descent, a method for finding the minimum of a differentiable and convex function, is limited to such functions but excels at minimizing loss functions in machine learning and deep learning. Its core principle is to move iteratively towards the function’s minimum from an initial point. This movement occurs in the direction of steepest descent, given by the negative gradient of the function at each point. In this process, the gradient of the function is calculated with respect to the parameters being updated.

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function. It works by taking steps proportional to the negative of the gradient of the function at a particular point. The goal is to reach the minimum of the function by iteratively updating the parameters in the direction of the steepest descent until convergence.
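As a minimal sketch of this idea (the quadratic test function and starting point below are illustrative):

```python
def gradient_descent(grad, x0, learning_rate=0.1, epochs=100):
    """Repeatedly step opposite the gradient, starting from x0."""
    x = x0
    for _ in range(epochs):
        x = x - learning_rate * grad(x)  # step in the direction of steepest descent
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2(x - 3), so the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each iteration multiplies the distance to the minimum by (1 − 2 × 0.1) = 0.8, so the iterates approach x = 3 geometrically.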

Let’s consider a function f and a set of variables x = (x₁, x₂, …, xₙ); then the objective function can be defined as

min f(x), where x ∈ ℝⁿ

The gradient (the slope of the tangent) of the function is

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)

We take partial derivatives here because x is a set of variables: each component is the rate of change of f with respect to an individual variable.
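A quick way to see the partial derivatives at work is to compare the analytic gradient with finite differences; the two-variable function here is illustrative:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3 * y**2  # illustrative function of two variables

def analytic_grad(v):
    x, y = v
    return np.array([2 * x, 6 * y])  # (∂f/∂x, ∂f/∂y)

def numeric_grad(func, v, h=1e-6):
    # Central differences: vary one variable at a time, hold the others fixed
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (func(v + e) - func(v - e)) / (2 * h)
    return g

point = np.array([1.0, -2.0])
```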

The gradient points in the direction of the steepest ascent of the function, and we want to minimize the function, so we move in the opposite direction.

Q: Why should the step be negative? How can we ascertain that moving against the gradient will indeed reduce the function?

Consider the multivariate Taylor’s series:

f(x + Δx) = f(x) + Δxᵀ∇f(x) + ½ Δxᵀ∇²f(x)Δx + …

where Δx is a small enough change.

Considering only the first-order terms gives the expression

f(x + Δx) ≈ f(x) + Δxᵀ∇f(x)

Subtracting f(x) from both sides: if the new loss is to be less than the previous loss, then

f(x + Δx) − f(x) ≈ Δxᵀ∇f(x) < 0

This is the intended condition.
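This condition can be checked numerically: for a small step against the gradient, the first-order term Δxᵀ∇f(x) is negative and the loss drops (the quadratic loss and step size below are illustrative):

```python
import numpy as np

def f(x):
    return np.sum(x**2)  # illustrative convex loss

def grad(x):
    return 2 * x

x = np.array([3.0, -4.0])
eta = 0.01
delta_x = -eta * grad(x)                # small step against the gradient
first_order = float(delta_x @ grad(x))  # Δxᵀ∇f(x), the predicted change
actual = float(f(x + delta_x) - f(x))   # the true change in the loss
```

Both the predicted and the actual change come out negative, and they agree up to the small second-order term.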

Let’s calculate the angle between the two vectors Δx and ∇f(x), so that we can identify the angle at which the decrease in the function is largest.

Let β be the angle between the two vectors Δx and ∇f(x); then the dot product can be written as

Δxᵀ∇f(x) = ‖Δx‖ ‖∇f(x)‖ cos β

Dividing both sides by ‖Δx‖ ‖∇f(x)‖ gives

cos β = Δxᵀ∇f(x) / (‖Δx‖ ‖∇f(x)‖)

From this, since −1 ≤ cos β ≤ 1, we can bound the dot product:

−‖Δx‖ ‖∇f(x)‖ ≤ Δxᵀ∇f(x) ≤ ‖Δx‖ ‖∇f(x)‖

From the first-order expression discussed above, we want this dot product to be less than zero, so that the new loss is smaller than the previous loss.

This is most negative when cos β = −1, which means β = 180⁰ gives the greatest reduction of the loss in the system.

The direction Δx that we intend to move in should therefore be at 180⁰ w.r.t. the gradient:

Δx ∝ −∇f(x)

That answers why the step should be negative: it must oppose the gradient.
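A small numeric check of the identity ‖Δx‖ ‖∇f(x)‖ cos β (with an illustrative gradient vector) confirms that, for a fixed step length, the predicted change is smallest at β = 180⁰:

```python
import numpy as np

g = np.array([3.0, 4.0])    # an illustrative gradient vector
norm_g = np.linalg.norm(g)  # ‖∇f(x)‖ = 5

def first_order_change(beta_deg, step=0.1):
    # ‖Δx‖ ‖∇f(x)‖ cos(β): the change predicted by the dot-product identity
    return step * norm_g * np.cos(np.radians(beta_deg))

changes = {b: first_order_change(b) for b in (0, 90, 180)}
```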

Now let’s simplify further; we want to know which step leads to the steepest descent.

To find Δx, take a step of length ‖Δx‖ along the unit vector in the direction opposite the gradient:

Δx = −‖Δx‖ ∇f(x) / ‖∇f(x)‖

Rewriting the above expression, let’s consider a variable α as shown below:

α = ‖Δx‖ / ‖∇f(x)‖

Now, replacing with α gives

Δx = −α ∇f(x)

Since we want to know the future step, we expand Δx, which varies with time t:

xₜ₊₁ = xₜ + Δxₜ

Further simplifying gives the Gradient Descent update rule:

xₜ₊₁ = xₜ − α ∇f(xₜ)

If we look closely at the value of α, it is a function of Δx, which is unknown. So another approach is trial and error.

In practice, α (or η) is known as the learning rate, which controls the convergence and efficiency of the optimization process. In deep learning or machine learning, the function is a loss or cost function.

Because the learning rate requires a trial-and-error approach, it has gone through a lot of research over time:

1. Learning Rate Schedules: Learning rate schedules decrease the learning rate over time. Common schedules include exponential decay, step decay, and polynomial decay.
2. Learning Rate Annealing: This technique gradually reduces the learning rate during training, often based on a predefined schedule or when certain conditions are met.
3. Learning Rate Warm-up: Initially starting with a small learning rate and gradually increasing it during the early stages of training. This helps stabilize the training process and avoid large initial updates.
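These techniques can be sketched as simple functions of the epoch number; the decay constants and warm-up length below are illustrative choices, not standard values:

```python
import numpy as np

def exponential_decay(initial_lr, epoch, k=0.05):
    # Schedule: lr_t = lr_0 * exp(-k * t), shrinking smoothly over time
    return initial_lr * np.exp(-k * epoch)

def step_decay(initial_lr, epoch, drop=0.5, every=10):
    # Schedule: multiply the rate by `drop` every `every` epochs
    return initial_lr * (drop ** (epoch // every))

def warmup(initial_lr, epoch, warmup_epochs=5):
    # Warm-up: ramp linearly from a small rate up to the target rate
    return initial_lr * min(1.0, (epoch + 1) / warmup_epochs)
```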

When the learning rate is too big, we take big steps. We may step over the minimum, then bounce back and forth several times. With big steps, Gradient Descent may never reach the minimum and converge.

One way to spot that α is too big is when the loss is not decreasing at every step.

Similarly, when the learning rate is too small, we are taking tiny steps at a time. The algorithm will eventually converge, but it might take a long time to do so.
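For f(x) = x² the update is x ← (1 − 2α)x, so the effect of the learning rate can be checked directly: a rate above 1 makes the iterates grow, while a tiny rate converges very slowly (the specific rates below are illustrative):

```python
def run(learning_rate, epochs=50, x0=10.0):
    """Gradient descent on f(x) = x^2; returns the final distance from the minimum."""
    x = x0
    for _ in range(epochs):
        x -= learning_rate * 2 * x  # gradient of x^2 is 2x
    return abs(x)

too_big   = run(1.1)    # |1 - 2*1.1| > 1: the iterates grow, no convergence
good      = run(0.1)    # |1 - 0.2| = 0.8: fast geometric convergence
too_small = run(0.001)  # 0.998 per step: converging, but still far after 50 steps
```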

below is sample Python code to plot gradient descent on f(x) = x²

import required modules

```python
import numpy as np
import matplotlib.pyplot as plt
```

defining the function y = x²

```python
# Define the function
def function(x):
    return x**2
```

```python
# Gradient Descent function
def gradient_descent(learning_rate, epochs):
    x = 10  # Initial value of x
    x_values = []
    y_values = []
    for epoch in range(epochs):
        gradient = 2 * x  # Gradient of the function f(x) = x^2
        x -= learning_rate * gradient  # Update x using Gradient Descent
        x_values.append(x)
        y_values.append(function(x))
    return x_values, y_values
```

initialize and generate the required data

```python
learning_rate = 0.1
epochs = 50
x_values, y_values = gradient_descent(learning_rate, epochs)
```

plot the graph

```python
# Plotting the function and Gradient Descent path
x = np.linspace(-10, 10, 100)
plt.plot(x, function(x), label='f(x) = x^2')
plt.scatter(x_values, y_values, color='red', label='Gradient Descent')
plt.title('Gradient Descent Optimization')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True)
plt.show()
```

The graph will display the function f(x) = x² and the path (in red dots) taken by Gradient Descent towards the minimum point.

The red dots indicate the iterative steps taken by Gradient Descent to approach the minimum of the function f(x) = x².

3D — plot

Animated image: we can see the iterations moving towards the global minimum

Quantum Algorithm:

We use Qiskit to apply gradient descent to quantum data.

import required libraries

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit_aer import Aer
```

Define objective function and its gradient

```python
# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient_function(x):
    return 2*x
```

initialize the parameters, circuit parameters and quantum circuit

```python
# Initialize parameters
x = 2
learning_rate = 0.1
num_iterations = 10

# Quantum circuit parameters
num_qubits = 1
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
```

Define encoding block

```python
# Quantum gate for encoding parameter
def encode_parameter(qc, x):
    # Scale the parameter to fit within the range [-1, 1]
    scaled_x = (2 * np.clip(x, 0, 1)) - 1
    qc.ry(2 * np.arcsin(np.sqrt((scaled_x + 1) / 2)), 0)
```

measurement and simulator block

```python
# Quantum gate for measuring objective function
def measure_objective(qc):
    qc.measure(0, 0)

# Get simulator backend
simulator = Aer.get_backend('qasm_simulator')
```

```python
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient and update parameter
    gradient = gradient_function(x)
    x -= learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameter into quantum state
    encode_parameter(qc, x)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit
    # and obtain measurement results using the run method
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate objective function value
    estimated_value = (counts.get('0', 0) - counts.get('1', 0)) / shots

    print(f'''Iteration {i+1}: Parameter = {x},
            Objective function value = {estimated_value}''')
```

Results:

```
Iteration 1: Parameter = 1.6,
            Objective function value = -1.0
Iteration 2: Parameter = 1.28,
            Objective function value = -1.0
Iteration 3: Parameter = 1.024,
            Objective function value = -1.0
Iteration 4: Parameter = 0.8192,
            Objective function value = -0.654
Iteration 5: Parameter = 0.65536,
            Objective function value = -0.31
Iteration 6: Parameter = 0.5242880000000001,
            Objective function value = -0.034
Iteration 7: Parameter = 0.4194304000000001,
            Objective function value = 0.206
Iteration 8: Parameter = 0.33554432000000006,
            Objective function value = 0.274
Iteration 9: Parameter = 0.26843545600000007,
            Objective function value = 0.446
Iteration 10: Parameter = 0.21474836480000006,
            Objective function value = 0.568
Optimized parameter value: 0.21474836480000006
```

Now let’s apply the quantum circuit approach with Qiskit to the Iris dataset.

import required modules

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from qiskit import QuantumCircuit
from qiskit_aer import Aer
```

load the iris data using sklearn

```python
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
```

Normalize the data using Min Max

```python
# Normalize feature values
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
```

```python
# Define objective function and its gradient
def objective_function(w):
    # Simple objective function:
    # sum of squared differences between actual and predicted class labels
    predicted_labels = np.argmax(np.dot(X_normalized, w), axis=1)
    return np.sum((predicted_labels - y)**2)

def gradient_function(w):
    # Gradient of the objective function
    predicted_labels = np.dot(X_normalized, w)
    error = predicted_labels - np.eye(3)[y]
    gradient = 2 * np.dot(X_normalized.T, error)
    return gradient
```

Initialize parameters, quantum parameters and quantum circuit

```python
# Initialize parameters
num_features = X_normalized.shape[1]
num_classes = len(np.unique(y))

# Random initial weights for 4 features and 3 classes
w = np.random.rand(num_features, num_classes)
learning_rate = 0.01
num_iterations = 10

# Quantum circuit parameters
num_qubits = num_features
shots = 1000

# Initialize quantum circuit
qc = QuantumCircuit(num_qubits, num_qubits)
```

define encode function and measurement

```python
# Quantum gate for encoding parameters
def encode_parameters(qc, w):
    for i in range(num_qubits):
        for j in range(num_classes):
            # Clip values to ensure they fall within [0, 1] range
            clipped_value = np.clip(w[i, j], 0, 1)
            # Apply arcsin gate
            qc.ry(2 * np.arcsin(np.sqrt(clipped_value)), i)

# Quantum gate for measuring objective function
def measure_objective(qc):
    qc.measure_all()
```

```python
# Run gradient descent iterations
for i in range(num_iterations):
    # Compute gradient and update parameters
    gradient = gradient_function(w)
    w -= learning_rate * gradient

    # Reset quantum circuit for each iteration
    qc.data = []

    # Encode parameters into quantum state
    encode_parameters(qc, w)

    # Measure objective function
    measure_objective(qc)

    # Execute the circuit and obtain measurement results
    simulator = Aer.get_backend('qasm_simulator')
    job = simulator.run(qc, shots=shots)
    result = job.result()
    counts = result.get_counts(qc)

    # Estimate objective function value
    estimated_value = (
        counts.get('0' * num_qubits, 0) - counts.get('1' * num_qubits, 0)
        ) / shots

    print(f'''Iteration {i+1}: Parameters = {w},
            Objective function value = {estimated_value}''')

# Print final result
print(f"Optimized parameters: {w}")
```

results:

```
Iteration 1: Parameters = [[-0.55944988 -1.20235758 -0.19715711]
 [ 0.0677421  -0.6549979  -0.69017965]
 [-0.80637202 -0.7717415  -0.66789503]
 [-0.40898948 -0.99027818 -0.26574509]],
            Objective function value = 0.0
Iteration 2: Parameters = [[0.93866476 1.82779808 1.68397019]
 [1.58955241 1.73571063 0.80464567]
 [0.78912134 2.69579012 1.4951194 ]
 [1.16252019 2.42544693 1.92070257]],
            Objective function value = 0.0
Iteration 3: Parameters = [[-1.91550965 -3.93520851 -1.95829006]
 [-0.47875638 -2.92780579 -2.12853125]
 [-2.52807409 -3.83502917 -2.62462701]
 [-2.16229693 -4.12999915 -2.15958773]],
            Objective function value = 0.0
Iteration 4: Parameters = [[3.53374516 7.08518456 4.94459588]
 [4.05884714 5.90135841 3.4177669 ]
 [3.61715622 8.70889964 5.23857928]
 [3.95204844 8.33826703 5.71001899]],
            Objective function value = 0.0
Iteration 5: Parameters = [[ -6.86928351 -13.9310997   -8.27883258]
 [ -4.18660174 -11.00438714  -7.20220331]
 [ -8.23932296 -15.16371601  -9.77760658]
 [ -7.88524848 -15.51153174  -9.23883976]],
            Objective function value = 0.0
Iteration 6: Parameters = [[12.98417604 26.20257591 16.91628457]
 [11.84553514 21.22601029 13.04901899]
 [14.30765938 30.46966183 18.87330136]
 [14.589388   29.95789545 19.3611545 ]],
            Objective function value = 0.0
Iteration 7: Parameters = [[-24.91667269 -50.38642912 -31.2193201 ]
 [-18.54907366 -40.32370056 -25.61671779]
 [-28.78425235 -56.57219556 -35.8297596 ]
 [-28.39726737 -56.8901145  -35.16890132]],
            Objective function value = 0.0
Iteration 8: Parameters = [[ 47.42168471  95.82127523  60.6188082 ]
 [ 39.61394634  77.13905186  48.18337601]
 [ 53.43478672 109.63079315  68.56922625]
 [ 53.59011449 108.82512053  68.97403096]],
            Objective function value = 0.0
Iteration 9: Parameters = [[ -90.6627867  -183.24148188 -114.72028405]
 [ -71.30266695 -147.08821885  -92.68442971]
 [-103.52230986 -207.55647755 -130.72377339]
 [-102.954024   -207.54749227 -129.75759053]],
            Objective function value = 0.0
Iteration 10: Parameters = [[172.90279562 349.44173097 219.92415309]
 [140.48609319 280.89857454 176.20414703]
 [196.06505421 397.93943323 249.66284424]
 [195.81699269 396.27846054 249.6286446 ]],
            Objective function value = 0.0
Optimized parameters: [[172.90279562 349.44173097 219.92415309]
 [140.48609319 280.89857454 176.20414703]
 [196.06505421 397.93943323 249.66284424]
 [195.81699269 396.27846054 249.6286446 ]]
```

Q: Why does Gradient Descent work with differentiable functions only?

As we discussed above, Gradient Descent relies on the derivative or gradient of a function to determine the direction and magnitude of the update for each parameter. It requires the function to be differentiable to compute these gradients.

Derivative for Direction:

Gradient Descent uses the derivative or gradient of a function to determine the slope or rate of change of the function at a given point. This derivative points in the direction of the steepest ascent (or descent) and helps determine the update direction for the parameters being optimized.

Update Rule:

The update rule in Gradient Descent involves subtracting a fraction of the gradient from the current parameter value. This update direction is perpendicular to the level curves or contours of the function and is crucial for reaching the minimum.

Q: Why does Gradient Descent work well with convex functions?

Gradient Descent is particularly effective with convex functions due to their fundamental properties, which allow the algorithm to converge to the global minimum efficiently.

Convex functions work well with Gradient Descent because they have a single global minimum, the gradient points toward that minimum, and the solution is unique.

Single Global Minimum:

Convex functions have a single global minimum. There are no local minima or maxima, meaning the function’s shape steadily inclines toward the minimum point. This property ensures that Gradient Descent, which aims to minimize the function, can reach the global minimum.

In convex functions, the gradient always points towards the minimum. The gradient is a vector that indicates the direction of the steepest ascent. When minimizing a convex function, the negative gradient points in the direction of the global minimum, allowing Gradient Descent to move efficiently toward it.

Convergence to the Global Minimum:

Gradient Descent iteratively updates parameters in the direction of the negative gradient. In convex functions, this iterative process converges to the global minimum, ensuring that Gradient Descent converges to the optimal solution without getting stuck in local minima or plateaus.

Uniqueness of the Solution:

Convex functions have a unique minimum, guaranteeing that Gradient Descent will find the same optimal solution regardless of the starting point or initialization.
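As a quick check of this property, running the same update on the convex function f(x) = (x − 2)² from several illustrative starting points lands on the same minimizer:

```python
def minimize(x0, learning_rate=0.1, epochs=200):
    x = x0
    for _ in range(epochs):
        x -= learning_rate * 2 * (x - 2)  # gradient of (x - 2)^2
    return x

# Different initializations all reach the unique minimum at x = 2
results = [minimize(x0) for x0 in (-50.0, 0.0, 7.5, 100.0)]
```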

updated: 22/04/2024 — code format in the cells updated.

references: Wikipedia, open source materials, deep learning course(IIT M).