Gradient Descent and Newton's Method Tutorial

We reviewed some basic calculus including the idea of a Taylor series as a way of approximating a function about a point using an infinite series:

\begin{equation} f(x) = \sum\limits_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x - a)^n \end{equation}

This is a very important and simple idea. The term $f^{(n)}(a)$ represents the $n$th derivative of $f$ evaluated at the point $a$; when $a=0$ the series is called a Maclaurin series.

We saw the following example. Suppose $a=0$ and we have a function $f(x) = e^{x}$. Note that $\frac{d}{dx}f(x) = e^{x}$. Expanding this by applying the above and we get: \begin{equation} f(x) = e^{x} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \mathcal{O}(x^4) \end{equation}

Let's look at the first few terms. The zeroth-order term, $x^0 = 1$, describes a horizontal line where $f(x) = 1$ regardless of the value of $x$. The first-order term, $x$, is the equation of a line through the origin with slope 1. Adding these two terms, $1 + x$, gives a linear or $\textit{first order approximation}$ of the function $f(x) = e^{x}$. Let's plot this.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
def graph(formula, x_range):  
    # Evaluate `formula` over `x_range` and add the curve to the current figure.
    x = np.array(x_range)  
    y = formula(x) 
    plt.plot(x, y)  
    #plt.show()  
In [3]:
graph(lambda x: 1 + x, np.arange(-3,3,0.1))
graph(lambda x: np.exp(x), np.arange(-3,3,0.1))

What happens if we add more terms? The second-order term is $\frac{x^2}{2!}$.

In [4]:
graph(lambda x: 1 + x + 0.5*x**2, np.arange(-3,3,0.1))
graph(lambda x: np.exp(x), np.arange(-3,3,0.1))

The approximation now hugs the function over a wider range: adding more terms gives us a better local approximation. What happens if we add the third-order term $\frac{x^3}{3!}$, so that our approximation is now:

\begin{equation} f(x) \approx 1 + x + \frac{x^2}{2} + \frac{x^3}{6} \end{equation}
In [5]:
graph(lambda x: 1 + x + 0.5*x**2 + (1./6)*x**3, np.arange(-3,3,0.1))
graph(lambda x: np.exp(x), np.arange(-3,3,0.1))

Check Your Understanding

  1. Try adding higher order terms yourself. Then change the function, compute its Taylor series via the formula above, and plot the result for approximations of different order (see the sketch after this list for one way to set this up).

  2. Prove Taylor's theorem.
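
As a starting point for exercise 1 (the helper `taylor_exp` below is an illustrative sketch, not part of the original notebook), we can use the fact that every derivative of $e^x$ at $a=0$ equals 1 to build the order-$k$ Maclaurin polynomial directly:

In [ ]:
import math

def taylor_exp(x, order):
    # Order-`order` Maclaurin polynomial of exp(x): sum_{n=0}^{order} x^n / n!
    return sum(x**n / math.factorial(n) for n in range(order + 1))

# Overlay approximations of increasing order on top of exp(x) itself.
for k in [1, 2, 3, 5]:
    graph(lambda x, k=k: taylor_exp(x, k), np.arange(-3, 3, 0.1))
graph(lambda x: np.exp(x), np.arange(-3, 3, 0.1))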

Isaac Newton developed a method to exploit the above in order to find roots of a function. That is, for a function $f(x)$, find the value of $x$ such that $f(x) = 0$. Consider a first order expansion of a function:

\begin{equation} y = f(x_n) + f'(x_n)(x-x_n) \end{equation}

Set this equal to zero and solve for the next point in terms of previous points:

\begin{align} 0 &= f(x_n) + f'(x_n)(x_{n+1} - x_n)\\ f'(x_n)x_{n+1} &= f'(x_n)x_n - f(x_n)\\ x_{n+1} &= x_n - \frac{f(x_n)}{f'(x_n)} \end{align}

This gives an iterative method for finding roots. Write a function that solves for the roots of an equation. Before you continue, make sure you understand this.

You can check a trivial implementation here:

http://danielhomola.com/2016/02/09/newtons-method-with-10-lines-of-python/

In [6]:
def dx(f, x):
    # Distance of f(x) from zero; used as the convergence criterion.
    return abs(0 - f(x))

def newtons_method(f, df, x0, epsilon):
    # Iterate x <- x - f(x)/f'(x) until |f(x)| falls below epsilon.
    delta = dx(f, x0)
    while delta > epsilon:
        x0 = x0 - f(x0)/df(x0)
        delta = dx(f, x0)
    print('Root at: ', x0)
    print('f(x) at root: ', f(x0))
    return delta

To use Newton's method (for finding roots) we need to know the function $f(x)$ and its derivative $f'(x)$. Let's test it out on the following polynomial:

\begin{equation} f(x) = 6x^{5} - 5x^{4} - 4x^{3} + 3x^{2} \end{equation}

Taking the derivative: \begin{equation} f'(x) = 30x^{4} - 20x^{3} - 12x^{2} + 6x \end{equation}

In [7]:
def f(x):
    return 6*x**5-5*x**4-4*x**3+3*x**2
 
def df(x):
    return 30*x**4-20*x**3-12*x**2+6*x

Let's plot the function:

In [8]:
graph(lambda x: 6*x**5-5*x**4-4*x**3+3*x**2, np.arange(-1.5,2.,0.1))

We can now find the root for any initial point we are interested in:

In [9]:
x0s = [0, .5, 1, 1.5]
for x0 in x0s:
    newtons_method(f, df, x0, 1e-5)
Root at:  0
f(x) at root:  0
Root at:  0.628668078167
f(x) at root:  -1.37853879978e-06
Root at:  1
f(x) at root:  0
Root at:  1.00000000005
f(x) at root:  2.16141327058e-10

We can also use Newton's method for finding square roots, or evaluating functions. Suppose we want to numerically evaluate $\sqrt{612}$. Define a function:

\begin{align} f(x) &= x^2 - 612\\ f'(x) &= 2x \end{align}
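
For reference, a minimal sketch of that computation, reusing the `newtons_method` function defined above (the starting point and tolerance here are illustrative choices), might look like:

In [ ]:
# f(x) = x^2 - 612 has roots at +/- sqrt(612); f'(x) = 2x.
# Starting from a positive x0, the iteration converges to the positive root.
newtons_method(lambda x: x**2 - 612, lambda x: 2*x, x0=10.0, epsilon=1e-8)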

Check Your Understanding

1) Call the procedure to generate an approximation to $\sqrt{612}$ by solving for the roots of $f(x)$ above.

2) Solve for the roots of the following:

\begin{align} f(x) &= \cos(x) - x^{3}\\ f'(x) &= -\sin(x) - 3x^{2} \end{align}

What is a good starting value? Since $\cos(x)$ is bounded by $[-1,1]$, any root must satisfy $|x^3| \le 1$, i.e. $|x| \le 1$, so try an $x_0$ value of 0.5.

3) Define a few functions yourself, analytically derive / define their derivatives and find their roots in a given interval with starting values. Make sure that you are comfortable with this idea.

Newton-Raphson Method for Optimization

We can write Taylor's theorem for functions of more than one variable. For example, let's consider a vector $\mathbf{x} \in \mathbb{R}^d$. We can state Taylor's theorem in $d$ dimensions as follows:

\begin{equation} f(\mathbf{x}) = f(\mathbf{a}) + \sum\limits_{k=1}^{d}\frac{\partial f}{\partial x_k}(\mathbf{a})(x_k-a_k) + \frac{1}{2}\sum\limits_{j,k=1}^{d}\frac{\partial^2 f}{\partial x_j \partial x_k}(\mathbf{a})(x_j-a_j)(x_k-a_k) + \mathcal{O}(\|\mathbf{x}-\mathbf{a}\|^3) \end{equation}
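
As a quick numerical sanity check (not part of the original derivation), here is a minimal sketch that compares a function against its second-order expansion about a point, using an illustrative test function $f(x,y) = e^{x}\sin(y)$ with a hand-computed gradient and Hessian:

In [ ]:
# Second-order Taylor expansion of f about the point a, evaluated at x.
# grad and hess return the gradient vector and Hessian matrix of f at a point.
def taylor2(f, grad, hess, a, x):
    d = x - a
    return f(a) + grad(a) @ d + 0.5 * d @ hess(a) @ d

# Illustrative test function f(x, y) = exp(x) * sin(y) and its derivatives.
f_test    = lambda p: np.exp(p[0]) * np.sin(p[1])
grad_test = lambda p: np.array([np.exp(p[0]) * np.sin(p[1]),
                                np.exp(p[0]) * np.cos(p[1])])
hess_test = lambda p: np.array([[np.exp(p[0]) * np.sin(p[1]),  np.exp(p[0]) * np.cos(p[1])],
                                [np.exp(p[0]) * np.cos(p[1]), -np.exp(p[0]) * np.sin(p[1])]])

a = np.array([0.5, 1.0])
x = np.array([0.6, 1.1])  # a nearby point
print(f_test(x), taylor2(f_test, grad_test, hess_test, a, x))  # the two values should be close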

Now consider a second order approximation of a multivariate function around the current iterate $x_n$, written in terms of the gradient $\nabla f(x_n)$ and the Hessian $H(x_n)$:

\begin{align} f(x_{n+1}) &\approx f(x_n) + \nabla f(x_n)^T(x_{n+1} - x_n) + \frac{1}{2}(x_{n+1} - x_n)^T H(x_n)(x_{n+1}-x_n) \end{align}\begin{equation} H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_d} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_2\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d\partial x_1} & \frac{\partial^2 f}{\partial x_d\partial x_2} & \dots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix} \end{equation}

We want to minimize this quadratic function. What value of $x_{n+1}$ minimizes the second order approximation on the RHS? Set its gradient with respect to $x_{n+1}$ equal to zero, expand terms to simplify, and solve for $x_{n+1}$. When $H(x_n)$ is positive definite, our Newton iteration is:

\begin{align} x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n) \end{align}

In practice we often add a step size $\eta$:

\begin{align} x_{n+1} = x_n - \eta \times H(x_n)^{-1}\nabla f(x_n) \end{align}
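
The exercises below ask you to implement this yourself; as an illustration, here is a minimal sketch of the update applied to a generic convex quadratic (the function, step size, and tolerance are illustrative choices, not the function from the exercises):

In [ ]:
# Newton-Raphson update for optimization: x <- x - eta * H(x)^{-1} grad f(x).
# Solving the linear system avoids explicitly inverting the Hessian.
def newton_opt(grad, hess, x0, eta=1.0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - eta * step
        if np.linalg.norm(grad(x)) < tol:   # stop once the gradient (nearly) vanishes
            break
    return x

# Illustrative convex quadratic f(x, y) = (x - 1)^2 + 3(y + 2)^2, minimized at (1, -2).
quad_grad = lambda p: np.array([2*(p[0] - 1), 6*(p[1] + 2)])
quad_hess = lambda p: np.array([[2., 0.], [0., 6.]])
print(newton_opt(quad_grad, quad_hess, x0=[5., 5.]))  # should converge to [1., -2.]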

Check Your Understanding

0) Set the gradient of the quadratic approximation above with respect to $x_{n+1}$ to zero, simplify terms, and derive the Newton update equation.

1) Consider the function below. It has a trivial maximum at $(0,0)$. Take the gradient and the Hessian. What does it mean for the Hessian to be positive definite?

\begin{equation} f(x,y) = 50 - x^2 - 2y^2 \end{equation}

2) Implement Newton's method for optimization to solve for the optimum of the function.

3) Why did we stop at a second order approximation? That is, would a 3rd order approximation work? Why or why not?

Gradient Descent

Gradient descent is another common technique for finding the optimum of a function. Like Newton's method, it is an iterative process (which, in the limit of small steps, is connected to a differential equation). Let's look at our old example from least squares, the mean squared error. Note that this is a function of the slope $m$ and the intercept $b$ (not the $x_i$'s and $y_i$'s). This multivariate function defines a bowl analogous to the simplified quadratic function $h(x,y)$ below:

\begin{align} f(m,b) &= \frac{1}{N}\sum\limits_{i=1}^{N}(y_i-(mx_i+b))^2 \\ h(x,y) &= \frac{x^2 + y^2}{C} \end{align}

This is a function of two variables $(x,y)$ or $(m,b)$ and so we can plot it in three dimensions:

In [10]:
from mpl_toolkits.mplot3d import Axes3D

# Define the quadratic bowl f(x, y) = x^2 + y^2
def fun(x, y):
    return (x**2 + y**2)
In [27]:
# We call fig to make a figure
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
# Here we define arrays of numbers between -3 and 3 in steps of 0.015
x = y = np.arange(-3.0, 3.0, 0.015)
# We define a mesh of points using our arrays x and y
X, Y = np.meshgrid(x, y)
# Here we evaluate a function at each of the grid points
zs = np.array([fun(x,y) for x,y in zip(np.ravel(X), np.ravel(Y))])
Z = zs.reshape(X.shape)
# Here we plot the function itself
ax.plot_surface(X, Y, Z)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x,y)')

plt.show()

The idea behind gradient descent is to roll down the bowl until we reach its minimum.

Recall that for a multivariate function $f(x,y)$, the gradient vector field is defined:

\begin{equation} \nabla f(x,y) = \frac{\partial f}{\partial x}i + \frac{\partial f}{\partial y}j \end{equation}

When we speak of a field here (not the algebraic structure), we mean a function that assigns a value to each point in space: a scalar field assigns a scalar, while a vector field such as the gradient assigns a vector. By definition the derivative is the slope of the tangent line at a point:

\begin{equation} \frac{d}{dx}f(x) = \lim_{h\rightarrow 0} \frac{f(x+h) - f(x)}{h} \end{equation}

The gradient generalizes this idea to several variables: it points in the direction of steepest ascent. Let's visualize this by plotting the vector field for the quadratic function above.

In [12]:
# Define a grid of 40 points from -5 to 5 for two variables (x and y)
x = y = np.linspace(-5., 5., 40)
xv, yv = np.meshgrid(x, y, indexing='ij', sparse=False)

# Evaluate h(x, y) = (x^2 + y^2) / C with C = 2^2 = 4 on the grid
hv = (xv**2+yv**2)/(2**2)

# Define another grid for the derivatives
x2 = y2 = np.linspace(-5.,5.,40)
x2v, y2v = np.meshgrid(x2, y2, indexing='ij', sparse=False)
In [13]:
# We can approximate the partial derivatives numerically with NumPy's finite differences
dhdx, dhdy = np.gradient(hv) # dh/dx, dh/dy
In [14]:
# Now plot the vector field. You can add a contour plot by uncommenting below.
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
ax.quiver(x2v, y2v, dhdx, dhdy, color='black',
          angles='xy', scale_units='xy')
#ax.contour(xv, yv, hv)
plt.axis('equal')
plt.title('Gradient Vector Field')
Out[14]:
<matplotlib.text.Text at 0x1082ef950>

When the gradient is zero, the slope of the tangent plane is flat. As we move further away from the bottom of the bowl, the arrows increase in magnitude. The optimum of the function is at the bottom of the bowl. Stepping opposite to any of the arrows corresponds to stepping in the direction opposite to the gradient, and thus in a direction that takes us towards the minimum (in this case, towards the origin).
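
To make this concrete, here is a minimal sketch of gradient descent on the bowl $h(x,y) = (x^2 + y^2)/4$ plotted above, using its analytic gradient (the step size and number of iterations are illustrative choices):

In [ ]:
# The gradient of h(x, y) = (x^2 + y^2) / 4 is (x/2, y/2).
def grad_h(p):
    return np.array([p[0] / 2.0, p[1] / 2.0])

def gradient_descent(grad, x0, eta=0.5, n_steps=100):
    # Repeatedly step in the direction opposite to the gradient.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad(x)
    return x

print(gradient_descent(grad_h, x0=[4.0, -3.0]))  # should end up very close to the origin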

Check Your Understanding

  1. Plot the mean squared error loss function and its gradient vector field for least squares regression.
  2. Plot the function below and its gradient vector field (where $\theta = [\theta_0, \theta_1]^T$ and $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function). \begin{equation} f(\theta) = \sum\limits_{i=1}^{N}y_i \ln(\sigma(\theta^Tx_i)) + (1-y_i)\ln(1-\sigma(\theta^Tx_i)) \end{equation}
  3. Take the gradient and the Hessian of the function above. Try to derive the results we saw in class yourself.
  4. Implement gradient descent and Newton's method for the above.