Next: Placing and Tightening the Up: Optimization - General Bound Previous: Quadratic Bounds

Maximizing with Bounds: Beyond Gradient Ascent

Now consider the use of a bound such as the one proposed above. We have an analytically specified function, f, which we are trying to optimize starting from an initial point x^* = x₁. Figure 6.1 depicts the bound and the function to be optimized. Note the contact point where the function meets the bound is at x^*.

**Figure 6.1:** A Parabolic Bound on an Arbitrary Function
$\begin{figure}\center \begin{tabular}[b]{c} \epsfysize=1in \epsfbox{bound1.eps} \end{tabular}\end{figure}$

We can now find the maximum of that bound and move the contact point x^* to this new locus on the function. This point is now x₂ and is used to find another parabola which lies below the function f and touches it at our current operating point, x₂. As this process is iterated, we gradually converge to a maximum of the function. This process is illustrated in Figure 6.2. At the function's maximum, however, the only parabola we can draw which is a lower bound is one whose peak is at the function's maximum (i.e. a 'fixed' point). The algorithm basically stops there. Thus, there is a natural notion of a fixed point approach. Such an algorithm will always converge to a local maximum and stay there once locked on.

**Figure 6.2:** Bound Maximization Convergence
$\begin{figure}\center \begin{tabular}[b]{c} \epsfysize=1in \epsfbox{bound2.eps} \end{tabular}\end{figure}$

In the example, the algorithm skipped out of a local maximum on its way to its destination. This is not due to the fact that the local maximum is lower but due to the some peculiarity of the bounding approach and its robustness to very local attractor. The bound is shaped by the overall structure function which has a wider and higher local maximum at x^final. We will discuss this trade-off between the amplitude of a local maximum and its robustness (width of the basin) later on and their relationship to annealing and global optimization.

Note how these bounding techniques differ from gradient ascent which is a first order Taylor approximation. A second order Taylor approximation is reminiscent of Newton and Hessian methods which contain higher order derivative data. However, both gradient ascent and higher order approximations to the functions are not lower bounds and hence using these to maximize the function is not always guaranteed to converge. Basically gradient ascent is a degenerate case of the parabolic bound approach where the width of the parabola is set to infinity (i.e. a straight line) and the step size chosen in an ad hoc manner by the user (i.e. infinitesimal) instead of in a more principled way. A parabolic bound's peak can be related to its width and if this value can be properly estimated, it will never fail to converge. However, selection of an arbitrary step size in gradient ascent can not be guaranteed to converge. Observe Figure 6.3 which demonstrates how an invalid step size can lead gradient approaches astray. In addition Figure 6.4 demonstrates how higher order methods can also diverge away from maxima due to adhoc step size. The parabola estimated using Taylor approximations is neither an upper nor a lower bound (rather it is a good approximation which is a different quality altogether). Typically, these techniques are only locally valid and may become invalid for significantly large step sizes. Picking the maximum of this parabola is not provably correct and again an ad hoc step size constraint is needed.

**Figure 6.3:** Gradient Ascent
$\begin{figure}\center \begin{tabular}[b]{c} \epsfysize=1in \epsfbox{bound3.eps} \end{tabular}\end{figure}$

**Figure 6.4:** 2nd Order Approximations
$\begin{figure}\center \begin{tabular}[b]{c} \epsfysize=1in \epsfbox{bound4.eps} \end{tabular}\end{figure}$

However, gradient and Taylor approximation approaches have remained popular because of their remarkable ease of use and general applicability. Bound techniques (such as the Expectation Maximization or EM algorithm [15]) have not been as widely applicable. This is because almost any analytic function can be differentiated to obtain a gradient or higher order approximation. However, we shall show some important properties of bounds which should illustrate their wide applicability and usefulness. In addition, some examples of nonlinear optimization and conditional density estimations will be given as demonstrations.

Next: Placing and Tightening the Up: Optimization - General Bound Previous: Quadratic Bounds

Tony Jebara
1999-09-15