Once again, the contact point for this parabola (i.e. the
)
is the value for the previous model
or
the mean of the gate before any iteration. Let us denote the previous
parameters of the gate as
and
.
To
facilitate the derivation, we can perform what is commonly called a
*whitening* operation on the coordinate system for each gate. For
each gate, we can consider that its previous parameters where
zero-mean and identity covariance. This is done by applying an affine
transformation to the -space (i.e. shifting the data points
appropriately). The data is translated by the mean and
scaled by the inverse of the square root of the covariance
matrix. In theory, nothing changes except that each gate observes
affinely displaced data instead and it then seems to have a zero mean
and identity covariance as its previous operating point. This
whitening operation does not need to be done explicitly
(i.e. numerically) but helps simplify the following derivation for the
bound.

Thus, without loss of generality, we consider that each data point has been appropriately whitened for each gate. The result is that and . Using this assumption, we can compactly solve for the parameters of the bounding parabola ( ). This is done by meeting the three requirements (contact, tangentiality and tightness). The results is shown in Equation 7.23.

As shown earlier, if a globally valid bound is desired, the value of
*w*_{im} must always be greater than the expression in
Equation 7.23 no matter what the value of
.
Thus we can find the minimum *w*_{im} (i.e. *wmin*_{im})
for which this is true and form a parabolic bound for the *Q* function
that is as 'tight' as possible. We shall now determine an efficient
way to compute *wmin*_{im} without searching the high dimensional
space of
each iteration.

Note in the expression for *w*_{im} above that
is used
only in an inner product with itself (
)
or in
an inner product with the whitened data point
.
Thus, we can represent these two quantities
regardless of the dimensionality of the vector space using at most two
degrees of freedom: the magnitude of
and the
cosine of the angle between the vectors
and .
Let us now denote the new variables as follows:
is
the magnitude of ,
is the magnitude of and
is the cosine of the angle between the two
vectors. We then compactly rewrite the bound as in
Equation 7.24.

The *w*_{im} is now lower bounded by the right hand side which is a
function of two free scalars
and .
We wish to find
the maximum of the lower bound at some value of
and .
By taking derivatives of the right hand side with respect to and setting to 0, we find an extremum of the function at
.
However, we also note that
and so
.
If we set
we note that the this solution
is *always* greater than the
.
So,
the extremum we solved for was a *minimum*. Thus, the maximum
occurs at the ends of the interval so
.
Therefore, we
have the relationship
.
In other words, let
us define
.
The expression for *w*_{im} then
simplifies into a function of 1 free variable (*c*) as in
Equation 7.25.

For each value of
we find the value of *c* which maximizes the
value of .
This is done offline (taking a few seconds)
using Brent's method and the result is stored in a 1D lookup table
(LUT). The values are shown in Figure 7.3(a) where we
plot the max of the *g* function with respect to .
Using this
table, we can immediately compute the minimum *wmin*_{im} by looking
up the corresponding value of
for the current data point and
subsequently compute the rest of the parabolic bound's parameters.

An example of a parabolic bound is shown in
Figure 7.3(b) under a sub-component of the *Q*function. This sub-component corresponds to a single data point as we vary
one gate's uni-variate mean. Many such parabolic bounds are computed
(one for each data point and gate combination) and are summed to bound
the *Q* function in its entirety.

- Parabolic Piece-Wise Bound
For greater computational efficiency we compute the gradient of the

*Q*on the mean to get an idea of the*direction*of steepest ascent. We can then know which general direction will move towards for the next iteration. Due to whitening, starts off at the origin and moves from there. Thus, we know if is moving close towards or farther away from it. Equivalently, we know if*c*will be positive or negative. The above LUT (Figure 7.3(a)) returns the*wmin*_{im}which is valid for all possible choices of*c*. However, if we know the direction*c*is going to move along, we only need to consider half of the possibilities in using Brent's method to find the max of . Thus, we obtain two LUTs for*c*positive and for*c*negative as shown in Figure 7.4(a). This typically gives us an even tighter bound since we only need to bound the*Q*function in regions of the landscape where the locus of the maximum could move to. Thus, we are effectively approximating the landscape with two parabolas, which are cut at*c*=0. One is a bound for the values at*c*<0 and one for*c*>0. Figure 7.4(b) depicts this piece wise parabolic bound on a sub-component of*Q*. The function being bounded is represented with a continuous line. One parabola is composed of 'x' symbols and is a valid bound to the right of zero. The other parabola is composed of 'o' symbols and is valid to the left of zero. The piece-wise bound is composed of these two parabolas spliced at the origin (their continuations are shown as dotted lines for visualization). Thus, if the gradient tells us that we will be moving in a direction of increasing , we should compute the parabola made of 'x' symbols. However, if we are moving to the left (because of the sum total influence of the other bounds and functions), we should use the parabola made of 'o' symbols which is a wider and tighter bound.

To obtain a final answer for the update of the gate means
we
simply maximize the sum of parabolas which bounds the *Q* in its
entirety. For *M* gates and *N* data points, we obtain
parabolas each with its own (
). The
update is in Equation 7.26.