The EM algorithm, in essence, uses Jensen's inequality to find a lower bound to the above quantity by taking advantage of the linear superposition qualities of the individual probability models. The log likelihood (the function being maximized) is manipulated using this simple bounding technique in Equation 7.4 [5] [73]. Jensen's inequality lower bounds the logarithm of the sum to make it easier to handle. The fundamental result is that the logarithm is applied to each simple model individually instead of applying to the whole sum. It is then straightforward to compute the derivatives with respect to and set those equal to zero to do the maximization.

Let us now consider the case when we are estimating a conditional
density [47]. Conditioning the discrete mixture model in
Equation 7.3 results in
Equation 7.5. This is a conditional
density with missing data *m* which has been re-written as a joint
density with missing data *m* over a marginal density with missing
data *m*. We shall show that this missing data problem can not be
solved using the standard EM approach.

The conditional log-likelihood utilizes this density model and is evaluated as in Equation 7.6. This is also an incremental expression ( ) indicating that we are maximizing the improvement the model undergoes with each iteration (versus directly maximizing the model itself).

Note that the above expression is the difference between the
logarithms of two ratios (*R*_{j} and *R*_{m}). *R*_{j} is the ratio of the
new joint over old joint and *R*_{m} is the ratio of the new marginal
over the old marginal. In maximum likelihood however, only the first
term is present (*R*_{j}). We could obtain a lower bound for it using
Jensen's inequality. Since the log function is concave, Jensen will
lower bound the log of a sum or expectation. However, in the maximum
conditional likelihood situation, we subtract the second term (the
ratio of marginals). Jensen can provide a lower bound for this second
term however it is being subtracted. Since subtracting the second
terms indicates that we are using the negative of the lower bound it
instead acts as an upper bound. Thus, we will not be bounding
conditional likelihood if we apply Jensen's inequality to the second
term and can no longer guarantee convergence. Thus, let us refrain and
only apply it to the first term ()
as in
Equation 7.7.

As a side note we can see that the conditional density is tricky
because it is the joint divided by the marginal (over *x*). For a
mixture model, the numerator (the joint) is a linear combination of
multiple models. However, the marginal is also a linear
combination of models and it resides in the denominator. Thus, linear
superposition occurring in joint mixture models does not hold when they
are conditioned and things become more unmanageable. In fact, it seems
that we simply can not pull the summation outside of the logarithm in
the above expression to end up with an elegant maximization step.

At this point, if we were only considering EM approaches, we would typically resort to a Generalized Expectation Maximization approach (GEM) [15] which involves some gradient ascent techniques to make up for the inapplicability of EM. For instance, the Iteratively Re-Weighted Least Squares algorithm does such an operation in the Mixtures of Experts [31] [30] architecture. The estimation uses EM on a piece of the problem and then resorts to a gradient ascent approach in an inner loop which can not be posed as a typical EM iteration. This is still better than performing the whole conditional likelihood maximization completely with gradient ascent (as proposed by Bishop [5]). This is due to the extra convergence properties the EM-bound provides on top of the inner gradient ascent steps. However, it is not necessary to, dare we say, throw in the towel and resort to gradient ascent quite yet if more bounding techniques (i.e. other than EM and Jensen) can be first investigated. We shy away from the gradient ascent technique since it is often plagued with ad hoc step size estimation, slow non-monotonic convergence, and susceptibility to local maxima.

Recall earlier that we discussed the existence of numerous bounds for
upper and lower bounding. It is important, in addition, that the bound
chosen makes contact with the desired function at the current
operating point. In the above case, the so-called current operating is
,
the previous state of the model we are
optimizing. When we are at our current operating point
(i.e.
), the ratio of the marginals between
and
must equal 1. Thus, the logarithm of
the *R*_{m} ratio in the expression above *can* be upper bounded and
thus the negative of the logarithm will be lower bounded. Many
functions can used to upper bound the logarithm as shown again in
Figure 7.1. The choices of the bounds include
the following functions: *x*-1,
,
and
*e*^{x}-1. We choose the tightest convex upper bound on the log which is
*x*-1. It also seems the simplest to manipulate. This manipulation
removes the cumbersome logarithm of a summation and essentially
decouples the sub-models that are added inside. We end up with
Equation 7.8 which is much easier to manipulate. In
fact, for our purposes, this expression will act as the analog of the
function for the EM algorithm.

At this point, we re-write a conditioned mixture model expression
without *any* joint density models in
Equation 7.9. This is similar to the Mixture of
Experts notation [31] where many conditional models (called
experts) are gated by marginal densities. This decomposition
introduces an important flexibility in the conditional density: the
marginal densities or gates need not integrate to one and can merely
be arbitrary positive functions *f*() instead of probability densities
*p*(). This generalization does not violate the form of the
conditional density in any way and only indicates that the gates need
not be pdfs and need not integrate to 1 or any finite number. Thus,
the gates can diverge, have discontinuities, grow towards infinity and
so on (as long as they remain positive) and still maintain a legal
conditional density. In effect, the conditional density
parametrization can be manipulated more flexibly if it is not written
with joint densities and marginals.

Thus, Equation 7.8 can also be written as in
Equation 7.10 in the experts-gates formalism. This
formalism indicates that the parameters in the experts can often be
re-written independently of the parameters of the gates. In other
words, the conditional densities and the marginal densities (in
appropriate coordinates) can be expressed with separate degrees of
freedom. This allows us to decouple the conditional likelihood
maximization further. The function *Q* here is a general bound for any
conditional density. Thus, for the CEM algorithm, the computation of
the *Q* function here is effectively the CE or Conditional Expectation
step. The M-step, or Maximization, is specific to the density we are
currently considering.

Although we are bounding conditional densities, note that we are also
abiding by the convention that the marginal densities can be expressed
as functions *f* since the integration to 1 over the
domain
is not a requisite for conditional densities. Of course, we can, at
any time, re-normalize the gates and express proper joint
densities. Equation 7.11 depicts a more practical
expression for the *Q* function where the computations of the CE step
are completed and are used numerically. The M-step then involves
taking derivatives of the *Q* function or applying further bounds to
maximize the current *Q*. We consider a specific cases where the
densities come from a particular family of models (i.e. conditioned
Gaussian mixture models [63]) and see how the bounding can be applied
further. This will be the M-step for the Gaussian mixture
case. Similar derivations can be performed for other families.