In this chapter, we discuss the extension of the EM (*Expectation
Maximization*) algorithm for joint density estimation to conditional
density estimation and derive the resulting CEM (*Conditional
Expectation Maximization*) algorithm. The machine learning system that
is central to the Action-Reaction Learning paradigm requires
conditional densities to model interaction. Thus, the optimization
techniques described in the preceding chapter will be called upon to
formally derive the necessary algorithm. The use of CEM is
demonstrated for a specific probabilistic model, a conditioned mixture
of Gaussians, and implementation and machine learning issues are
discussed.

In joint density estimation, the tried and true EM (expectation maximization) algorithm proceeds by maximizing likelihood over a training set. By introducing the concept of missing data, the algorithm breaks down what would be a complex density optimization into a two-step iteration. The missing unknown data components are estimated via the E-step and then a far simpler maximization over the complete data, the M-step, is performed. This is in fact identical to the iteration described earlier for General Bound Maximization. The E-step basically computes a bound (via Jensen's inequality) for the likelihood and the M-step maximizes that bound. Iteration of this process is guaranteed to converge to a local maximum of likelihood.

Recall the maximum likelihood joint density estimation (ML) case in
Equation 5.2. Imagine we are faced with such a joint
density optimization task and need to compute an optimal
over
sets
and
of *N* points as in
Equation 7.1.

The EM algorithm is an iterative optimization which begins seeded at an initial starting point in the model space ( ). Instead of maximizing the expression in Equation 7.1, one can instead maximize its iterative form in Equation 7.2 where is the algorithm's estimate for a better model from the previous estimate .

The density
in general is not
simple and is often better described by the discrete or continuous
summation of simpler models. This is always true when data is
missing. In other words, it is a mixture model as in
Equation 7.3. The summation is performed over what
is typically called the 'missing components' (
or *m*) which
are individually more tractable. Thus, the original rather complex pdf
can be considered to be a linear integration or summation of simpler
pdf models.