We now describe the implementation of the CEM algorithm as an iterative process. The typical time per iteration is comparable to that of a regular mixture of Gaussians model being updated with EM. This is partly due to the decomposition of the variables into *x* and *y*. EM uses a large covariance matrix over both variables concatenated together, so the matrix inversion in the EM loop is of higher dimensionality than the separate CEM inversions over *x* and *y*. In high dimensions, where matrix inversion becomes a serious computational cost, the CEM algorithm can therefore outperform EM in per-iteration time. Some results are shown on 2D problems that can be visually verified; tests at higher dimensionality were also conducted and are reported as well.
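As an illustrative sketch (not the thesis implementation), the dimensionality advantage can be seen in how a joint Gaussian over (*x*,*y*) is conditioned on *x*: only the *x*-block of the covariance needs inverting, rather than the full joint matrix that EM manipulates. The function name and layout below are assumptions for illustration.

```python
import numpy as np

def condition_gaussian(mu, cov, dx, x):
    """Condition a joint Gaussian N(mu, cov) over (x, y) on an observed x.

    dx is the dimensionality of x; the remaining dimensions are y.
    Only the dx-by-dx x-block of the covariance is inverted, rather
    than the full (dx+dy)-by-(dx+dy) joint covariance used by EM.
    """
    mu_x, mu_y = mu[:dx], mu[dx:]
    S_xx = cov[:dx, :dx]
    S_yx = cov[dx:, :dx]
    S_yy = cov[dx:, dx:]
    S_xx_inv = np.linalg.inv(S_xx)                 # small dx-by-dx inversion
    mean_y = mu_y + S_yx @ S_xx_inv @ (x - mu_x)   # conditional mean of y
    cov_y = S_yy - S_yx @ S_xx_inv @ S_yx.T        # conditional covariance
    return mean_y, cov_y

# For a 2D joint Gaussian with correlation 0.5, conditioning on x = 1
# gives a 1D Gaussian over y with mean 0.5 and variance 0.75.
mu = np.zeros(2)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
mean_y, cov_y = condition_gaussian(mu, cov, 1, np.array([1.0]))
```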

We assume that the conditional model is initialized using a pseudo-random method that places the initial Gaussians with some coverage of the data. From there, the algorithm proceeds in an iterative loop as described in Table 7.1.

Figure 7.7 depicts a set of 2D data points with coordinates
(*x*,*y*). This data has been sampled from 4 Gaussians with circular
covariance. The objective is to fit a conditional model *p*(*y*|*x*) to
the above data with a conditioned mixture of Gaussians model with *only
2 Gaussians*. Thus, the algorithm cannot simply assign one Gaussian per
cluster but must exploit some symmetric structure in the problem to
find a more compact probabilistic model. This is a clear problem of
limited resources, where we would like to maximize performance at
guessing the output *y* from the input *x* and do not have the luxury
of modeling the whole generative process (i.e. with 4 Gaussians).
Information about the particular task is therefore critical here.
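The exact means used in Figure 7.7 are not given here, so the following sketch merely generates data of the kind described: samples from 4 circular Gaussians placed symmetrically, so that *y* is bimodal given *x*. All numeric values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric layout: two clusters at each x location, so
# p(y|x) is bimodal in y and 2 skewed Gaussians can cover all 4 clusters.
means = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
sigma = 0.2        # circular (isotropic) covariance
n_per = 50         # samples per cluster
data = np.vstack([m + sigma * rng.standard_normal((n_per, 2))
                  for m in means])
x, y = data[:, 0], data[:, 1]
```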

In Figure 7.8 we note the maximum conditional likelihood
solution, which places the Gaussians in a skewed position to
regress the *y* data given an *x* value. Note how the conditioned
Gaussians give no information about *x*, since CEM expects it to be
given. In Figure 7.9 we see another solution, where CEM
was trapped in a slightly inferior local maximum (conditional
log-likelihood = -3.166).

Figure 7.10 shows the evolution of the EM algorithm on the
exact same data set. We again plot the conditional log-likelihood and
see that it is *not* being maximized (as expected), while the joint
log-likelihood is. However, since this is a conditional problem
between the *x*-input and the *y*-output, the algorithm is not
optimizing the correct task and is not being discriminative. The
Gaussians are thus wasted modeling *x* and do not recover the bimodal
distribution in *y*. The conditional log-likelihood is also
dramatically inferior (-3.6).
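The conditional log-likelihood compared above can be evaluated for any joint mixture of Gaussians via log *p*(*y*|*x*) = log *p*(*x*,*y*) − log *p*(*x*). A minimal numpy-only sketch (the function names are my own, not the thesis code):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log of a sum of exponentials."""
    a = np.asarray(a)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def gauss_logpdf(v, mu, cov):
    """Log-density of a multivariate Gaussian at v."""
    d = len(mu)
    diff = v - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def cond_loglik(X, Y, pis, mus, covs, dx):
    """Average log p(y|x) = log p(x,y) - log p(x) under a joint mixture.

    X, Y are lists of input/output vectors; pis, mus, covs are the
    mixture weights, joint means, and joint covariances; dx is the
    dimensionality of x within each joint vector.
    """
    total = 0.0
    for xi, yi in zip(X, Y):
        v = np.concatenate([xi, yi])
        joint = [np.log(p) + gauss_logpdf(v, m, c)
                 for p, m, c in zip(pis, mus, covs)]
        marg = [np.log(p) + gauss_logpdf(xi, m[:dx], c[:dx, :dx])
                for p, m, c in zip(pis, mus, covs)]
        total += logsumexp(joint) - logsumexp(marg)
    return total / len(X)
```

As a sanity check, for a single standard Gaussian component p(*y*|*x*) = N(*y*; 0, 1), so the conditional log-likelihood at (*x*,*y*) = (0,0) is −0.5 log(2π) ≈ −0.919.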

When initialized with a solution of the EM algorithm, the CEM algorithm typically moves to another solution. This solution often has a higher conditional likelihood than the EM solution it started from, but a lower one than CEM achieves from its own initialization. This indicates that local maxima of the joint likelihood are not necessarily maxima of the conditional likelihood.