Recently, Hinton [23] and others have proposed strong arguments
for using model-based approaches in classification problems. However,
in certain situations, the advantages of model-based approaches pale in
comparison with the performance of discriminative models optimized for a
given task. The following list summarizes some advantages of
generative models and joint density estimation for the purposes of both
classification *and* regression problems.

- Better Inference Algorithms
Generative models and joint densities can be estimated using reliable techniques for maximum likelihood and maximum a posteriori estimation. These joint density estimation techniques include the popular EM algorithm and typically outperform gradient ascent algorithms, which are the workhorses of conditional density problems (e.g., in many neural networks). The EM algorithm provably converges monotonically to a local maximum of the likelihood and is often more efficient than gradient ascent.

- Modular Learning
In a generative model, each class is learned individually and only considers the data whose labels correspond to it. The model does not focus upon inter-model discrimination and avoids considering the data as a whole. Thus the learning is simplified and the algorithms proceed faster.

- New Classes
It is possible to learn a new class (or retrain an old class) without updating the models of previously learned classes in a generative model, since each model is trained in isolation. In discriminative models, the whole system must be retrained since inter-model dynamics are significant.

- Missing Data
Unlike conditional densities (discriminative models), a joint density or generative model is optimized over all dimensions of the data and thus models all the relationships between the variables in a more equal manner. Thus, if some of the data that was expected to be observed for a given task is missing, a joint model's performance will degrade gracefully. Conditional models (or discriminative models) are trained for a particular task, and thus a different model must be trained for each missing-data task. Ghahramani et al. [21] point out the exponential growth in the number of models needed if one requires an optimal discriminative system for each possible task.

- Rejection of Poor or Corrupt Data
Sometimes, very poor data may be fed into the learning system; a generative model has the ability to detect this corrupt input (for example, by its low likelihood under the model) and possibly signal the user to take some alternate measure.
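The first advantage above, the reliability of EM, can be illustrated concretely. The following is a minimal sketch of EM for a one-dimensional, two-component Gaussian mixture (the function name and initialization scheme are illustrative choices, not prescribed by the text); the log-likelihood trace it returns is guaranteed to be non-decreasing, which is the monotone convergence property cited above.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture.

    E-step: compute posterior responsibilities of each component.
    M-step: closed-form weighted maximum-likelihood updates.
    Each iteration can only increase the data log-likelihood.
    """
    # deterministic initialization: means at the data extremes
    mu = np.array([x.min(), x.max()])
    var = np.full(2, x.var())
    pi = np.full(2, 0.5)
    ll_trace = []
    for _ in range(iters):
        # E-step: per-point, per-component weighted densities
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        ll_trace.append(np.log(dens.sum(axis=1)).sum())
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the responsibilities
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi, ll_trace

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])
mu, var, pi, ll = em_gmm_1d(x)
# the log-likelihood trace is monotonically non-decreasing
monotone = all(b >= a - 1e-7 for a, b in zip(ll, ll[1:]))
```

Note that no step size or learning rate appears anywhere: each M-step is a closed-form maximization, which is precisely why such estimators require no re-initialization tuning.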

It is important to note that the situations underlying the last two advantages occur infrequently in many applications, and this is the expected situation for the ARL framework. Typically, the system is called upon to perform the task it was trained for, so the benefits of its superior handling of occasional missing or poor data might rarely be noticed. In fact, on most standardized databases, performance on the desired task will often be orders of magnitude more critical because missing or corrupt data is so infrequent.

The second and third advantages involve computational efficiency since discriminative or conditional models need to observe all data to be optimally trained. However, the need to observe all the data is not a disadvantage but truly an advantage. It allows a discriminative model to better learn the interactions between classes and their relative distributions for discrimination. Thus, as long as the discriminative model is not too computationally intensive and the volume of data is tractable, training on all the data is not a problem.

The most critical motivation for generative models in regression problems is actually the first advantage: the availability of superior inference algorithms (such as EM [15]). Typically, the training process for discriminative models (i.e., conditional densities) is cumbersome (e.g., neural network backpropagation and gradient ascent) and somewhat ad hoc, requiring many re-initializations to converge to a good solution. However, tried-and-true algorithms for generative models (joint density estimation) avoid this and consistently yield good joint models.

In fact, *nothing* prevents us from using both a generative model and a
discriminative model. Whenever the regular task is required and data
is complete and not corrupt, one uses a superior discriminative
model. Whenever missing data is observed, a joint model can be used to
"fill it in" for the discriminative model. In addition, whenever
corrupt data is possible, a *marginal* model should be used to
filter it (which is better suited to this task than either a joint or
a conditional model). However, in this case, the bulk of the learning
system's work will still be performed by the conditional model.
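The fill-in step above can be sketched as follows. This is a minimal illustration assuming the joint model is a single Gaussian over the features (the helper `fill_in` is a hypothetical name, not from the text): the conditional mean E[x_mis | x_obs] under that Gaussian is used to impute the missing feature before it is handed to the discriminative model.

```python
import numpy as np

rng = np.random.default_rng(0)
# training data: two correlated features (x2 tracks 0.8 * x1)
n = 1000
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + rng.normal(0, 0.6, n)
X = np.column_stack([x1, x2])

# joint (generative) model: a single Gaussian over all features
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

def fill_in(x_obs, obs_idx, mis_idx):
    """Impute missing features with the conditional mean
    E[x_mis | x_obs] of the joint Gaussian:
    mu_m + S_mo S_oo^{-1} (x_obs - mu_o)."""
    s_oo = cov[np.ix_(obs_idx, obs_idx)]
    s_mo = cov[np.ix_(mis_idx, obs_idx)]
    return mu[mis_idx] + s_mo @ np.linalg.solve(s_oo, x_obs - mu[obs_idx])

# at test time x1 = 1.0 is observed and x2 is missing;
# the completed vector would then be passed to the conditional model
x2_hat = fill_in(np.array([1.0]), [0], [1])
```

The discriminative model itself is untouched: the joint model acts only as a front end that completes the input vector, exactly the division of labor described above.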

We now outline specific advantages of conditional models and discuss our approach to correct one of their major disadvantages: poor inference algorithms.

- Management of Limited Resources
Conditional or discriminative models utilize resources exclusively for accomplishing the task of estimating output from input observations. Thus, the limited resources (complexity, structures, etc.) will be devoted entirely to this purpose and not squandered on irrelevant features that give no discrimination power.

- Simple Discrimination of Complex Generative Models
It is often the case that complex joint (generative) models can be easily separated by simple decision boundaries as in Figure 5.1. There is no need here to model the intrinsically complex phenomena themselves when it is so simple to discriminate the two different classes with two linear boundaries.

- Feature Selection
Conditional models by default do not need features that are as well chosen as joint models do. Since spurious features will not help the discriminative model compute its output, they are effectively ignored. A generative model might waste modeling power on these features even though they ultimately offer no discrimination power.

- Better Conditional Likelihood on Test Data
Typically, a learning system is trained on training data and tested on test data. Since in testing (for either joint or conditional models) we are always evaluating conditional likelihood (i.e., the probability of guessing the correct class or the right output), it is only natural that a model which optimizes this ability on training data will do better when tested (unless overfitting occurs).
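The second advantage in the list above can be made concrete with a toy version of the situation in Figure 5.1. In the following sketch (the data layout and the nearest-mean classifier are illustrative assumptions, not taken from the figure), each class is a two-component mixture, so a faithful generative model would need four Gaussians; yet a single linear boundary separates the classes almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# each class is a two-component mixture: a "complex" generative model
c0 = np.vstack([rng.normal([-3, -2], 0.5, (n, 2)),
                rng.normal([-3,  2], 0.5, (n, 2))])
c1 = np.vstack([rng.normal([ 3, -2], 0.5, (n, 2)),
                rng.normal([ 3,  2], 0.5, (n, 2))])
X = np.vstack([c0, c1])
y = np.r_[np.zeros(2 * n), np.ones(2 * n)]

# ...yet a single linear boundary (here, nearest class mean, which
# induces a hyperplane through the midpoint of the means) suffices
m0, m1 = c0.mean(axis=0), c1.mean(axis=0)
w = m1 - m0                      # normal vector of the hyperplane
b = 0.5 * (m0 + m1) @ w          # boundary passes through the midpoint
pred = (X @ w > b).astype(float)
acc = (pred == y).mean()
```

The discriminative rule needs only two parameters' worth of structure (a direction and a threshold), while a generative description of the same data would have to model all four mixture components.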