
Conditional Density Estimation

Having obtained a $p({\bf z} \vert {\cal Z} )$ or, more compactly, a $p({\bf z})$, we can compute the probability of any point ${\bf z}$ in the vector space. However, evaluating the pdf in this manner is not necessarily the ultimate objective. Often, some components of the vector are given as input (${\bf x}$) and the learning system is required to estimate the missing components as output (${\bf y}$). In other words, ${\bf z}$ can be broken up into two sub-vectors ${\bf x}$ and ${\bf y}$, and a conditional pdf is computed from the original joint pdf over the whole vector as in Equation 5.3. This conditional pdf is $p({\bf y} \vert {\bf x})^j$, with the $j$ superscript indicating that it is obtained from the previous estimate of the joint density. When an input ${\bf x}'$ is specified, this conditional density becomes a density over ${\bf y}$, the desired output of the system. This density is the required function of the learning system, and if a final output estimate ${\bf \hat{y}}$ is needed, the expectation or arg max can be found via Equation 5.4.

\begin{displaymath}p({\bf y} \vert {\bf x})^j
= \frac{ p({\bf z}) }{ \int p({\bf z}) \: d{\bf y} }
= \frac{ \int p({\bf x},{\bf y} \vert\Theta) \: p(\Theta \vert {\cal X},{\cal Y}) \: d\Theta }
       { \int p({\bf x} \vert\Theta) \: p(\Theta \vert {\cal X},{\cal Y}) \: d\Theta }
\end{displaymath} (5.3)
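Numerically, the conditioning step in Equation 5.3 is just division of the joint by its marginal over ${\bf y}$. The following sketch illustrates this on a small discrete grid; the joint values are random placeholders, not quantities from the text.

```python
import numpy as np

# Toy discrete joint p(x, y) on a 3x4 grid; the entries are hypothetical,
# chosen only to illustrate Equation 5.3's conditioning step.
rng = np.random.default_rng(0)
joint = rng.random((3, 4))
joint /= joint.sum()                 # normalize into a valid joint pmf

def condition_on_x(joint, x_idx):
    """p(y | x) = p(x, y) / sum_y p(x, y): the discrete analogue of Eq. 5.3."""
    row = joint[x_idx]
    return row / row.sum()

p_y_given_x = condition_on_x(joint, x_idx=1)
# The resulting slice is a proper density over y: nonnegative, sums to 1.
```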

$\displaystyle {\bf\hat{y}} = \left\{
\begin{array}{l}
\arg\max_{\bf y} \: p({\bf y} \vert {\bf x}') \\
\int {\bf y} \: p({\bf y} \vert {\bf x}') \: d{\bf y}
\end{array}\right.$     (5.4)
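The two point estimates in Equation 5.4 can disagree, especially for skewed or multimodal conditionals. A minimal sketch, using an assumed discrete $p({\bf y} \vert {\bf x}')$:

```python
import numpy as np

# Hypothetical conditional density over a discrete set of y values,
# illustrating the two point estimates offered by Equation 5.4.
y_values = np.array([0.0, 1.0, 2.0, 3.0])
p_y_given_xprime = np.array([0.1, 0.2, 0.6, 0.1])   # assumed p(y | x')

y_map = y_values[np.argmax(p_y_given_xprime)]       # arg-max (mode) estimate
y_mean = np.sum(y_values * p_y_given_xprime)        # expectation estimate

# Here the mode is 2.0 while the mean is 0.0*0.1 + 1.0*0.2 + 2.0*0.6 + 3.0*0.1 = 1.7,
# so the choice of estimator matters even in this simple case.
```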

Obtaining a conditional density from the unconditional (i.e. joint) probability density function in such a roundabout way can be shown to be suboptimal. However, it remains popular and convenient, partly because of the availability of powerful techniques for joint density estimation (such as EM).

If we know a priori that we will need the conditional density, it is evident that it should be estimated directly from the training data. Direct Bayesian conditional density estimation is defined in Equation 5.5. The vector ${\bf x}$ (the input or covariate) is always given, and ${\bf y}$ (the output or response) is to be estimated. The training data is, of course, also explicitly split into the corresponding ${\cal X}$ and ${\cal Y}$ vector sets. Note here that the conditional density is referred to as $p({\bf y} \vert {\bf x})^c$ to distinguish it from the expression in Equation 5.3.

$\displaystyle \begin{array}{ll}
p({\bf y} \vert {\bf x})^c
& = p({\bf y} \vert {\bf x}, {\cal X},{\cal Y}) \\
& = \int p({\bf y}\vert{\bf x},\Theta^c) \:
p(\Theta^c \vert {\cal X},{\cal Y}) \: d\Theta^c
\end{array}$     (5.5)
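The integral in Equation 5.5 can be approximated by Monte Carlo: draw parameter samples from the posterior and average the conditional model over them. In this sketch both the model (a linear-Gaussian conditional $y \sim N(\theta^c x, 1)$) and the posterior over $\theta^c$ are assumptions made purely for illustration.

```python
import numpy as np

# Monte Carlo sketch of Equation 5.5: average p(y | x, theta_c) over samples
# from an assumed posterior p(theta_c | X, Y). The posterior N(2.0, 0.1^2)
# and the linear-Gaussian conditional model are hypothetical choices.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=2.0, scale=0.1, size=5000)

def p_y_given_x_theta(y, x, theta, sigma=1.0):
    """Gaussian conditional density N(y; theta * x, sigma^2)."""
    return np.exp(-0.5 * ((y - theta * x) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_query, y_query = 1.5, 3.0
# p(y | x)^c  ~=  (1/S) * sum_s p(y | x, theta_s)
predictive = p_y_given_x_theta(y_query, x_query, posterior_samples).mean()
```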

Here, $\Theta^c$ parametrizes a conditional density $p({\bf y}\vert{\bf x})$. $\Theta^c$ is exactly the parametrization of the conditional density $p({\bf y}\vert{\bf x})$ that results from the joint density $p({\bf x},{\bf y})$ parametrized by $\Theta$. Initially, it seems intuitive that the above expression should yield exactly the same conditional density as before. It seems natural that $p({\bf y} \vert {\bf x})^c$ should equal $p({\bf y} \vert {\bf x})^j$, since $\Theta^c$ is just the conditioned version of $\Theta$. In other words, if the expression in Equation 5.1 is conditioned as in Equation 5.3, then the result should be identical to Equation 5.5. This conjecture is wrong.

Upon closer examination, we note an important difference. The $\Theta^c$ we are integrating over in Equation 5.5 is not the same as the $\Theta$ in Equation 5.1. In the direct conditional density estimate (Equation 5.5), $\Theta^c$ only parametrizes a conditional density $p({\bf y}\vert{\bf x})$ and therefore provides no information about the density of ${\bf x}$ or ${\cal X}$. In fact, we can regard the conditional density parametrized by $\Theta^c$ as just a function over ${\bf x}$ with some parameters, and can therefore essentially ignore any relationship it could have to some underlying joint density parametrized by $\Theta$. Since this is only a conditional model, the term $p(\Theta^c \vert {\cal X},{\cal Y})$ in Equation 5.5 behaves differently than the similar term $p(\Theta \vert {\cal Z}) = p(\Theta \vert {\cal X},{\cal Y})$ in Equation 5.1. This is illustrated in the manipulation involving Bayes rule shown in Equation 5.6.

$\displaystyle \begin{array}{ll}
p(\Theta^c \vert {\cal X},{\cal Y})
& = \frac{ p({\cal Y},{\cal X} \vert \Theta^c) \: p(\Theta^c) }
         { p ({\cal X},{\cal Y}) } \\
& = \frac{ p({\cal Y} \vert {\cal X},\Theta^c) \: p({\cal X} \vert \Theta^c) \: p(\Theta^c) }
         { p ({\cal X},{\cal Y}) } \\
& = \frac{ p({\cal Y} \vert {\cal X},\Theta^c) \: p({\cal X}) \: p(\Theta^c) }
         { p ({\cal X},{\cal Y}) }
\end{array}$     (5.6)
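A practical consequence of the final line of Equation 5.6 is that, after normalization, the posterior over $\Theta^c$ is proportional to $p({\cal Y} \vert {\cal X},\Theta^c) \, p(\Theta^c)$ alone; the $p({\cal X})$ factor is a constant that cancels. A toy check with a discrete parameter grid (all numbers below are hypothetical):

```python
import numpy as np

# Discrete grid of candidate conditional parameters theta_c with an assumed
# prior and an assumed conditional likelihood p(Y | X, theta_c).
thetas = np.array([0.5, 1.0, 1.5])
prior = np.array([0.2, 0.5, 0.3])             # assumed p(theta_c)
lik_Y_given_X = np.array([0.01, 0.30, 0.05])  # assumed p(Y | X, theta_c)
p_X = 0.123                                   # any value: it cancels on normalizing

post_without = prior * lik_Y_given_X
post_without /= post_without.sum()
post_with = prior * lik_Y_given_X * p_X       # including p(X) changes nothing
post_with /= post_with.sum()
# The two normalized posteriors are identical, as Equation 5.6 implies.
```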

In the final line of Equation 5.6, an important manipulation is noted: $p( {\cal X} \vert \Theta^c)$ is replaced with $p({\cal X})$. This implies that observing $\Theta^c$ does not affect the probability of ${\cal X}$. This operation is invalid in the joint density estimation case since $\Theta$ has parameters that determine a density in the ${\cal X}$ domain. However, in conditional density estimation, if ${\cal Y}$ is not also observed, $\Theta^c$ is independent of ${\cal X}$. It in no way constrains or provides information about the density of ${\cal X}$, since it merely parametrizes a conditional density $p({\bf y}\vert{\bf x})$. The graphical models in Figure 5.4 depict the difference between joint density models and conditional density models using a directed acyclic graph [35] [28]. Note that the $\Theta^c$ model and the ${\cal X}$ data are independent if ${\cal Y}$ is not observed in the conditional density estimation scenario. In graphical terms, the joint parametrization $\Theta$ is a parent of the children nodes ${\cal X}$ and ${\cal Y}$. Meanwhile, the conditional parametrization $\Theta^c$ and the ${\cal X}$ data are co-parents of the child ${\cal Y}$ (they are marginally independent). Equation 5.7 then illustrates the directly estimated conditional density solution $p({\bf y} \vert {\bf x})^c$.

Figure 5.4: The Graphical Models. (a) Joint Density Estimation. (b) Conditional Density Estimation.

$\displaystyle \begin{array}{ll}
p({\bf y} \vert {\bf x}) ^c
& = \int p( {\bf y} \vert {\bf x}, \Theta^c) \:
p(\Theta^c \vert {\cal X},{\cal Y}) \: d\Theta^c \\
& = \int p( {\bf y} \vert {\bf x}, \Theta^c) \:
p({\cal Y} \vert {\cal X},\Theta^c) \: p(\Theta^c) \: d\Theta^c \:\:\:\:
/ \: p ({\cal Y}\vert{\cal X})
\end{array}$     (5.7)

The Bayesian integration estimate of the conditional density is thus different from the conditioned Bayesian integration estimate of the joint density. The integral is typically difficult to evaluate, so the corresponding conditional MAP and conditional ML solutions are given in Equation 5.8.

$\displaystyle p({\bf y} \vert {\bf x} ) ^c \approx
p({\bf y} \vert {\bf x}, \hat{\Theta}^c)
\:\:\:\:
\hat{\Theta}^c = \left\{
\begin{array}{ll}
\arg\max p({\cal Y} \vert \Theta^c , {\cal X}) \: p(\Theta^c) & \mbox{MAP}^c \\
\arg\max p({\cal Y} \vert \Theta^c , {\cal X}) & \mbox{ML}^c
\end{array}\right.$     (5.8)
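For a concrete instance of the ML$^c$ branch, consider a linear-Gaussian conditional model $y \sim N(ax + b, \sigma^2)$: maximizing $\sum_i \log p(y_i \vert x_i, \Theta^c)$ over $(a,b)$ reduces to ordinary least squares. The data below are synthetic, generated solely to exercise the sketch.

```python
import numpy as np

# Conditional ML (the ML^c case of Equation 5.8) for a linear-Gaussian model:
# maximizing the conditional log-likelihood over (a, b) is least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)                      # synthetic inputs
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=200)   # synthetic outputs

# Closed-form conditional ML estimate via the normal equations.
A = np.stack([x, np.ones_like(x)], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
# a_hat and b_hat recover the generating slope and intercept (2.0 and 0.5)
# up to noise; no model of p(x) was needed at any point.
```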

At this point, the reader is encouraged to read the Appendix for an example of conditional Bayesian inference ( $p({\bf y} \vert {\bf x})^c$) and how it differs from conditioned joint Bayesian inference ( $p({\bf y} \vert {\bf x})^j$). From this example we note that (regardless of the degree of sophistication of the inference) direct conditional density estimation is different from, and superior to, conditioned joint density estimation. Since full Bayesian integration is computationally too intensive in many applications, the ML$^c$ and MAP$^c$ cases derived above will be emphasized. In the following, we shall specifically attend to the conditional maximum likelihood case (which can be extended to MAP$^c$) and see how General Bound Maximization (GBM) techniques can be applied to it. The GBM framework is a set of operations and approaches that can be used to optimize a wide variety of functions. Subsequently, the framework is applied to the ML$^c$ and MAP$^c$ expressions advocated above to find their maximum. The result of this derivation is the Conditional Expectation Maximization (CEM) algorithm, which will be the workhorse learning system we will be using for the ARL training data.

Tony Jebara