
Squandered Resources: The Shoe Salesman Problem

We shall now describe an entertaining estimation problem and show how it argues convincingly for conditional (discriminative) models over joint (generative) models. Consider a shoe salesman who must learn how to fit his clients with shoes of the right size. Each client has a file containing all of their personal information except shoe size. The shoe salesman has access to this data and therefore knows a variety of things about his clients: hair color, eye color, height, weight, sex, age, and so on. In fact, suppose that each personal file contains 999 numerical values (features) describing one client. Shoe size, of course, is not in the file, and the first few hundred clients who visit the salesman get their shoe size measured directly with a ruler. However, the salesman (for whatever reason) finds this a tedious process and wishes to automate the shoe size estimation. He decides to do this by learning what shoe size to expect for a client exclusively from the personal file. Assume that each client mails the shoe salesman his personal file before coming in to get new shoes.


  
Figure 5.2: Large Input Data, Small Output Estimation
[Figure: shoe.ps]

Figure 5.2 depicts this learning problem. Given the 999 measurements ${\bf x}$ (which are always sent to the salesman), what is the correct shoe size ${\bf y}$? The problem with fitting a generative model to this data is that it makes no distinction between ${\bf x}$ and ${\bf y}$. The variables are simply clumped together as measurements in one big joint space, and the model tries to describe the overall phenomenon governing all of them. As a result, the system is swamped learning useless interactions among the 999 variables in ${\bf x}$ (the file). It does not focus on learning how to use ${\bf x}$ to compute ${\bf y}$ but instead worries equally about estimating any variable in the common soup from any possible observation. Thus, the system wastes resources on spurious features that have nothing to do with shoe size estimation (like hair and eye color) instead of concentrating on the ones that do (height, age, etc.).
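To make this resource argument concrete, consider a rough parameter count (a sketch only, under the assumption that the joint model is a single full-covariance Gaussian over the $D=1000$ dimensional vector $({\bf x},{\bf y})$ and the conditional model is the corresponding linear-Gaussian regression). The joint covariance alone requires
\begin{displaymath}
\frac{D(D+1)}{2} \;=\; \frac{1000 \cdot 1001}{2} \;\approx\; 5 \times 10^{5}
\end{displaymath}
parameters, nearly all of which describe interactions among the 999 file entries, whereas the conditional mean $E[{\bf y}\vert{\bf x}]$ that the salesman actually cares about involves only on the order of $10^{3}$ regression weights (one per feature, plus a noise variance) mapping ${\bf x}$ to ${\bf y}$.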


  
Figure 5.3: Distracting Features
[Figure: shoeJoint.ps]

A misallocation of resources occurs when a generative model is used as in Figure 5.3. Note how the model places separate linear sub-models to describe the interaction between hair darkness and eye darkness. This is wasteful for the application since these attributes have nothing to do with shoe size, as can be seen from the lack of correlation between hair color and shoe size. In fact, observing the height of the client alone seems sufficient to determine the shoe size. The misallocation of resources is a symptom of joint density estimation of $p({\bf x},{\bf y})$, which treats all variables equally. A discriminative or conditional model that estimates $p({\bf y}\vert{\bf x})$ does not spend resources estimating hair darkness or other useless features; only the estimation of ${\bf y}$, the shoe size, is of concern. Thus, as the dimensionality of ${\bf x}$ grows (here, to 999), a joint model wastes more resources modeling the other features and performance on ${\bf y}$ usually degrades even though information is being added to the system. However, if it is made explicit that only the ${\bf y}$ output is of value, additional features in ${\bf x}$ should be more helpful than harmful.
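The contrast can also be illustrated with a small simulation (a sketch only; the feature count, the variable names X, y, w_joint, w_cond, and the linear/Gaussian modelling choices below are illustrative assumptions, not the estimators developed in this thesis). The joint route fits one Gaussian over $({\bf x},{\bf y})$ and then reads off the implied conditional, while the discriminative route estimates only the regression from ${\bf x}$ to ${\bf y}$:

import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 50                                 # clients and file features (toy sizes, not 999)
X = rng.normal(size=(n, d))                    # personal-file features; only the first matters
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)   # "shoe size" depends on a single feature

# Joint (generative) route: fit one Gaussian over (x, y), then read off E[y|x].
Z = np.column_stack([X, y])
S = np.cov(Z, rowvar=False)                    # d(d+1)/2 of these entries involve x only
w_joint = np.linalg.solve(S[:d, :d], S[:d, d]) # conditional weights implied by the joint fit

# Conditional (discriminative) route: estimate p(y|x) directly by least squares.
w_cond, *_ = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)

print("covariance entries the joint fit estimates:", (d + 1) * (d + 2) // 2)
print("weights the conditional fit estimates     :", d)
print("weight on the relevant feature (joint, conditional):", w_joint[0], w_cond[0])

In this easy linear case both routes recover essentially the same weight on the single relevant feature, but the joint route had to estimate over a thousand covariance entries (mostly feature-feature interactions of the hair-color-versus-eye-color kind) to produce the fifty numbers the conditional route estimates directly; with limited data or a constrained model class, that overhead is exactly the squandering of resources described above.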

We will now show that conditional or discriminative models fit nicely into a fully probabilistic framework. The following chapters outline a monotonically convergent training and inference algorithm (CEM, a variation on EM) derived for conditional density estimation. This overcomes some of the ad hoc inference aspects of conditional or discriminative models and yields a formal, efficient and reliable way to train them. In addition, a probabilistic model allows us to manipulate the model after training using principled Bayesian techniques.

