
Perceptual Feedback Mode

Of course, the learning system generates both a ${\hat {\bf y}}_B(t)$ and a ${\hat {\bf y}}_A(t)$. Therefore, the information in ${\hat {\bf y}}_A(t)$ can be exploited at no extra cost while the system is interacting with the user. Recall the earlier brief discussion of the similarity of this output to that of a filter (i.e., a Kalman filter). Instead of explicitly using Kalman filters in the vision systems (as described earlier), one could consider the predicted ${\hat {\bf y}}_A(t)$ as an alternative to filtering and smoothing. The ARL system then acts as a sophisticated non-linear dynamical filter. In that sense, it could be used to assist the vision tracking and even resolve some vision system errors.
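To make this filtering role concrete, the following Python sketch shows one plausible way a smoothed estimate could combine the raw measurement with the model's prediction. The convex blend and the weight alpha are assumptions introduced purely for illustration; the actual ARL output is produced by the learned non-linear model, not a fixed blending rule.

    import numpy as np

    def filtered_estimate(y_measured, y_hat, alpha=0.7):
        # Convex blend of the raw frame-by-frame measurement with the
        # learned prediction; alpha is the trust placed in the measurement.
        # (Illustrative assumption: the thesis does not prescribe this rule.)
        return alpha * np.asarray(y_measured) + (1.0 - alpha) * np.asarray(y_hat)

    # Example with the 15 tracked head-and-hand parameters.
    y_measured = np.random.randn(15)   # noisy vision measurement at time t
    y_hat = np.random.randn(15)        # \hat{y}_A(t) predicted by the system
    y_smooth = filtered_estimate(y_measured, y_hat)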


  
[Figure 8.4: Perceptual Mode]

Typically, tracking algorithms use a variety of temporal dynamic models to assist the frame-by-frame vision computations. The most trivial of these is to use the last estimate in a nearest-neighbour approach to initialize the next vision iteration. Kalman filtering and other dynamic models involve more sophistication, ranging from constant velocity models to very complex control systems. Here, the feedback being used to constrain the vision system results from dynamics and behaviour modeling. This is similar in spirit to the mixed dynamic and behaviour models in [46]. In the head and hand tracking case, the system continuously feeds prediction estimates of the 15 tracked parameters (3 Gaussians) back into the vision system for improved tracking, as sketched below.
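As a rough illustration, the feedback loop can be written as follows; predict and fit_blobs are hypothetical stand-ins for the learned dynamics/behaviour model and the per-frame blob fitting, respectively.

    def track(frames, predict, fit_blobs):
        # Seed each vision iteration with the behaviour-level prediction
        # rather than the previous frame's estimate alone.
        # (predict and fit_blobs are placeholders, not the actual modules.)
        history = []
        for frame in frames:
            y_init = predict(history)      # predicted parameters for time t
            y = fit_blobs(frame, y_init)   # refine the 15 blob parameters
            history.append(y)
        return history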

More significant vision errors can also be handled. Consider the specific case of head and hand tracking with skin blobs. As mentioned earlier, colored gloves were used to overcome some correspondence problems when heads and hands touched and moved past each other. The first training sequences involved no mis-correspondence due to the explicit glove labeling of head, left hand and right hand. However, once appropriately trained, the probabilistic model described above feeds the positions of the Gaussians back to the vision system. This prevents blob mislabeling by using the whole gesture as a predictor instead of short-range dynamics. Thus, it is possible to recognize a blob as a hand from its role in a gesture and to maintain proper tracking. This permits us to reliably do away with the colored gloves. In addition, a coarse model of $p({\bf x})$ is available and can be evaluated to determine the likelihood of any past interaction. If permutations of the blobs being tracked by the computer vision are occasionally tested with $p({\bf x})$, any mislabeling of the blob features can be detected and corrected. The system merely selects the one of the 6 permutations of the 3 blobs that maximizes $p({\bf x})$ and then feeds back the appropriate ${\hat {\bf y}}$ estimate to the computer vision. Instead of using complex static computations to resolve these ambiguities, a reliable correspondence between the blobs is computed from temporal information.
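A minimal sketch of this permutation test follows; the blob representation, the context vector and the interface to $p({\bf x})$ are placeholders rather than the actual implementation.

    from itertools import permutations
    import numpy as np

    def relabel_blobs(blobs, context, p_x):
        # Test all 3! = 6 label assignments of the three blobs (head,
        # left hand, right hand) and keep the ordering that the coarse
        # interaction model p(x) finds most likely.
        # (context and p_x are hypothetical placeholders.)
        best, best_score = None, -np.inf
        for perm in permutations(blobs):
            x = np.concatenate([np.asarray(context)] +
                               [np.asarray(b) for b in perm])
            score = p_x(x)                 # likelihood of this labeling
            if score > best_score:
                best, best_score = perm, score
        return list(best)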

