One real-time application developed by Jebara and Pentland [32] is the automatic real-time 3D face tracking system shown in Figure 11. An automatic initialization module finds the face, locating eye, nose and mouth coordinates in under a second. These coordinates are then used to initialize 8 normalized correlation tracking squares (i.e. sum-of-squared-distance minimization [22]) on the face.
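Each tracking square locks onto its target by minimizing a sum-of-squared-differences score over a small search window around its previous position. A minimal sketch of such a tracker follows; the function name, search radius, and brute-force search are illustrative assumptions, not the implementation of [22] or [32]:

```python
import numpy as np

def ssd_track(frame, template, center, search_radius=8):
    """Locate `template` near `center` in `frame` by minimizing the
    sum-of-squared-differences (SSD) over a local search window."""
    th, tw = template.shape
    cy, cx = center
    best_score, best_pos = np.inf, center
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            y, x = cy + dy, cx + dx
            patch = frame[y - th // 2 : y - th // 2 + th,
                          x - tw // 2 : x - tw // 2 + tw]
            if patch.shape != template.shape:
                continue  # search window fell off the image
            score = np.sum((patch.astype(float) - template) ** 2)
            if score < best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

In practice a real-time system would restrict the search and normalize for brightness, but the core criterion is the same squared-distance minimum.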

Each square can translate, rotate and scale and so is equivalent to
two 2D point features (Figure 12(a)(b)(c)). The
resulting 16 features are fed into the SfM algorithm resulting in the
recovery of 16 *rigid* 3D points. This estimated rigid 3D model is
then reprojected onto the image plane to generate a set of 16 *rigidly constrained* 2D points. These points are used to relocate the
individual trackers for tracking the motion in the next frame. The
trackers estimate an instantaneous trajectory yet are not permitted to
follow it directly (as they would in a nearest-neighbor tracking
framework). Instead, this estimate is fed to the SfM algorithm, which computes
the corresponding rigid trajectory and repositions the trackers along
this rigid 'path' for the next frame in the sequence. Thus, instead of
letting each square individually track, the SfM couples them all,
forcing them to behave as if they were glued onto a rigid 3D body
(i.e. a 3D face). Furthermore, the 8 trackers output an error level
which can be used in the **R** matrix of the SfM Kalman filter to
adaptively weight good features more heavily than bad ones in the 3D
estimates. Feature errors are mapped into a Gaussian localization
uncertainty by an initial perturbation analysis which computes each
tracker's error sensitivity under small displacements.
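The adaptive weighting can be illustrated with a standard Kalman measurement update in which each tracker's error level sets the corresponding diagonal entry of the measurement-noise matrix **R**. The identity measurement model and the specific error values below are simplifying assumptions for illustration, not the paper's actual SfM filter:

```python
import numpy as np

def kalman_update(x, P, z, R):
    """Kalman measurement update with identity measurement model H = I.
    Measurements with larger R entries (noisier trackers) receive a
    smaller gain and so contribute less to the state correction."""
    H = np.eye(len(x))
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Map each tracker's error level to a measurement variance; in the
# real system the perturbation analysis supplies these values.
tracker_errors = np.array([0.1, 0.1, 2.0, 0.1])  # third tracker unreliable
R = np.diag(tracker_errors ** 2)
```

With this R, the unreliable tracker's measurement is largely ignored while the confident trackers dominate the estimate.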

The end result is a much more stable tracking framework (operating at 30Hz). If some trackers are occluded or fail, the others pull them along via the imposed rigidity constraint. The feedback from the adaptive Kalman filter maintains a sense of 3D structure and enforces a global collaboration between the separate 2D trackers. Thus, tracking remains stable for minutes, rather than the seconds achieved when no SfM feedback is used. Figure 12(d) depicts the stability under occlusion, where a mouth and an eye tracker are distracted by the presence of the user's finger. Similarly, in Figure 12(e), the mouth tracker is distracted by deformation (smiling), since the mouth no longer resembles the closed mouth with which the template was initialized. Tracking remains stable under both conditions due to the feedback loop.
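The relocation step that glues the 2D trackers to the rigid body amounts to reprojecting the recovered rigid 3D points through the camera and handing the resulting 2D positions back to the trackers. A minimal perspective-projection sketch; the focal length and principal point are illustrative placeholders, not calibration values from the system:

```python
import numpy as np

def reproject_rigid_points(points_3d, focal=500.0, center=(160.0, 120.0)):
    """Perspective-project recovered rigid 3D points (N x 3, camera
    frame, Z > 0) back onto the image plane; the resulting 2D points
    are the rigidly constrained positions used to relocate the trackers."""
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = focal * X / Z + center[0]
    v = focal * Y / Z + center[1]
    return np.stack([u, v], axis=1)
```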

The algorithm also re-initializes when it detects that it has lost the face, as in Figure 13. This detection is performed via the so-called "Distance-from-Face-Space" calculation, which essentially computes the probability of an image patch being a face with respect to a constrained Gaussian distribution [40]. While multiple real and synthetic tests show very strong convergence, we have also used the system extensively in the real-time application settings described above, where it behaved consistently and reliably.
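The loss-of-track test can be sketched as a PCA reconstruction residual: the distance-from-face-space is the energy of an image patch lying outside the span of a face eigenspace, which grows large when the patch is not a face [40]. The helper below is an illustrative sketch under that reading, not the authors' implementation:

```python
import numpy as np

def distance_from_face_space(patch, mean_face, eigenfaces):
    """Residual energy of `patch` outside the span of `eigenfaces`
    (rows = orthonormal principal components of training faces).
    A large value suggests the patch is not a face."""
    d = patch.ravel().astype(float) - mean_face
    coeffs = eigenfaces @ d                # project into face space
    residual = d - eigenfaces.T @ coeffs   # component outside face space
    return float(np.dot(residual, residual))
```

A patch well explained by the eigenfaces yields a near-zero distance, while an arbitrary patch yields a large one; thresholding this value triggers re-initialization.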