
Audio-Visual Associative Memory System

The audio-visual associative memory operates on a record-and-associate paradigm: audio-visual clips are recorded at the push of a button and then associated with an object of interest. Subsequently, the audio-visual associative memory module receives object labels along with confidence levels from the object recognition system. If the confidence is high enough, it retrieves from memory the audio-visual information associated with the object the user is currently looking at and overlays this information on the user's field of view.

The audio-visual recording module accumulates buffers containing audio-visual data. These circular buffers hold the past 2 seconds of compressed audio and video. Whenever the user decides to record the current interaction, the system stores the incoming data until the user signals the recording to stop. The user aims the head-mounted video camera and microphone to target and shoot the required footage, forming an audio-video clip. After recording such a clip, the user selects the object that should trigger its playback. This is done by directing the camera towards an object of interest and triggering the unit (i.e. pressing a button). The system then instructs the vision module to add the captured image to its database of objects and to associate the object's label with the most recently recorded A/V clip. Additionally, the user can indicate negative interest in objects that might be misinterpreted by the vision system as trigger objects (e.g. due to their visual similarity to previously encountered trigger objects). Thus, both positive and negative reinforcement can be used in forming these associations, and the user can actively help the system learn the differences between uninteresting objects and important cue objects.
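The rolling pre-record buffer described above can be sketched as follows. This is an illustrative reconstruction, not the actual DyPERS implementation; the frame rate and the class and method names are assumptions.

```python
from collections import deque

BUFFER_SECONDS = 2
FRAME_RATE = 15  # assumed frames per second; not specified in the paper


class AVRecorder:
    """Keeps a rolling window of the most recent A/V frames so that,
    when the user presses record, the clip starts 2 seconds in the past."""

    def __init__(self):
        # deque with maxlen silently drops the oldest frame when full,
        # which is exactly the circular-buffer behavior described.
        self.ring = deque(maxlen=BUFFER_SECONDS * FRAME_RATE)
        self.clip = None  # active recording, or None when idle

    def on_frame(self, frame):
        if self.clip is not None:
            self.clip.append(frame)   # recording: keep every frame
        else:
            self.ring.append(frame)   # idle: maintain the 2-second window

    def start_recording(self):
        self.clip = list(self.ring)   # seed the clip with the buffered past

    def stop_recording(self):
        finished, self.clip = self.clip, None
        self.ring.clear()
        return finished
```

With a 15 fps assumption, a recording started after 100 idle frames begins with the 30 most recently buffered frames, so the moment just before the button press is never lost.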

The primary functionality of DyPERS is implemented in a simple 3-button interface (via a wireless mouse or a notebook PC with a wireless WaveLan). The user can select from a record button, an associate button and a garbage button. The record button stores the A/V sequence. The associate button merely makes a connection between the currently viewed visual object and the previously recorded sequence. The garbage button associates the current visual object with a NULL sequence, indicating that it should not trigger any playback. This helps resolve errors or ambiguities in the vision system. This association process is shown in Figure 2. A simple 3-command speech interface could also be incorporated following the same paradigm.
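The 3-button logic amounts to a small state machine over an association table. A minimal sketch, assuming a label-to-clip dictionary and a `None` sentinel for the NULL sequence (both illustrative choices, not details from the paper):

```python
NULL_SEQUENCE = None  # garbage association: recognized, but triggers nothing


class ButtonInterface:
    """Sketch of the record / associate / garbage button semantics."""

    def __init__(self):
        self.last_clip = None   # most recently recorded A/V sequence
        self.associations = {}  # object label -> clip, or NULL_SEQUENCE

    def record(self, clip):
        # Record button: store the A/V sequence for later association.
        self.last_clip = clip

    def associate(self, object_label):
        # Associate button: link the currently viewed object
        # to the previously recorded sequence.
        self.associations[object_label] = self.last_clip

    def garbage(self, object_label):
        # Garbage button: negative reinforcement; this object
        # should never trigger playback.
        self.associations[object_label] = NULL_SEQUENCE

    def lookup(self, object_label):
        return self.associations.get(object_label)
```

A garbage-marked object thus behaves like a known object with nothing to play, which is how visually similar distractors are suppressed without confusing the recognizer.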

  
Figure: Associating A/V Sequences to Objects
\begin{figure}
\centerline{\psfig{figure=fig-uist-98/association2.eps,width=3.1in}}
\end{figure}

Whenever the user is not recording or associating, the system runs continuously in a background mode, trying to find objects in the field of view that have been associated with an A/V sequence. DyPERS thus acts as a parallel perceptual remembrance agent that is constantly trying to recognize and explain (by remembering associations) what the user is paying attention to. Figure 3 depicts an example of the overlay process. In the top of the figure, an 'expert' demonstrates how to change the bag on a vacuum cleaner. The user records the process and then associates the explanation with the image of the vacuum's body. Thereafter, whenever the user looks at the vacuum (as in the bottom of the figure), he or she automatically sees an animation (overlaid on the left of the field of view) explaining how to change the dust bag. The recording, association and retrieval processes are all performed online in a seamless manner.
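One iteration of this background mode can be sketched as a confidence-gated lookup. The threshold value and all function names here are illustrative assumptions; the paper only states that playback occurs when the recognizer's confidence is high enough and an association exists.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed value, not from the paper


def background_step(recognizer_output, associations, overlay):
    """One pass of the background loop: if a confidently recognized
    object has an associated clip, overlay its playback on the HUD.

    recognizer_output: (label, confidence) pair from the vision module
    associations: label -> clip mapping (None marks garbage objects)
    overlay: callable that plays a clip in the user's field of view
    """
    label, confidence = recognizer_output
    if confidence < CONFIDENCE_THRESHOLD:
        return False                 # not confident enough: do nothing
    clip = associations.get(label)
    if clip is None:                 # unknown or garbage-marked object
        return False
    overlay(clip)                    # trigger playback over the HUD
    return True
```

Low-confidence labels and garbage-marked objects both fall through silently, so the agent stays quiet unless it is sure of a meaningful association.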


  
Figure: Sample Output Through HUD
\begin{figure}\center
\begin{tabular}[b]{c}
\epsfysize=1.5in
\epsfbox{fig-uist-98/berntvacuum2.ps}
\end{tabular}\end{figure}


Tony Jebara
1998-10-07