The first scenario concerns the exploitation of linguistic context in vision systems. Linguistic context is qualitative in nature and is obtained dynamically. We view this as a new paradigm that strikes a golden mean between data-driven object detection and site-model-based vision. Our theory of collateral-based vision includes goal-directed NLP, suitable knowledge representations, and efficient search strategies. We discuss the design and implementation of Show&Tell, a multimedia system for semi-automated image annotation. This system, which combines advances in speech recognition, natural language processing, and image understanding, is designed to facilitate the work of image analysts.
The second scenario concerns the interaction of textual and photographic information in multimodal documents. The World Wide Web (WWW) may be viewed as the ultimate large-scale, dynamically changing multimedia database. Finding useful information on the WWW poses a challenge in the area of multimodal information indexing and retrieval. The word ``indexing'' is used here to denote the extraction and representation of semantic content. Our research focuses on improving precision and recall in a multimodal information retrieval system by interactively combining text processing with image processing. We exploit the fact that images do not appear in isolation, but rather with accompanying text, which we refer to as collateral text. The interaction of text and image content takes place in both the indexing and retrieval phases. We discuss an application of this research: a picture search engine that permits a user to retrieve pictures of people in various contexts.
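The interplay described above can be illustrated with a minimal sketch. This is not the paper's implementation; the class and field names (`MultimodalIndex`, `collateral_text`, `faces_detected`) are hypothetical. It indexes pictures by words from their collateral text and, at retrieval time, intersects that textual evidence with image-derived evidence (here, a simple face-detection count) to improve precision:

```python
from dataclasses import dataclass


@dataclass
class Picture:
    url: str
    collateral_text: str      # caption or surrounding text on the page
    faces_detected: int = 0   # evidence from an image-processing stage


class MultimodalIndex:
    """Hypothetical sketch: text index over collateral text, filtered by image evidence."""

    def __init__(self):
        self.pictures = []
        self.text_index = {}  # word -> set of picture ids

    def add(self, pic):
        pid = len(self.pictures)
        self.pictures.append(pic)
        # Indexing phase: extract words from the collateral text.
        for word in pic.collateral_text.lower().split():
            self.text_index.setdefault(word, set()).add(pid)

    def search_person(self, name):
        # Retrieval phase, text evidence: every word of the name
        # must occur in the picture's collateral text.
        words = name.lower().split()
        candidates = set.intersection(
            *(self.text_index.get(w, set()) for w in words)
        ) if words else set()
        # Retrieval phase, image evidence: keep only pictures in
        # which at least one face was actually detected.
        return [self.pictures[i].url for i in sorted(candidates)
                if self.pictures[i].faces_detected > 0]
```

For example, a query for ``Clinton'' matches the collateral text of both a portrait captioned ``Bill Clinton at the summit'' and a photo captioned ``Clinton Street bridge''; the image-evidence filter discards the second, raising precision:

```python
idx = MultimodalIndex()
idx.add(Picture("a.jpg", "Bill Clinton at the summit", faces_detected=1))
idx.add(Picture("b.jpg", "Clinton Street bridge", faces_detected=0))
idx.search_person("Clinton")  # -> ["a.jpg"]
```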