This project explores three new related approaches to making the indexing and retrieval of videos (1) more efficient by developing rapid feature selection, (2) more meaningful by devising measurably useful indexing ontologies, and (3) more humanly navigable by demonstrating integrated multimedia browsers, even when the videos are unedited.
The first approach consists of heuristic adaptations of machine learning algorithms to the overly redundant data that is video. The second approach consists of using statistical tools to examine the descriptive tags that can be affixed to video segments, in order to measure and refine their quality. The third approach consists of exploiting the first two approaches to discover and display the weak structure that is latent even in the unedited videos of student presentations. The experimental research will be refined by continuing user studies involving middle-school, high-school, college, and post-graduate students.
The results of this project will provide advances at the multiple intersections of computer vision, machine learning, data management, information retrieval, ontology design, user interface technology, and user studies, with possible applications to sensory retrieval more generally.
Broader impacts: We expect that the browser will enhance the effectiveness of undergraduate education by allowing accurate and rapid review of both instructor and student recorded presentations.
We also expect that the underlying novel technologies will permit real-time analysis and access of more standard videos. The project Web site (http://www.cs.columbia.edu/~jrk/unstructured) will be used to disseminate resulting publications, open-source code, and instructions on how to obtain annotated video data sets.
First and second year findings:
We have developed a browser and tested it through user studies, which show that, contrary to intuition, users locate video segments of interest more quickly and more accurately if the video itself is inaccessible. They do not waste as much time watching the video, but instead use the derived features to sample and home in on the relevant segments. This ability to search candidly captured lectures has shown statistically significant positive effects on student performance.
We are able to construct automatic taggers using less than one-sixteenth of the image information, by efficiently and heuristically learning which locations, resolutions, and image structures are most likely to indicate support for a semantic concept.
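The selection step above can be sketched as ranking candidate patch locations by how well they separate positive from negative frames and keeping only a small fraction. This is a minimal toy illustration, not the project's actual selector; the function names and the class-mean-difference score are assumptions for the sketch.

```python
def rank_patch_features(frames, labels, patch_size=4):
    """Score each patch location by how well its mean intensity
    separates positive from negative frames (toy score: absolute
    difference of class means), highest-scoring locations first.
    Hypothetical stand-in for the project's heuristic selector."""
    h, w = len(frames[0]), len(frames[0][0])
    scores = {}
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            pos, neg = [], []
            for frame, lab in zip(frames, labels):
                mean = sum(frame[y + dy][x + dx]
                           for dy in range(patch_size)
                           for dx in range(patch_size)) / patch_size ** 2
                (pos if lab else neg).append(mean)
            scores[(y, x)] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(scores, key=scores.get, reverse=True)

def select_top_fraction(ranked, fraction=1 / 16):
    """Keep only the top fraction of ranked patch locations."""
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

A classifier trained only on the surviving locations then touches a small fixed fraction of each frame's pixels.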
We find we can adapt standard OCR techniques to locate significant words for indexing purposes, even under poor lighting and camera conditions, using a novel way of precompiling spatial information ('integral rectangles') in the area near the words, even when that area is cluttered with background objects.
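The 'integral rectangles' idea rests on the classic summed-area table: after one pass over the image, the sum over any rectangle can be read off with four lookups in constant time, which makes scanning many candidate text regions cheap. A minimal sketch (function names are ours, not the project's):

```python
def integral_image(img):
    """Summed-area table with a zero border:
    ii[y][x] = sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0..y1-1][x0..x1-1] in O(1) via four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

With this table, sliding a detector window over every position and scale costs four additions per window instead of a full pixel sum.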
We have devised a novel way of selecting the most representative faces from the video to serve as iconic index markers for time sequences of high interest.
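One simple way to realize "most representative" is to treat each candidate headshot as a feature vector and pick the medoid, the one closest to all the others. This is a hedged sketch of that idea, not the project's published method:

```python
def most_representative(vectors):
    """Return the index of the medoid: the vector whose total
    squared distance to all others is smallest. Hypothetical
    stand-in for choosing the best headshot from a face track."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vectors)),
               key=lambda i: sum(d2(vectors[i], v) for v in vectors))
```

The medoid is robust to a few outlier frames (blurred or profile faces), since those contribute large distances and never win.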
We have established a hierarchical ontology of semantic tags for unstructured videos, based on mechanically applied rules of good formation derived from the OntoClean methodology. This directly allows rapid selection of features for more refined concepts, based on a 'cascade' of machine learning algorithms, as the concept hierarchy is traversed downward. For example, having quickly determined the best way to look for those video frames that contain text, it is faster and more accurate to use these results to determine the best way to look for text that shows code segments or tables.
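The cascade can be sketched as a top-down walk of the concept hierarchy in which a child concept's classifier runs only on frames its parent already accepted, so refined concepts see a much smaller candidate set. The data shapes and names below are our assumptions for illustration:

```python
def cascade_classify(frame, hierarchy, classifiers, concept="text"):
    """Walk the concept hierarchy top-down. A child concept's
    classifier is evaluated only when the parent concept fires,
    mirroring the cascade described above (sketch, not the
    project's implementation)."""
    if not classifiers[concept](frame):
        return []
    tags = [concept]
    for child in hierarchy.get(concept, []):
        tags += cascade_classify(frame, hierarchy, classifiers, child)
    return tags
```

For instance, with `hierarchy = {"text": ["code", "table"]}`, the code and table classifiers are never run on frames that contain no text at all.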
This site includes a description and a demonstration of the VAST MM browser for unstructured videos. This video indexing and browsing tool is designed for unstructured presentation videos (lectures, talks, etc.), particularly candidly captured student presentations. It demonstrates several integrated approaches for multi-modal analysis and indexing of audio and video. It applies visual segmentation techniques to unedited video to determine likely topic changes. Speaker segmentation methods are employed to determine individual student appearances, which are linked to extracted headshots to create a visual speaker index. Videos are augmented with time-aligned filtered keywords and phrases from highly inaccurate speech transcripts. The user interface, the VAST MM Browser (Video Audio Structure Text Multi Media Browser), combines streaming video with visual and textual indices for browsing and searching. It has been evaluated in a large engineering design course over four semesters with 598 student participants. Results on student performance suggest that our video indexing and retrieval approach is effective: the exam scores of students using the browser significantly increased.
Browseable indices include:
Thumb: snapshots as thumbnails
Time: a timeline with timestamps at points of visual change
Audio: an audio track (colored green) with audio activity
Video: a video track (colored red) with markers of varying intensity representing visual change
Content: a text track with keywords and phrases taken from the speech transcript - intensity of the text blips hints at descriptiveness and recurrence of terms
Bookmark: personal annotations that are only viewable by the logged-in user
Annotation: publicly viewable annotations, marked with the annotator's username
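The intensity of a text blip in the Content track could be driven by a tf-idf-style weight: a term that recurs within one segment but is rare across segments is more descriptive of that segment. The following is a toy sketch under that assumption; the scoring formula and function name are ours, not necessarily the browser's:

```python
import math
from collections import Counter

def blip_intensity(segments):
    """Toy tf-idf weighting over tokenized transcript segments:
    returns, per segment, a dict mapping each term to a score that
    is high when the term recurs locally but is rare globally.
    Hypothetical stand-in for the browser's descriptiveness score."""
    n = len(segments)
    df = Counter()                      # in how many segments each term occurs
    for seg in segments:
        df.update(set(seg))
    out = []
    for seg in segments:
        tf = Counter(seg)               # term frequency within this segment
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out
```

A term like "the" that appears in every segment gets weight zero, while a segment-specific keyword gets a positive weight proportional to how often it recurs there.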