We have two methods for handling very short queries. First, the relevance of a document D to a query Q is measured as the average similarity between Q and a cluster of documents similar to D. This reduces the estimated relevance of documents that are only tangentially related to the query. Second, relevant documents that are similar to one another are represented by a single document in the list given to the user, which increases the diversity of documents shown to the user.
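Both methods can be illustrated in a few lines. The sketch below is only a minimal illustration, not the implementation described later: it assumes cosine similarity over bag-of-words vectors as the similarity measure (the text does not specify one), and the function names, the fixed cluster size k, and the deduplication threshold are our own hypothetical choices.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(n * b[t] for t, n in a.items())
    na = sqrt(sum(n * n for n in a.values()))
    nb = sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_relevance(query: str, doc: str, corpus: list[str], k: int = 3) -> float:
    """Method 1: score `doc` against `query` as the mean query similarity
    over the k corpus documents most similar to `doc` (doc's cluster).
    A tangentially related document sits in a cluster whose other members
    match the query poorly, pulling its score down."""
    q, d = Counter(query.split()), Counter(doc.split())
    cluster = sorted(corpus, key=lambda x: cosine(d, Counter(x.split())),
                     reverse=True)[:k]
    return sum(cosine(q, Counter(c.split())) for c in cluster) / len(cluster)

def deduplicate(ranked: list[str], threshold: float = 0.8) -> list[str]:
    """Method 2: collapse near-duplicate results, keeping one representative
    per group of documents whose similarity exceeds `threshold`."""
    reps: list[str] = []
    for doc in ranked:
        d = Counter(doc.split())
        if all(cosine(d, Counter(r.split())) < threshold for r in reps):
            reps.append(doc)
    return reps
```

For example, with a corpus of two apple-recipe pages and one car-repair page, `cluster_relevance` ranks either apple page above the car page for the one-word query "apple", and `deduplicate` collapses the two near-identical recipes into one entry.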
Finding facts will be handled by hidden Markov model (HMM) queries generated automatically and refined through user feedback. Given a template for a fact, for example "director X of movie Y", an initial HMM generalizing the template will be assembled from a library of HMM components for standard grammatical units such as noun phrases. The keywords appearing in the HMM will be used for a standard Boolean search, and then the HMM will be used to score sentences in the retrieved documents. The user will specify which of the highest-scoring sentences actually contain facts of interest. A training algorithm will then revise the HMM to maximize the score of these sentences, and the search process will be repeated.
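The sentence-scoring step can be sketched with the standard forward algorithm for discrete HMMs. This is a toy illustration under our own assumptions, not the system's actual model: the two-state template HMM below ("director" keyword followed by a name) and all of its probabilities are invented for the example, and real sentences would of course pass through richer component HMMs for grammatical units.

```python
def forward_score(words, states, start, trans, emit, smooth=1e-6):
    """Probability of a word sequence under a discrete HMM, computed
    with the forward algorithm. Out-of-vocabulary words receive a small
    smoothing probability `smooth` so no sentence scores exactly zero."""
    # alpha[s] = probability of the prefix so far, ending in state s
    alpha = {s: start.get(s, 0.0) * emit[s].get(words[0], smooth) for s in states}
    for w in words[1:]:
        alpha = {t: sum(alpha[s] * trans[s].get(t, 0.0) for s in states)
                    * emit[t].get(w, smooth)
                 for t in states}
    return sum(alpha.values())

# Toy template HMM for "director X": a keyword state followed by a
# name state. All numbers are illustrative.
states = ["DIRECTOR", "NAME"]
start = {"DIRECTOR": 1.0}
trans = {"DIRECTOR": {"NAME": 1.0}, "NAME": {}}
emit = {"DIRECTOR": {"director": 1.0},
        "NAME": {"spielberg": 0.5, "kubrick": 0.5}}
```

A sentence fragment matching the template, such as "director spielberg", scores far higher than one that does not; after the user marks which high-scoring sentences contain genuine facts, retraining (e.g. re-estimating the emission counts from those sentences) would raise their scores further.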
Software for extracting facts from text using HMMs has been implemented by Tim Leek, and the methods mentioned above for handling short queries have been implemented by Onn Brandman. Lessons learned from these prototypes will be discussed.