Informal Internet Standards at Stanford

(Joint work with Kevin Chang, Hector Garcia-Molina, and Andreas Paepcke)

Stanford University

Document databases are available everywhere, both within organizations' internal networks and on the Internet. Their contents are often "hidden" behind search interfaces, and these interfaces vary from database to database. The algorithms that the associated search engines use to rank documents in query results are also usually incompatible across databases. Even individual organizations use search engines from different vendors to index their internal document collections. Such organizations could benefit from a unified query interface to multiple search engines, one that gives users the illusion of a single large document database. Building such "metasearchers" is currently a hard task because different search engines are largely incompatible and do not allow for interoperability.

Given a query, a metasearcher has to perform (at least) three tasks to provide a unified interface over a (large) number of document databases:

Choose the best databases at which to evaluate the query
Evaluate the query at these databases
Merge the query results from these databases
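
The three tasks above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: `Source`, `metasearch`, and `max_sources` are hypothetical names, the databases are toy in-memory collections, selection is done by vocabulary overlap, and merging naively assumes that scores from different sources are comparable, which real search engines do not guarantee.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Toy in-memory stand-in for a remote document database (hypothetical)."""
    name: str
    docs: dict  # document id -> document text

    def vocabulary(self):
        # Distinct lowercase words across all documents in this source.
        return {w for text in self.docs.values() for w in text.lower().split()}

    def search(self, terms):
        # Score each document by the number of query-term occurrences.
        hits = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            score = sum(words.count(t) for t in terms)
            if score > 0:
                hits.append((doc_id, score))
        return hits


def metasearch(query, sources, max_sources=2):
    terms = query.lower().split()
    term_set = set(terms)

    # Task 1: choose the sources whose vocabulary overlaps the query most.
    ranked = sorted(sources,
                    key=lambda s: len(s.vocabulary() & term_set),
                    reverse=True)
    chosen = [s for s in ranked[:max_sources] if s.vocabulary() & term_set]

    # Task 2: evaluate the query at each chosen source.
    per_source = {s.name: s.search(terms) for s in chosen}

    # Task 3: merge the results (assumption: scores are directly comparable).
    merged = [(name, doc_id, score)
              for name, hits in per_source.items()
              for doc_id, score in hits]
    merged.sort(key=lambda r: (-r[2], r[0], r[1]))
    return merged
```

Calling `metasearch("library search", sources)` on a list of `Source` objects returns `(source name, document id, score)` triples in descending score order. Sources with no query terms in their vocabulary are never contacted.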

The existing search engines do not help with the three tasks above. In general, text search engines:

Do not export information about the sources (the metadata problem)
Use different query languages (the query-language problem)
Rank documents in the query results using secret algorithms (the rank-merging problem)

To improve this situation, the Digital Library project at Stanford is coordinating with search-engine vendors (Fulcrum, Verity, and WAIS) and other key players (Hewlett-Packard Laboratories, Infoseek, and Microsoft Network) to reach informal agreements for unifying basic interactions in these three areas. We have also received input from representatives of GILS, Harvest, Netscape, and PLS. In particular, our proposal specifies the summaries of the source contents that the search engines should export to assist in database selection (for example, these summaries include the vocabulary of each source). We also define a simple, extensible query language with commonly supported features, drawing heavily from the Z39.50-1995 standard. Finally, we identify the information that the search engines should return with the query results so that multiple document rankings can be merged meaningfully. (Latest draft of the informal standards.)
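
To illustrate why exported summaries and per-result information help with merging, the sketch below re-scores results from several sources with one shared formula. It is a sketch under assumptions, not the proposal's actual format: the field names (`num_docs`, `doc_freq`, `term_freq`) and the simple tf-idf style formula are hypothetical choices made for this example.

```python
import math
from collections import Counter


def build_summary(docs):
    """Summarize a source: its size and, per word, how many documents contain it.

    The field names here are assumptions made for this example.
    """
    doc_freq = Counter()
    for text in docs.values():
        # Count each word once per document.
        doc_freq.update(set(text.lower().split()))
    return {"num_docs": len(docs), "doc_freq": dict(doc_freq)}


def rescore(hits, terms, summaries):
    """Re-rank hits from several sources using one common scoring formula.

    Each hit is a dict with 'source', 'doc_id', and 'term_freq' (term -> count),
    standing in for the information a source would return with its results.
    """
    total_docs = sum(s["num_docs"] for s in summaries.values())

    def global_df(term):
        # Document frequency of a term across all summarized sources.
        return sum(s["doc_freq"].get(term, 0) for s in summaries.values())

    scored = []
    for hit in hits:
        score = 0.0
        for t in terms:
            df = global_df(t)
            if df:
                # Rare terms weigh more; frequent occurrences in a document weigh more.
                score += hit["term_freq"].get(t, 0) * math.log(1 + total_docs / df)
        scored.append((hit["source"], hit["doc_id"], score))
    return sorted(scored, key=lambda r: -r[2])
```

Because every hit is scored with the same formula and the same collection-wide statistics, the merged order no longer depends on each engine's private ranking algorithm.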

This position paper is for the Distributed Indexing/Searching Workshop held in Cambridge, Massachusetts, in May 1996.