Resource Indexing and Discovery 
In a Globally Distributed Digital Library

Position Paper for an NSF-EU Digital Library Collaboratory Working Group
Budapest, Hungary, November 1997

Luis Gravano
Computer Science Department
Columbia University

The Internet has grown dramatically over the past few years. Information sources are available everywhere. Unfortunately, these information sources vary widely in the types of information and access interfaces that they provide. Therefore, using this wealth of resources effectively presents interesting and challenging problems. Ultimately, users have information needs, and should not have to be concerned with the format of the available data, or with the interfaces and access capabilities of the data sources.

Increasingly, users want to issue complex queries across Internet sources to obtain the data they require. Because of the size of the Internet, it is no longer possible to process such queries in naive ways, e.g., by accessing all the available sources. Thus, we must process queries in a way that scales with the number of sources. Also, sources vary in the type of information objects they contain and in the interface they present to their users. Some sources contain text documents and support simple query models where a query is just a list of keywords. Other sources contain more structured data and provide query interfaces in the style of relational databases. User queries might require accessing sources supporting radically different interfaces and query models. Thus, we must process queries in a way that deals with heterogeneous sources.
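One way to make query processing scale with the number of sources is to keep a compact content summary per source (here, a map from words to document frequencies) and route each query only to the few sources whose summaries score highest. The following is a minimal sketch of such source selection; the source names, summaries, and scoring function are invented for illustration.

```python
def score_source(summary, query_words):
    """Score a source by summing the document frequencies of the
    query words in the source's content summary (hypothetical scheme)."""
    return sum(summary.get(word, 0) for word in query_words)

def select_sources(summaries, query_words, k=2):
    """Return the (at most) k best-scoring sources, skipping sources
    whose summaries match no query word at all."""
    scored = [(score_source(s, query_words), name)
              for name, s in summaries.items()]
    scored = [(sc, name) for sc, name in scored if sc > 0]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy content summaries: word -> number of documents containing it.
summaries = {
    "movies-db": {"film": 900, "actor": 700},
    "cs-papers": {"database": 400, "query": 350},
    "news-wire": {"film": 50, "query": 10},
}

print(select_sources(summaries, ["film", "actor"]))
# -> ['movies-db', 'news-wire']
```

Only the selected sources are then queried, so the per-query cost grows with the number of relevant sources rather than with the total number of sources.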

Users should be able to express their information needs and receive the relevant data even when finding this data requires accessing sources of textual and non-textual documents, or sources that do not cooperate by exporting content summaries, for example. Furthermore, because the number of objects that match a query might be very large, users should receive this data ranked so that the potentially most useful objects come first. Many problems need to be solved before we can provide users with sophisticated, seamless, and transparent access to the large number and variety of Internet sources. Below is a description of some of these problems, which range from improving systems that already exist (e.g., WWW search engines for HTML documents), to dealing with sources that are currently largely ignored by WWW search engines (e.g., ``uncooperative,'' non-HTML text sources, relational databases, image repositories).

Query specification/user interface

There is much more to a description of an information need than a simple list of words. Indeed, when users look for information, they often have many additional requirements in mind beyond topical relevance. A challenge is to provide user interfaces and systems that manage to gather these user requirements without overwhelming unsophisticated users with too much complexity.

Smart Query Processing over Text Documents

Current WWW search engines generally do a poor job of ranking pages for a given user query. Typically, these engines rank the available WWW pages for the query based on the pages' contents. These page ranks are computed using variants of the vector-space and probabilistic retrieval models developed over the years by the information retrieval community. The number of WWW pages and the wide difference in their quality and scope make this approach inappropriate in many cases: users are overwhelmed with large numbers of highly ranked, low-quality pages that happen to include the query words many times.
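As a deliberately minimal sketch of the vector-space model mentioned above, the following ranks tokenized documents by the cosine similarity of tf-idf vectors. Production engines add many refinements (stemming, length normalization, phrase handling) omitted here; the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (a dict) for each tokenized document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = [{w: tf * idf[w] for w, tf in Counter(doc).items()}
            for doc in docs]
    return vecs, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(docs, query):
    """Return document indexes ordered by decreasing query similarity."""
    vecs, idf = tfidf_vectors(docs)
    qvec = {w: tf * idf.get(w, 0.0) for w, tf in Counter(query).items()}
    return sorted(range(len(docs)),
                  key=lambda i: cosine(qvec, vecs[i]), reverse=True)

docs = [["digital", "library", "search"],
        ["movie", "actor", "film"],
        ["library", "catalog"]]
print(rank(docs, ["library", "search"]))
# -> [0, 2, 1]
```

Note how this scheme rewards pages that repeat the query words, which is exactly why low-quality pages stuffed with those words can rank highly.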

An interesting problem is to exploit all the information available on the WWW to rank documents better for queries, taking into consideration the special user requirements that we discussed above. A key challenge in mining all this information for query processing is efficiency, since the volume of the information at hand is extremely large, and growing fast. Promising sources of information to employ include available citation information (e.g., as in Stanford's BackRub system), query logs, response times, user feedback, and quality reports. For example, initial work on mining query logs tries to predict what pages are likely to be useful to users based on their browsing behavior and that of previous users.
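A minimal illustration of query-log mining in the spirit just described: aggregate past (query, clicked page) pairs, and for a new occurrence of a query, suggest the pages that earlier users found useful. The log format and page names are hypothetical.

```python
from collections import Counter, defaultdict

def build_click_model(log):
    """Aggregate a (query, clicked_page) log into per-query click counts.
    The log format is assumed for illustration."""
    model = defaultdict(Counter)
    for query, page in log:
        model[query][page] += 1
    return model

def suggest(model, query, k=2):
    """Suggest the pages most often clicked by previous users
    who issued the same query."""
    return [page for page, _ in model[query].most_common(k)]

log = [
    ("jaguar", "cars.example/jaguar"),
    ("jaguar", "zoo.example/jaguar"),
    ("jaguar", "cars.example/jaguar"),
]
model = build_click_model(log)
print(suggest(model, "jaguar"))
# -> ['cars.example/jaguar', 'zoo.example/jaguar']
```

The efficiency challenge noted above arises because such models must be maintained incrementally over logs far too large to reprocess from scratch.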

Resource Discovery over Search-Only Text Sources

Search engines currently ignore the non-HTML contents of sources that are ``hidden'' behind search interfaces. In effect, search engines cannot ``crawl'' inside such sources and follow links to extract all the documents in them. Therefore, we have to resort to other mechanisms to reason about the sources' contents and determine whether they are relevant to users' information needs. A possible solution to this problem is to have sources cooperate by exporting content summaries and metadata following a known protocol (e.g., Z39.50's Explain facility, or the STARTS protocol proposal). However, if sources do not cooperate, then we have to devise alternative mechanisms for extracting meaningful content summaries automatically. Dealing with uncooperative sources is a crucial problem, since many high-quality information sources (e.g., the Internet Movie Database) fall into this category.
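One way to summarize an uncooperative, search-only source automatically is to probe it: issue a few sample queries through its public search interface, retrieve the top answers, and build an approximate vocabulary summary from them (in the spirit of query-based sampling). The toy source, probe words, and summary format below are invented for illustration.

```python
from collections import Counter

def sample_source(search, probe_words, docs_per_probe=2):
    """Approximate a content summary for a search-only source by
    issuing probe queries and counting, for each word, how many of
    the sampled documents contain it. `search` stands in for the
    source's opaque query interface."""
    seen = {}
    for word in probe_words:
        for doc_id, text in search(word)[:docs_per_probe]:
            seen[doc_id] = text
    summary = Counter()
    for text in seen.values():
        summary.update(set(text.split()))
    return summary

# A toy "uncooperative" source: we can only query it, not crawl it.
_corpus = {
    1: "casablanca is a classic film",
    2: "the actor won an award for the film",
    3: "database systems and query processing",
}
def toy_search(word):
    return [(i, t) for i, t in _corpus.items() if word in t.split()]

summary = sample_source(toy_search, ["film", "query"])
print(summary["film"])
# -> 2
```

The resulting summary is only an estimate of the source's true vocabulary statistics, but it can feed the same source-selection machinery used for cooperative sources.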

Resource Discovery over Arbitrary Sources

So far we have discussed resource discovery over sources of text documents. However, many sources on the Internet host other kinds of information, like ``relational-like'' data, images, and video. A particularly challenging open issue is how to summarize the contents of such sources in an automatic and scalable way so that we can reason about the sources when processing user queries.

Ultimately, our goal is to allow transparent query processing over sources with varying data types. For example, users should be able to issue queries whose processing involves accessing text, relational, and image sources. Before we can process queries that span several source and data types, a number of open issues need to be addressed.
