CAREER: Querying Information Sources across the Internet

Luis Gravano
Computer Science Department
Columbia University

Contact Information

Luis Gravano
Computer Science Department
Columbia University
1214 Amsterdam Avenue
New York, NY 10027

Phone: +1-212-939-7064
Fax: +1-212-666-0140
Email: gravano@cs.columbia.edu
URL: http://www.cs.columbia.edu/~gravano

WWW Pages for Project:

QProber, a system for automatically classifying "hidden-web" text databases: http://qprober.cs.columbia.edu
SDARTS, a protocol and toolkit for metasearching: http://sdarts.cs.columbia.edu
RANK: Top-k query processing (e.g., over web-accessible databases): http://www.cs.columbia.edu/~nicolas/rank/
Snowball, an information-extraction system: http://snowball.cs.columbia.edu
GeoSearch, a geographically-aware search engine: http://geosearch.cs.columbia.edu

Supported Student: Amélie Marian, second-year Ph.D. student (http://www.cs.columbia.edu/~amelie)

Project Award Information

Award Number: IIS-9733880
Duration: 9/1/1998 through 8/31/2002
Current year: 3rd year (no-cost extensions granted)
Title: CAREER: Querying Information Sources across the Internet

Keywords: hidden-web databases, web search, metasearching, top-k query processing, information extraction

Project Summary
The goal of this research project is to help users find the information that they need over the Internet. Unfortunately, Internet information sources vary widely in the types of information and access interfaces they provide. Furthermore, the number of available sources is overwhelming to users. Therefore, exploiting this wealth of resources effectively presents challenging problems, some of which we have been addressing in this project:

Exploiting contents of "hidden web" databases: "Hidden web" databases contain information that is not crawlable and hence is ignored by traditional search engines. We developed an efficient algorithm for classifying such valuable databases through a small number of query probes derived using machine learning techniques [SIGMOD'01 paper; QProber web site at http://qprober.cs.columbia.edu].
Metasearching over text databases: We have continued to develop SDARTS, a toolkit to facilitate metasearching; SDARTS contains generic, easily configurable wrappers for locally available plain-text and XML document databases, and for remote web-accessible databases [JCDL'01 and JCDL'02 papers; SDARTS web site at http://sdarts.cs.columbia.edu].
Top-k query processing: We developed algorithms for processing "top-k" queries involving "attributes" handled by autonomous web-accessible databases; our algorithms handle databases supporting a variety of access interfaces, and attempt to minimize access to remote databases [ICDE'02 paper; RANK web site at http://www.cs.columbia.edu/~nicolas/rank/].
Information extraction from text documents: We developed the Snowball system for extracting structured information from text documents, starting with just a handful of examples of the tuples to be extracted [ACM DL'00 paper; Snowball web site at http://snowball.cs.columbia.edu].
Smart web search: We developed algorithms to classify web resources according to their "geographical scope," computed by analyzing the distribution of hyperlinks to the resources, as well as the resource contents [VLDB'00 paper; GeoSearch web site at http://geosearch.cs.columbia.edu].
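
The query-probing idea behind the hidden-web classification work can be illustrated with a minimal sketch. It is not the QProber implementation: the probe rules, the ToyDatabase class, the match_count method, and the min_share threshold are all hypothetical names and values chosen for illustration. The sketch only assumes that a database reports how many documents match a query, and it assigns the category whose probes account for the largest share of matches, without retrieving any documents.

```python
from collections import Counter

# Hypothetical probe rules: each category is associated with a few query words.
# In practice such rules would be derived from a trained document classifier.
PROBE_RULES = {
    "Health": ["cancer", "diabetes"],
    "Sports": ["soccer", "baseball"],
    "Computers": ["linux", "compiler"],
}


class ToyDatabase:
    """Stand-in for a searchable database that only reports match counts."""

    def __init__(self, documents):
        self.documents = [doc.lower().split() for doc in documents]

    def match_count(self, query):
        # Number of documents containing the query word; no documents are returned.
        return sum(1 for words in self.documents if query in words)


def classify_by_probing(db, rules, min_share=0.5):
    """Return (category or None, per-category match counts)."""
    counts = Counter()
    for category, queries in rules.items():
        for query in queries:
            counts[category] += db.match_count(query)
    total = sum(counts.values())
    if total == 0:
        return None, counts
    category, hits = counts.most_common(1)[0]
    # Accept the best category only if it accounts for enough of the matches.
    return (category if hits / total >= min_share else None), counts


db = ToyDatabase(["cancer treatment study", "diabetes and diet", "linux kernel notes"])
print(classify_by_probing(db, PROBE_RULES))
```

A real system would issue the probes through each database's search interface and classify into a topic hierarchy rather than a flat set of categories; the toy in-memory database above only stands in for that interface.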

The educational objectives include the expansion of the curriculum in databases and information systems at Columbia University. In particular, a new graduate-level course covers the latest trends in database and information systems research.

Publications and Products

Query- vs. Crawling-based Classification of Searchable Web Databases, L. Gravano, P. Ipeirotis, and M. Sahami, in IEEE Data Engineering Bulletin, vol. 25, no. 1, March 2002.
Evaluating Top-K Queries over Web-Accessible Databases, N. Bruno, L. Gravano, and A. Marian, in Proc. of the 18th IEEE International Conference on Data Engineering (ICDE 2002), 2002.
Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative, P. Ipeirotis, T. Barry, and L. Gravano, to appear in Proc. of the Second ACM+IEEE Joint Conference on Digital Libraries (JCDL 2002), 2002.
Probe, Count, and Classify: Categorizing Hidden Web Databases, P. Ipeirotis, L. Gravano, and M. Sahami, in Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.
Snowball: A Prototype System for Extracting Relations from Large Text Collections (demonstration), E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, in Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.
SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching, N. Green, P. Ipeirotis, and L. Gravano, in Proc. of the First ACM+IEEE Joint Conference on Digital Libraries (JCDL 2001), 2001.
Learning Search Engine Specific Query Transformations for Question Answering, E. Agichtein, S. Lawrence, and L. Gravano, in Proc. of the 10th International World-Wide Web Conference (WWW10), 2001.
Computing Geographical Scopes of Web Resources, J. Ding, L. Gravano, and N. Shivakumar, in Proc. of the 26th International Conference on Very Large Data Bases (VLDB'00), 2000.
Automatic Classification of Text Databases through Query Probing, P. Ipeirotis, L. Gravano, and M. Sahami, in Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB'00), 2000. Also in LNCS Series no. 1997, Springer, 2001.
Combining Strategies for Extracting Relations from Text Collections, E. Agichtein, E. Eskin, and L. Gravano, in Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), 2000.
Snowball: Extracting Relations from Large Plain-Text Collections, E. Agichtein and L. Gravano, in Proc. of the 5th ACM International Conference on Digital Libraries (DL'00), 2000.
Exploiting Geographical Location Information of Web Pages, O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar, in Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB'99), 1999.
GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, and A. Tomasic, in ACM Transactions on Database Systems, vol. 24, no. 2, June 1999.
Software: SDARTS, a protocol and toolkit for metasearching, at http://sdarts.cs.columbia.edu (source code and documentation of toolkit available at this web site).

Indication of Success and Project Impact

Amélie Marian, a second-year Ph.D. student at Columbia currently funded by this grant, has been working on the problem of processing "top-k" queries over autonomous, web-accessible databases. Amélie presented a paper with her results at the IEEE ICDE 2002 conference in March.
Eugene Agichtein, a fourth-year Ph.D. student at Columbia who used to be funded by this grant, has continued to work on the problem of extracting structured information from WWW pages. (Structured information is better suited for querying, for example.) Eugene presented a paper describing his system, Snowball, at the ACM Digital Libraries 2000 conference. Eugene presented another Snowball paper, which describes an alternative way to produce extraction patterns through Sparse Markov Transducers, at the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Finally, a demo of the Snowball system was presented at the ACM SIGMOD 2001 conference. Eugene has also worked on the problem of question-answering over the web. We published a paper in collaboration with Steve Lawrence, from NEC Research, at the 10th International World-Wide Web Conference (WWW10) in 2001.
Panagiotis Ipeirotis, a third-year Ph.D. student at Columbia funded by a different grant, has been working on the problem of automatically classifying "hidden-web" databases into a Yahoo!-like hierarchy of topics. Our work, in collaboration with Mehran Sahami from E.piphany Inc., uses machine learning tools to train a rule-based document classifier over the topic hierarchy of choice, and then turns the classifier rules into queries to adaptively probe the text databases. Our technique manages to classify web databases accurately through a small number of query probes and without retrieving any documents, by just exploiting the number of matches for each query probe. Preliminary work on this problem was presented at the ACM SIGMOD Workshop on the Web and Databases (WebDB 2000), and a full paper describing our technique and its evaluation using over 100 real web-accessible databases appeared at the ACM SIGMOD 2001 conference. Panagiotis has also been the leader of the SDARTS effort described above, which resulted in two papers (at the ACM+IEEE Joint Conference on Digital Libraries (JCDL) 2001 and 2002); the source code for the SDARTS toolkit is publicly available at the SDARTS web site.
Junyan Ding, a fourth-year Ph.D. student at Columbia funded by a different grant, worked until 12/2000 on the problem of indexing and extracting non-traditional properties of WWW resources. Junyan presented a paper that we wrote in collaboration with Narayanan Shivakumar, from Gigabeat Inc., at the VLDB 2000 conference.
Jon Oringer, then an MS student at Columbia, worked on this project over summer 1999 and created a geographically-aware search engine that implements the ideas that we have developed with Junyan Ding and Narayanan Shivakumar. This search engine is available at http://geosearch.cs.columbia.edu.
In addition, several undergraduate and MS students have contributed to different aspects of the activities above, and have done research-oriented projects for credit towards their degrees.
I have created a graduate-level course on advanced database systems that expanded the curriculum on databases in particular, and on information systems in general, at Columbia. Enrollment in this course has grown with each offering: 23 students in spring 1998, 42 students in spring 1999, 58 students in spring 2000, and 67 students in spring 2002.
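
The top-k query processing work reported above tries to limit how often remote databases are accessed. As a rough illustration of that goal, the sketch below implements a simplified version of the well-known threshold-style strategy from the middleware literature, not the algorithms developed in this project. It assumes every source supports both sorted access and random access, models sources as in-memory dictionaries, and uses a plain sum as the aggregate score; the function name, data layout, and example scores are hypothetical.

```python
import heapq


def threshold_top_k(sources, k):
    """Return (top-k list of (object, total score), rounds of sorted access).

    sources: non-empty list of dicts mapping object id -> non-negative score.
    An object missing from a source is treated as having score 0 there.
    """
    # Sorted access: each source lists its objects by decreasing score.
    sorted_lists = [sorted(s.items(), key=lambda kv: -kv[1]) for s in sources]
    max_depth = max(len(lst) for lst in sorted_lists)
    seen = {}  # object id -> aggregate score over all sources
    depth = 0
    while depth < max_depth:
        last_scores = []
        for lst in sorted_lists:
            if depth < len(lst):
                obj, score = lst[depth]
                last_scores.append(score)
                if obj not in seen:
                    # Random access: fetch this object's score from every source.
                    seen[obj] = sum(s.get(obj, 0) for s in sources)
            else:
                # Exhausted source: unseen objects score 0 there.
                last_scores.append(0)
        depth += 1
        # No unseen object can have an aggregate score above this bound.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth


# Hypothetical scores (integers, to keep the comparison exact).
ratings = {"a": 9, "b": 8, "c": 3, "d": 1}
closeness = {"a": 9, "b": 7, "c": 2, "d": 1}
print(threshold_top_k([ratings, closeness], k=2))
```

In this example the two best objects are identified after two rounds of sorted access, so the remaining entries of each source are never read. Web-accessible databases often support only one of the two access types, which this simplified sketch does not model.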

Area References

Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor, in Proceedings of the 20th Symposium on Principles of Database Systems (PODS 2001), May 2001.
Query-based Sampling of Text Databases, J. Callan and M. Connell, in ACM Transactions on Information Systems, vol. 19, no. 2, April 2001.
Extracting Patterns and Relations from the World-Wide Web, S. Brin, in Proceedings of the International Workshop on the Web and Databases (WebDB'98), March 1998.
The Anatomy of a Large-scale Hypertextual Web Search Engine, S. Brin and L. Page, in Proceedings of the Seventh International World-Wide Web Conference (WWW-7), April 1998.
Searching Distributed Collections with Inference Networks, J. Callan, Z. Lu, and W. B. Croft, in Proceedings of the 18th ACM SIGIR conference (SIGIR'95), July 1995.
Authoritative Sources in a Hyperlinked Environment, J. Kleinberg, in Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.