CAREER: Querying Information Sources across the Internet

CAREER: Querying Information Sources across the Internet
NSF Award IIS-9733880

Principal Investigator
Luis Gravano
Computer Science Department
Columbia University
1214 Amsterdam Avenue
New York, NY 10027
Phone: 1-212-939-7064
Fax: 1-212-666-0140
Email: gravano@cs.columbia.edu
Homepage: http://www.cs.columbia.edu/~gravano/

Keywords
web search, metasearching, hidden-web databases, information extraction, top-k query processing

Project Summary
The goal of this research project is to help users find the information that they need over the Internet. Unfortunately, Internet information sources vary widely in the types of information and access interfaces they provide. Furthermore, the number of available sources is overwhelming to users. Therefore, exploiting this wealth of resources effectively presents challenging problems, some of which we have been addressing in this project:

Exploiting contents of "hidden web" databases: "Hidden web" databases contain information that is not crawlable and hence is ignored by traditional search engines. We developed an efficient algorithm for classifying such valuable databases through a small number of query probes derived using machine learning techniques [TOIS'03 paper; SIGMOD'01 paper].
We also extract fine-grain, accurate content summaries of the database contents via hierarchical query probes, and use these content summaries to integrate database classification with distributed searching (i.e., metasearching) via a hierarchical database selection algorithm [VLDB'02 paper;
QProber web site at http://qprober.cs.columbia.edu].
Metasearching over text databases: We have continued to develop SDARTS, a toolkit to facilitate metasearching; SDARTS contains generic, easily configurable wrappers for locally available plain-text and XML document databases, and for remote web-accessible databases [JCDL'01 and JCDL'02 papers; SDARTS web site at http://sdarts.cs.columbia.edu].
Top-k query processing: We developed algorithms for processing "top-k" queries involving "attributes" handled by autonomous web-accessible databases; our algorithms handle databases supporting a variety of access interfaces, and attempt to minimize access to remote databases [ICDE'02 paper; RANK web site at http://rank.cs.columbia.edu]. We also handle relational [
TODS'02 paper] and multimedia [
upcoming TKDE paper] databases.
Efficient information extraction from text documents: We developed the Snowball system for extracting structured information from text documents, starting with just a handful of examples of the tuples to be extracted [ACM DL'00 paper]. We also developed QXtract, an automatic query-based technique to make the extraction process over large text databases efficient by retrieving documents useful for the extraction of user-defined relations. This technique can be adapted to new domains, databases, or target relations with minimal human effort [ICDE 2003 paper; Snowball's and QXtract's web site at http://snowball.cs.columbia.edu; see also WebDB'03 paper].
Smart web search: We developed algorithms to classify web resources according to their "geographical scope," computed by analyzing the distribution of hyperlinks to the resources, as well as the resource contents [VLDB'00 paper; GeoSearch web site at http://geosearch.cs.columbia.edu].

Publications and Products

QProber: A System for Automatic Classification of Hidden-Web Databases, L. Gravano, P. Ipeirotis, and M. Sahami, in ACM Transactions on Information Systems, vol. 21, no. 1, Jan. 2003.
Optimizing Top-K Selection Queries over Multimedia Repositories, S. Chaudhuri, L. Gravano, and A. Marian, accepted for publication in IEEE Transactions on Knowledge and Data Engineering, 2003.
Learning to Find Answers to Questions on the Web, E. Agichtein, S. Lawrence, and L. Gravano, accepted for publication in ACM Transactions on Internet Technology, 2003.
Categorizing Web Queries According to Geographical Locality, L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein, accepted for publication in Proc. of the 12th ACM International Conference on Information and Knowledge Management (CIKM 2003), 2003.
Modeling Query-Based Access to Text Databases, E. Agichtein, P. Ipeirotis, and L. Gravano, in Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB 2003), 2003.
QXtract: A Building Block for Efficient Information Extraction from Text Databases (demonstration), E. Agichtein and L. Gravano, in Proc. of the 2003 ACM SIGMOD International Conference on Management of Data, 2003.
Querying Text Databases for Efficient Information Extraction, E. Agichtein and L. Gravano, in Proc. of the 19th IEEE International Conference on Data Engineering (ICDE 2003), 2003.
Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection, P. Ipeirotis and L. Gravano, in Proc. of the 28th International Conference on Very Large Data Bases (VLDB 2002), 2002.
Top-K Selection Queries over Relational Databases: Mapping Strategies and Performance Evaluation, N. Bruno, S. Chaudhuri, and L. Gravano, in ACM Transactions on Database Systems, vol. 27, no.2, Jun. 2002.
Query- vs. Crawling-based Classification of Searchable Web Databases, L. Gravano, P. Ipeirotis, and M. Sahami, in IEEE Data Engineering Bulletin, vol. 25, no. 1, March 2002.
Evaluating Top-K Queries over Web-Accessible Databases, N. Bruno, L. Gravano, and A. Marian, in Proc. of the 18th IEEE International Conference on Data Engineering (ICDE 2002), 2002.
Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative, P. Ipeirotis, T. Barry, and L. Gravano, in Proc. of the Second ACM+IEEE Joint Conference on Digital Libraries (JCDL 2002), 2002.
Probe, Count, and Classify: Categorizing Hidden Web Databases, P. Ipeirotis, L. Gravano, and M. Sahami, in Proc. of the 2001 ACM SIGMOD International Conference On Management of Data, 2001.
Snowball: A Prototype System for Extracting Relations from Large Text Collections (demonstration), E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, in Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.
SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching, N. Green, P. Ipeirotis, and L. Gravano, in Proc. of the First ACM+IEEE Joint Conference on Digital Libraries (JCDL 2001), 2001.
Learning Search Engine Specific Query Transformations for Question Answering, E. Agichtein, S. Lawrence, and L. Gravano, in Proc. of the 10th International World-Wide Web Conference (WWW10), 2001.
Computing Geographical Scopes of Web Resources, J. Ding, L. Gravano, and N. Shivakumar, in Proc. of the 26th International Conference on Very Large Data Bases (VLDB'00), 2000.
Combining Strategies for Extracting Relations from Text Collections, E. Agichtein, E. Eskin, and L. Gravano, in Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), 2000.
Snowball: Extracting Relations from Large Plain-Text Collections, E. Agichtein and L. Gravano, in Proc. of the 5th ACM International Conference on Digital Libraries (DL'00), 2000.
Exploiting Geographical Location Information of Web Pages, O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, N. Shivakumar, in Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB'99), 1999.
GlOSS: Text-Source Discovery over the Internet, L. Gravano, H. Garcia-Molina, A. Tomasic, in ACM Transactions on Database Systems, vol. 24, no. 2, Jun. 1999.

Project Impact

Amélie Marian, a third-year Ph.D. student at Columbia currently funded by this grant, has been working on the problem of processing "top-k" queries over autonomous, web-accessible databases. Amélie has co-authored a paper that got accepted for publication in the
IEEE Transactions on Knowledge and Data Engineering, as well as other papers currently in progress.
Eugene Agichtein, a fifth-year Ph.D. student at Columbia who used to be funded by this grant, has continued to work on the problem of extracting structured information from (large) text databases. (Structured information is better suited for querying, for example.) Our bootstrapping-based system, dubbed "Snowball," starts with a handful of examples of the kind of "tuples" that we want to extract. We analyze the context in which these example tuples occur in text documents, and generate high-precision patterns to find new tuples, as suggested by S. Brin in a paper in WebDB'98. The patterns in turn provide new high-precision tuples, and the process iterates until enough tuples are found. Another aspect of the information extraction problem that we have addressed is efficiency: Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adopt to new databases and domains. We have developed QXtract, an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases. Eugene won a "Best Student Paper" award for his ICDE'03 paper on the QXtract system, and has written numerous other papers this last year.
Panagiotis Ipeirotis, a fourth-year Ph.D. student at Columbia funded by a different grant, has been working on the problem of automatically classifying "hidden-web" databases into a Yahoo!-like hierarchy of topics automatically. Our work, in collaboration with Mehran Sahami, from Stanford and Google, uses machine learning tools to train a document classifier over the topic hierarchy of choice, and then turns the classifier model into queries to adaptively probe the text databases. Our technique manages to classify web databases accurately through a small number of query probes and without retrieving any documents, by just exploiting the number of matches for each query probe. We also developed an algorithm to derive "content summaries" from these databases, to be used by the database selection component of metasearchers. Our algorithm adaptively focuses on and extracts documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also designed a novel database selection algorithm that exploits both the extracted content summaries and the hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Panagiotis has also been the leader of the SDARTS software-toolkit effort described above (source code is publicly available at the SDARTS web site), as well as co-authored numerous papers on these topics.
In addition, many undergraduate and MS students have contributed to different aspects of the activities above, and have done research-oriented projects for credit towards their degrees.
I have created a graduate-level course on advanced database systems that expanded the curriculum on databases in particular, and on information systems in general, at Columbia. The enrollments in this course have been substantial: 23 students in spring'98, 42 students in spring'99, 58 students in spring'00, 67 students in spring'02, and 46 students in spring'03.

Area References

Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor, in Proceedings of the 20th Symposium on Principles of Database Systems (PODS 2001), May 2001.
Query-based Sampling of Text Databases, J. Callan and M. Connell, in ACM Transactions on Information Systems, vol. 19, no. 2, April 2001.
Extracting Patterns and Relations from the World-Wide Web, S. Brin, in Proceedings of the International Workshop on the Web and Database (WebDB'98), March 1998.
The Anatomy of a Large-scale Hypertextual Web Search Engine, S. Brin and L. Page, in Proceedings of the Seventh International World-Wide Web Conference (WWW-7), April 1998.
Searching Distributed Collections with Inference Networks, J. Callan, Z. Lu, and W. B. Croft, in Proceedings of the 18th ACM SIGIR conference (SIGIR'95), July 1995.
Authoritative Sources in a Hyperlinked Environment, J. Kleinberg, in Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.

Project Websites

QProber, a system for automatically classifying and searching "hidden-web" text databases: http://qprober.cs.columbia.edu
RANK: Top-k query processing (e.g., over web-accessible databases): http://rank.cs.columbia.edu
Snowball and QXtract, building blocks for efficient information extraction: http://snowball.cs.columbia.edu
SDARTS, a protocol and toolkit for metasearching: http://sdarts.cs.columbia.edu
GeoSearch, a geographically-aware search engine: http://geosearch.cs.columbia.edu

Online Software

SDARTS, a protocol and toolkit for metasearching, at http://sdarts.cs.columbia.edu (source code and documentation of toolkit available at this web site).

Keywords web search, metasearching, hidden-web databases, information extraction, top-k query processing

Publications and Products

Project Impact

Area References

Project Websites

Online Software

Keywords
web search, metasearching, hidden-web databases, information extraction, top-k query processing