CAREER: Querying Information Sources across the Internet
NSF Award IIS-9733880
Principal Investigator
Luis Gravano
Computer Science Department
Columbia University
1214 Amsterdam Avenue
New York, NY 10027
Phone: 1-212-939-7064
Fax: 1-212-666-0140
Email:
gravano@cs.columbia.edu
Homepage:
http://www.cs.columbia.edu/~gravano/
Keywords
web search, metasearching,
hidden-web databases,
information extraction, top-k query processing
Project Summary
The goal of this research project is to help users
find the information that they need over the Internet. Unfortunately,
Internet information sources vary widely in the types of information and
access interfaces they provide. Furthermore, the number of available
sources is overwhelming to users. Therefore, exploiting this wealth of
resources effectively presents challenging problems, some of which we
have been addressing in this project:
- Exploiting contents of "hidden web" databases:
"Hidden web" databases contain information that is not crawlable and hence is
ignored by traditional search engines. We developed an efficient algorithm for
classifying such valuable databases through a small number of query probes
derived using machine learning techniques [TOIS'03
paper;
SIGMOD'01 paper].
We also extract fine-grain, accurate content summaries of
the database contents via hierarchical query probes, and use these content
summaries to integrate database classification with distributed searching
(i.e., metasearching) via a hierarchical database selection algorithm [VLDB'02
paper;
QProber web site at
http://qprober.cs.columbia.edu].
- Metasearching over text databases: We have continued
to develop SDARTS, a toolkit to facilitate metasearching; SDARTS contains
generic, easily configurable wrappers for locally available plain-text and XML
document databases, and for remote web-accessible databases [JCDL'01
and
JCDL'02 papers; SDARTS web site at
http://sdarts.cs.columbia.edu].
- Top-k query processing: We developed
algorithms for processing "top-k" queries involving "attributes"
handled by autonomous web-accessible databases; our algorithms handle
databases supporting a variety of access interfaces, and attempt to minimize
access to remote databases [ICDE'02
paper; RANK web site at
http://rank.cs.columbia.edu]. We also handle relational [
- Efficient information extraction from text documents:
We developed the Snowball system for extracting
structured information from text documents, starting with just a handful of
examples of the tuples to be extracted [ACM
DL'00 paper]. We also developed QXtract, an automatic query-based
technique to make the extraction process over large text databases efficient
by retrieving documents useful for the extraction of user-defined
relations. This technique can be adapted to new
domains, databases, or target relations with minimal human effort [ICDE
2003 paper; Snowball's and QXtract's web site at
http://snowball.cs.columbia.edu;
see also
WebDB'03 paper].
- Smart web search: We developed algorithms to
classify web resources according to their "geographical scope," computed by
analyzing the distribution of hyperlinks to the resources, as well as the
resource contents [VLDB'00
paper; GeoSearch web site at
http://geosearch.cs.columbia.edu].
Publications and Products
-
QProber: A System for Automatic Classification of
Hidden-Web Databases, L. Gravano, P.
Ipeirotis, and M. Sahami, in ACM Transactions on Information Systems,
vol. 21, no. 1, Jan. 2003.
-
Optimizing Top-K Selection Queries over Multimedia
Repositories, S.
Chaudhuri, L. Gravano, and A. Marian, accepted for publication in IEEE
Transactions on Knowledge and Data Engineering, 2003.
-
Learning to Find Answers to Questions on the Web,
E. Agichtein, S. Lawrence, and L. Gravano, accepted for publication in
ACM Transactions on Internet Technology, 2003.
- Categorizing Web Queries According to
Geographical Locality, L. Gravano, V.
Hatzivassiloglou, and R. Lichtenstein, accepted for publication in
Proc. of the 12th ACM International Conference on Information and
Knowledge Management (CIKM 2003), 2003.
-
Modeling Query-Based Access to Text Databases,
E. Agichtein, P. Ipeirotis, and L. Gravano, in
Proc. of the ACM SIGMOD Workshop on the
Web and Databases (WebDB 2003), 2003.
- QXtract: A Building Block for Efficient
Information Extraction from Text Databases (demonstration), E.
Agichtein and L. Gravano, in Proc. of the 2003 ACM SIGMOD
International Conference on Management of Data, 2003.
-
Querying Text Databases for Efficient Information
Extraction, E. Agichtein and L. Gravano,
in Proc. of the 19th IEEE International Conference on Data Engineering
(ICDE 2003), 2003.
-
Distributed Search over the Hidden-Web: Hierarchical
Database Sampling and Selection, P.
Ipeirotis and L. Gravano, in
Proc. of the 28th International Conference on Very Large Data Bases (VLDB
2002), 2002.
-
Top-K Selection Queries over Relational Databases:
Mapping Strategies and Performance Evaluation,
N. Bruno, S. Chaudhuri, and L. Gravano, in ACM Transactions on
Database Systems, vol. 27, no.2, Jun. 2002.
-
Query- vs. Crawling-based Classification of Searchable
Web Databases, L. Gravano, P. Ipeirotis,
and M. Sahami, in IEEE Data Engineering Bulletin, vol. 25, no. 1,
March 2002.
-
Evaluating Top-K Queries over Web-Accessible
Databases, N. Bruno, L. Gravano,
and A. Marian, in Proc. of the 18th IEEE International Conference on
Data Engineering (ICDE 2002), 2002.
-
Extending SDARTS: Extracting Metadata from Web Databases and
Interfacing with the Open Archives Initiative,
P. Ipeirotis, T. Barry, and L. Gravano, in
Proc. of the Second ACM+IEEE Joint Conference on Digital Libraries (JCDL
2002), 2002.
-
Probe, Count, and Classify: Categorizing Hidden Web Databases,
P. Ipeirotis, L. Gravano, and M. Sahami, in
Proc. of the 2001 ACM SIGMOD International Conference On Management of
Data, 2001.
-
Snowball: A Prototype System for Extracting Relations from Large Text
Collections (demonstration), E. Agichtein, L. Gravano, J.
Pavel, V. Sokolova, and A. Voskoboynik, in Proc. of the 2001 ACM
SIGMOD International Conference on Management of Data, 2001.
-
SDLIP + STARTS = SDARTS: A Protocol and Toolkit for
Metasearching, N. Green,
P. Ipeirotis, and L. Gravano, in Proc. of the First ACM+IEEE Joint
Conference on Digital Libraries (JCDL 2001), 2001.
-
Learning Search Engine Specific Query Transformations for Question
Answering, E. Agichtein, S. Lawrence, and L. Gravano, in Proc.
of the 10th International World-Wide Web Conference (WWW10), 2001.
-
Computing Geographical Scopes of Web Resources,
J. Ding, L. Gravano, and N. Shivakumar, in Proc. of
the 26th International Conference on Very Large Data Bases (VLDB'00),
2000.
-
Combining Strategies for Extracting Relations from Text Collections,
E. Agichtein, E. Eskin, and L. Gravano, in
Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery (DMKD 2000), 2000.
-
Snowball: Extracting Relations from Large Plain-Text Collections,
E. Agichtein and L. Gravano, in Proc. of
the 5th ACM International Conference on Digital Libraries (DL'00),
2000.
-
Exploiting Geographical Location Information of Web Pages,
O. Buyukkokten, J. Cho, H. Garcia-Molina, L.
Gravano, N. Shivakumar, in Proc. of the ACM SIGMOD Workshop on the
Web and Databases (WebDB'99), 1999.
-
GlOSS: Text-Source Discovery over the Internet, L. Gravano, H.
Garcia-Molina, A. Tomasic, in ACM Transactions on Database Systems,
vol. 24, no. 2, Jun. 1999.
Project Impact
- Amélie
Marian, a third-year Ph.D. student at
Columbia currently funded by this grant, has been working on the
problem of processing "top-k" queries over autonomous,
web-accessible databases. Amélie has co-authored a paper that got
accepted for publication in the
IEEE Transactions on Knowledge and Data Engineering, as
well as other papers currently in progress.
-
Eugene Agichtein, a fifth-year Ph.D. student at Columbia who used
to be funded by this grant, has continued to work on the problem of
extracting structured information from (large) text databases. (Structured
information is better suited for querying, for example.) Our
bootstrapping-based system, dubbed "Snowball," starts with a handful
of examples of the kind of "tuples" that we want to extract. We
analyze the context in which these example tuples occur in text
documents, and generate high-precision patterns to find new tuples, as
suggested by S. Brin in a paper in WebDB'98. The patterns in turn
provide new high-precision tuples, and the process iterates until
enough tuples are found. Another aspect of the information extraction
problem that we have addressed is efficiency: Current information
extraction techniques extract relations from a text database by
examining every document in the database, or use filters to select
promising documents for extraction. The exhaustive scanning approach
is not practical or even feasible for large databases, and the current
filtering techniques require human involvement to maintain and to
adopt to new databases and domains. We have developed QXtract, an
automatic query-based technique to retrieve documents useful for the
extraction of user-defined relations from large text databases. Eugene won a
"Best Student Paper" award for his ICDE'03 paper on the QXtract
system, and has written numerous other papers this last
year.
-
Panagiotis Ipeirotis, a fourth-year Ph.D. student at Columbia
funded by a different grant, has been working on the problem of
automatically classifying "hidden-web" databases into a Yahoo!-like
hierarchy of topics automatically. Our work, in collaboration with Mehran Sahami, from
Stanford and Google, uses machine learning tools to
train a document classifier over the topic hierarchy of
choice, and then turns the classifier model into queries to adaptively
probe the text databases. Our technique manages to classify web
databases accurately through a small number of query probes and
without retrieving any documents, by just exploiting the number of
matches for each query probe. We also developed an algorithm to
derive "content summaries" from these databases, to be used by the
database selection component of metasearchers. Our algorithm
adaptively focuses on and extracts
documents that are representative of the topic coverage of the
databases. Our content summaries are the first to include absolute
document frequency estimates for the database words. We also designed
a novel database selection algorithm that exploits both the extracted
content summaries and the hierarchical classification of the databases,
automatically derived during probing, to compensate for potentially
incomplete content summaries. Panagiotis has also been the leader of the
SDARTS software-toolkit effort described above (source code is publicly available at the SDARTS web site),
as well as co-authored numerous papers on these topics.
- In addition, many undergraduate and MS students
have contributed to different aspects of the activities above, and
have done research-oriented projects for credit towards their degrees.
- I have created a graduate-level course on advanced
database systems that expanded the curriculum on databases in
particular, and on information systems in general, at Columbia. The
enrollments in this course have been substantial: 23 students in
spring'98, 42 students in spring'99, 58 students in spring'00, 67
students in spring'02, and 46 students in spring'03.
Area References
- Optimal Aggregation Algorithms for Middleware,
R. Fagin, A. Lotem, and M. Naor, in Proceedings of the 20th Symposium
on Principles of Database Systems (PODS 2001), May 2001.
- Query-based Sampling of Text Databases, J.
Callan and M. Connell, in ACM Transactions on Information Systems,
vol. 19, no. 2, April 2001.
- Extracting Patterns and Relations from the
World-Wide Web, S. Brin, in Proceedings of the International
Workshop on the Web and Database (WebDB'98), March 1998.
- The Anatomy of a Large-scale Hypertextual Web
Search Engine, S. Brin and L. Page, in Proceedings of the Seventh
International World-Wide Web Conference (WWW-7), April 1998.
- Searching Distributed Collections with
Inference Networks, J. Callan, Z. Lu, and W. B. Croft, in
Proceedings of the 18th ACM SIGIR conference (SIGIR'95), July 1995.
- Authoritative Sources in a Hyperlinked
Environment, J. Kleinberg, in Proceedings of the Ninth Annual
ACM-SIAM Symposium on Discrete Algorithms, January 1998.
Project Websites
Online Software