CAREER: Querying Information Sources across the Internet
Contact Information
WWW Pages for Project:
Project Award Information
- Award Number: IIS-9733880
- Duration: 9/1/1998 through 8/31/2002
- Current year: 3rd year (no-cost extensions granted)
- Title: CAREER: Querying Information Sources across the Internet
Keywords: hidden-web databases, web search, metasearching, top-k query
processing, information extraction
Project Summary
The goal of this research project is to help users find the
information that they need over the Internet. Unfortunately, Internet
information sources vary widely in the types of information and access
interfaces they provide. Furthermore, the number of available sources is
overwhelming to users. Therefore, exploiting this wealth of resources effectively
presents challenging problems, some of which we have been addressing in this
project:
- Exploiting contents of "hidden web" databases: "Hidden web" databases
contain information that is not crawlable and hence is ignored by traditional
search engines. We developed an efficient algorithm for classifying such
valuable databases through a small number of query probes derived using
machine learning techniques [SIGMOD'01 paper; QProber web site at
http://qprober.cs.columbia.edu].
- Metasearching over text databases: We have continued to develop SDARTS, a
toolkit to facilitate metasearching; SDARTS contains generic, easily
configurable wrappers for locally available plain-text and XML document
databases, and for remote web-accessible databases [JCDL'01 and JCDL'02
papers; SDARTS web site at http://sdarts.cs.columbia.edu].
- Top-k query processing: We developed algorithms for processing "top-k"
queries involving "attributes" handled by autonomous web-accessible
databases; our algorithms handle databases supporting a variety of access
interfaces, and attempt to minimize access to remote databases [ICDE'02
paper; RANK web site at http://www.cs.columbia.edu/~nicolas/rank/].
- Information extraction from text documents: We developed the Snowball
system for extracting structured information from text documents, starting
with just a handful of examples of the tuples to be extracted [ACM DL'00
paper; Snowball web site at http://snowball.cs.columbia.edu].
- Smart web search: We developed algorithms to classify web resources
according to their "geographical scope," computed by analyzing the
distribution of hyperlinks to the resources, as well as the resource contents
[VLDB'00 paper; GeoSearch web site at http://geosearch.cs.columbia.edu].
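As a rough illustration of the query-probing idea in the first item above, the following Python sketch classifies a database using only the number of matches that each probe reports, without retrieving any documents. The rules, the category names, the toy in-memory database, the `min_share` threshold, and all function names are assumptions made for this example; they are not taken from QProber, and the sketch probes a flat set of categories rather than adaptively descending a topic hierarchy.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical classifier rules: each rule maps a conjunction of query
# terms to a category. A real system would learn these from training data.
Rule = Tuple[Tuple[str, ...], str]
RULES: List[Rule] = [
    (("ibm", "computer"), "Computers"),
    (("lung", "cancer"), "Health"),
    (("hepatitis",), "Health"),
    (("baseball",), "Sports"),
]


def make_toy_database(documents: Sequence[str]) -> Callable[[str], int]:
    """Stand-in for a searchable web database that only reports match counts."""
    tokenized = [set(doc.lower().split()) for doc in documents]

    def count_matches(query: str) -> int:
        terms = query.lower().split()
        # A document matches when it contains every query term.
        return sum(all(t in doc for t in terms) for doc in tokenized)

    return count_matches


def classify_database(
    count_matches: Callable[[str], int],
    rules: Sequence[Rule],
    min_share: float = 0.4,  # assumed threshold, chosen for the example
) -> List[str]:
    """Return the categories whose probes account for a large share of matches."""
    totals: Dict[str, int] = defaultdict(int)
    for terms, category in rules:
        # One probe per rule; only the reported number of matches is used.
        totals[category] += count_matches(" ".join(terms))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(c for c, n in totals.items() if n / grand_total >= min_share)


if __name__ == "__main__":
    db = make_toy_database([
        "lung cancer screening guidelines",
        "hepatitis vaccine schedule",
        "new treatment for lung cancer",
        "ibm computer sales report",
    ])
    print(classify_database(db, RULES))
```

With the toy database above, three of the four matches come from the Health probes, so the sketch reports Health as the only category that passes the assumed threshold.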
The educational objectives include the expansion of the
curriculum in databases and information systems at Columbia University. In
particular, a new graduate-level course covers the latest trends in database and
information systems research.
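To make the top-k query processing item above more concrete, here is a minimal Python sketch in the spirit of the threshold-style algorithms of Fagin et al. listed under Area References. It is not the algorithm from the ICDE'02 paper. It assumes one source that returns objects in descending score order and a second source that can only be probed one object at a time, with all scores in [0, 1] and a weighted-sum scoring function; the names `top_k`, `sorted_source`, `probe`, and `weight` are hypothetical.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def top_k(
    sorted_source: Iterable[Tuple[str, float]],
    probe: Callable[[str], float],
    k: int,
    weight: float = 0.5,
) -> Tuple[List[Tuple[float, str]], int]:
    """Return the k best (score, object) pairs and the number of probes issued.

    sorted_source yields (object, score) pairs in descending score order.
    probe(object) returns the object's score at the second, remote source.
    The combined score is weight * sorted_score + (1 - weight) * probed_score.
    """
    best: List[Tuple[float, str]] = []  # min-heap holding the current top k
    probes = 0
    for obj, sorted_score in sorted_source:
        # No object from here on can beat this bound, because its probed
        # score is at most 1 and its sorted score is at most sorted_score.
        threshold = weight * sorted_score + (1 - weight) * 1.0
        if len(best) == k and best[0][0] >= threshold:
            break  # the current top k cannot be improved; stop probing
        probes += 1
        combined = weight * sorted_score + (1 - weight) * probe(obj)
        if len(best) < k:
            heapq.heappush(best, (combined, obj))
        elif combined > best[0][0]:
            heapq.heapreplace(best, (combined, obj))
    return sorted(best, reverse=True), probes


if __name__ == "__main__":
    # Toy data: restaurant ratings (sorted access) and closeness (probes).
    ratings = [("A", 1.0), ("B", 0.75), ("C", 0.5), ("D", 0.25), ("E", 0.125)]
    closeness = {"A": 0.5, "B": 1.0, "C": 0.25, "D": 1.0, "E": 1.0}
    print(top_k(ratings, closeness.__getitem__, k=2))
```

On the toy data the loop stops after two probes: once A and B are scored, the bound for C and every later object is 0.75, which cannot exceed the second-best score already found.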
Publications and Products
- Query- vs. Crawling-based Classification of Searchable Web Databases,
L. Gravano, P. Ipeirotis, and M. Sahami, in IEEE Data Engineering Bulletin,
vol. 25, no. 1, March 2002.
- Evaluating Top-K Queries over Web-Accessible Databases, N. Bruno,
L. Gravano, and A. Marian, in Proc. of the 18th IEEE International
Conference on Data Engineering (ICDE 2002), 2002.
- Extending SDARTS: Extracting Metadata from Web Databases and Interfacing
with the Open Archives Initiative, P. Ipeirotis, T. Barry, and L. Gravano,
to appear in Proc. of the Second ACM+IEEE Joint Conference on Digital
Libraries (JCDL 2002), 2002.
- Probe, Count, and Classify: Categorizing Hidden Web Databases,
P. Ipeirotis, L. Gravano, and M. Sahami, in Proc. of the 2001 ACM SIGMOD
International Conference on Management of Data, 2001.
- Snowball: A Prototype System for Extracting Relations from Large Text
Collections (demonstration), E. Agichtein, L. Gravano, J. Pavel,
V. Sokolova, and A. Voskoboynik, in Proc. of the 2001 ACM SIGMOD
International Conference on Management of Data, 2001.
- SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching,
N. Green, P. Ipeirotis, and L. Gravano, in Proc. of the First ACM+IEEE Joint
Conference on Digital Libraries (JCDL 2001), 2001.
- Learning Search Engine Specific Query Transformations for Question
Answering, E. Agichtein, S. Lawrence, and L. Gravano, in Proc. of the 10th
International World-Wide Web Conference (WWW10), 2001.
- Computing Geographical Scopes of Web Resources, J. Ding, L. Gravano, and
N. Shivakumar, in Proc. of the 26th International Conference on Very Large
Data Bases (VLDB'00), 2000.
- Automatic Classification of Text Databases through Query Probing,
P. Ipeirotis, L. Gravano, and M. Sahami, in Proc. of the ACM SIGMOD Workshop
on the Web and Databases (WebDB'00), 2000. Also in LNCS Series no. 1997,
Springer, 2001.
- Combining Strategies for Extracting Relations from Text Collections,
E. Agichtein, E. Eskin, and L. Gravano, in Proc. of the ACM SIGMOD Workshop
on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), 2000.
- Snowball: Extracting Relations from Large Plain-Text Collections,
E. Agichtein and L. Gravano, in Proc. of the 5th ACM International
Conference on Digital Libraries (DL'00), 2000.
- Exploiting Geographical Location Information of Web Pages,
O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar, in
Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB'99), 1999.
- GlOSS: Text-Source Discovery over the Internet, L. Gravano,
H. Garcia-Molina, and A. Tomasic, in ACM Transactions on Database Systems,
vol. 24, no. 2, Jun. 1999.
- Software: SDARTS, a protocol and toolkit for metasearching, at
http://sdarts.cs.columbia.edu (source code and documentation of toolkit
available at this web site).
Indication of Success and Project Impact
- Amélie Marian, a second-year Ph.D. student at Columbia currently funded by
this grant, has been working on the problem of processing "top-k" queries
over autonomous, web-accessible databases. Amélie presented a paper with her
results at the IEEE ICDE 2002 conference in March.
- Eugene Agichtein, a fourth-year Ph.D. student at Columbia who used to be
funded by this grant, has continued to work on the problem of extracting
structured information from WWW pages. (Structured information is better
suited for querying, for example.) Eugene presented a paper describing his
system, Snowball, at the ACM Digital Libraries 2000 conference. Eugene
presented another Snowball paper, which describes an alternative way to
produce extraction patterns through Sparse Markov Transducers, at the ACM
SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
Finally, a demo of the Snowball system was presented at the ACM SIGMOD 2001
conference. Eugene has also worked on the problem of question-answering over
the web. We published a paper in collaboration with Steve Lawrence, from NEC
Research, at the 10th International World-Wide Web Conference (WWW10) in
2001.
- Panagiotis Ipeirotis, a third-year Ph.D. student at Columbia funded by a
different grant, has been working on the problem of automatically
classifying "hidden-web" databases into a Yahoo!-like hierarchy of topics.
Our work, in collaboration with Mehran Sahami from E.piphany Inc., uses
machine learning tools to train a rule-based document classifier over the
topic hierarchy of choice, and then turns the classifier rules into queries
to adaptively probe the text databases. Our technique manages to classify
web databases accurately through a small number of query probes and without
retrieving any documents, by just exploiting the number of matches for each
query probe. Preliminary work on this problem was presented at the ACM
SIGMOD Workshop on the Web and Databases (WebDB 2000), and a full paper
describing our technique and its evaluation using over 100 real
web-accessible databases was presented at the ACM SIGMOD 2001 conference.
Panagiotis has also been the leader of the SDARTS effort described above,
which resulted in two papers (at the ACM+IEEE Joint Conference on Digital
Libraries (JCDL) 2001 and 2002); the source code for the SDARTS toolkit is
publicly available at the SDARTS web site.
- Junyan Ding, a fourth-year Ph.D. student at Columbia funded by a different
grant, worked until 12/2000 on the problem of indexing and extracting
non-traditional properties of WWW resources. Junyan presented a paper that we
wrote in collaboration with Narayanan Shivakumar, from Gigabeat Inc., at the
VLDB 2000 conference.
- Jon Oringer, then an MS student at Columbia, worked on this project over
the summer of 1999 and created a geographically-aware search engine that
implements the ideas that we have developed with Junyan Ding and Narayanan
Shivakumar. This search engine is available at
http://geosearch.cs.columbia.edu.
- In addition, several undergraduate and MS students have contributed to
different aspects of the activities above, and have done research-oriented
projects for credit towards their degrees.
- I have created a graduate-level course on advanced database systems that
expanded the curriculum on databases in particular, and on information
systems in general, at Columbia. Enrollment in this course has grown
steadily: 23 students in spring'98, 42 students in spring'99, 58 students in
spring'00, and 67 students in spring'02.
Area References
- Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and
M. Naor, in Proceedings of the 20th Symposium on Principles of Database
Systems (PODS 2001), May 2001.
- Query-based Sampling of Text Databases, J. Callan and M. Connell, in ACM
Transactions on Information Systems, vol. 19, no. 2, April 2001.
- Extracting Patterns and Relations from the World-Wide Web, S. Brin, in
Proceedings of the International Workshop on the Web and Databases
(WebDB'98), March 1998.
- The Anatomy of a Large-scale Hypertextual Web Search Engine, S. Brin and
L. Page, in Proceedings of the Seventh International World-Wide Web
Conference (WWW-7), April 1998.
- Searching Distributed Collections with Inference Networks, J. Callan,
Z. Lu, and W. B. Croft, in Proceedings of the 18th ACM SIGIR conference
(SIGIR'95), July 1995.
- Authoritative Sources in a Hyperlinked Environment, J. Kleinberg, in
Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms,
January 1998.