Carl's Research Page!


From September 1997 through February 2003, I was a member of Columbia's Natural Language Processing Group. My research has primarily focused on the automatic categorization of images using associated text. You can now check out my Ph.D. thesis, "Robust Statistical Techniques for the Categorization of Images Using Associated Text" (or click here for a two-sided version). You can also check out the PowerPoint presentation that I gave at my thesis defense. To make this research possible, I created Columbia University's Text Categorization and Image corpus, and I hope that this corpus becomes an important resource in the field.

Part of my research has involved the design and implementation of two novel text categorization systems. The first involves using a technique known as density estimation to improve Rocchio-based text categorization, while the second involves the use of bins to empirically estimate term weights for groups of words that share similar features. Check out the page devoted to my research on bins.
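
For readers unfamiliar with Rocchio-based categorization, here is a minimal sketch of the plain baseline idea: each category is represented by the centroid of its training documents' term vectors, and a new document is assigned to the category whose centroid is most similar. This is only an illustration under simplifying assumptions (raw term frequencies, cosine similarity, whitespace tokenization); it does not implement the density estimation or bin-based systems described above. The function names and the toy training texts are hypothetical and are not taken from the actual systems or corpus.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Return a length-normalized term-frequency vector for a text."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def train_centroids(labeled_docs):
    """Average the document vectors of each category into one centroid."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for text, label in labeled_docs:
        sizes[label] += 1
        for term, weight in tf_vector(text).items():
            sums[label][term] += weight
    return {
        label: {term: w / sizes[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, centroids):
    """Assign the category whose centroid is most similar to the text."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, invented purely for illustration.
TOY_DOCS = [
    ("earthquake wreckage rescue workers", "Disaster"),
    ("flood victims rescue effort", "Disaster"),
    ("senator announced election campaign", "Politics"),
    ("president meeting parliament vote", "Politics"),
]

if __name__ == "__main__":
    centroids = train_centroids(TOY_DOCS)
    print(classify("rescue workers searched the wreckage", centroids))
```

A full Rocchio formulation typically also uses TF*IDF weights and subtracts a weighted centroid of negative examples; both are omitted here to keep the sketch short.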

My systems have been trained to categorize entire news documents into the categories Struggle, Politics, Disaster, Crime, or Other; the images embedded in the Disaster documents as Workers Responding, Affected People, Wreckage, or Other; the images embedded in the Politics documents as Meeting, Announcement, Politician Photographed, Civilians, Military, or Other; and the images in general as either Indoor or Outdoor. One of my systems is also used for Newsblaster, a system that crawls the web in search of news articles, which are then clustered, categorized, and summarized! Newsblaster is automatically updated every day, and will soon be updated incrementally throughout the day. The categories involved with Newsblaster are U.S. News, World News, Finance, Entertainment, Science and Technology, and Sports.

I have seen some especially interesting results involving the categories for the Disaster images. For these categories, it turns out that NLP techniques are necessary for optimal results! Check out our paper from EMNLP 2002, which discusses these results, or the PowerPoint presentation that I gave at the conference. (More up-to-date results from this research appear in Chapter 6 of my thesis; see the link at the top of the page.)

Originally, I paid special attention to the Indoor and Outdoor categories, and I have a special page with links to documents and other information pertaining to this early work. Check out the page devoted to this work.

You can also obtain some of my publications on-line.


Click here to view my student page!

Click here to view my personal home page!