Programs
IGERT DISTINGUISHED SPEAKER SERIES: Past speakers
Fall 2012
09/18/12
"Solving Problems and Challenges in Telling Good Stories from Data" by Mark Hansen (Columbia Journalism School)
Abstract:
Professor Hansen will talk about telling complex, open-ended stories from equally complex, open-ended data. Over the years, he has had the pleasure of helping out on a number of large-scale art installations – you can think of them as a kind of data visualization – which draw on big data sets or, more precisely, streams. Prof. Hansen looks forward to spending the brown bag session on solving problems and challenges in telling good stories from data.
10/16/12
"Temporal Knowledge Base Population" by Heng Ji (CUNY - Queens)
Abstract:
Temporal Information Extraction is of significant interest for a variety of natural language processing (NLP) applications. Some early work focused on extracting temporal relations from individual documents in isolation. Most of these methods have been developed around the TempEval task using TimeBank (Pustejovsky et al., 2003). In practice, however, we may need to gather information about an entity that is scattered among the documents of a large collection. This requires the ability to identify the relevant documents and to integrate facts - possibly redundant, possibly complementary, possibly in conflict - expressed in these documents. Furthermore, we may want to use the extracted information to augment an existing database. Recent research on temporal Knowledge Base Population advances traditional temporal Information Extraction from a single-document to a cross-document paradigm, so that much richer information can be discovered from large-scale corpora using cross-document aggregation. This new setting presents many new challenges pertaining to both annotation acquisition and system design. The goal of this talk is to provide an overview of these challenges, the linguistic theories behind various temporal information representations, general algorithmic frameworks and tools, state-of-the-art algorithmic approaches to specific problems, and an analysis of evaluation results. We will focus on concrete temporal classification and structured prediction solutions that deal with the lack of data (distant supervision, instance re-labeling). Our toolkit is made freely available to the research community.
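As a rough illustration of the distant supervision idea mentioned above (a sketch of the general technique, not the speaker's system), sentences that mention both an entity and an attribute value already recorded in a knowledge base can be heuristically labeled as positive training instances:

```python
# Toy sketch of distant supervision for slot filling. The KB fact and the
# sentences below are illustrative examples only.
KB = {("Barack Obama", "employee_of"): "United States Senate"}

sentences = [
    "Barack Obama served in the United States Senate from 2005 to 2008.",
    "Barack Obama visited Chicago in 2010.",
]

def distant_label(sentences, kb):
    """Label a sentence positive if it mentions both the entity and the KB value."""
    labeled = []
    for (entity, relation), value in kb.items():
        for s in sentences:
            if entity in s and value in s:
                labeled.append((s, entity, relation, value, "positive"))
    return labeled

for example in distant_label(sentences, KB):
    print(example)
```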
11/20/12
"It's Time for Events: Event Information in Social Media" by Mor Naaman (Rutgers Univ.)
Abstract:
Increasingly, the bulk of information from world events is being contributed by individuals through social media channels: on photo and video-sharing sites (e.g., Flickr, YouTube), as well as on social networking sites (e.g., Facebook, Twitter). These events range from major global events such as the Syrian uprising or the earthquake in Haiti, to local events and emergencies like a plane landing on the Hudson River, to more mundane events such as a Presidential speech or a music festival. I will describe research on the identification of content related to events (currently fragmented across social media sites), and on presentation and visualization techniques for event content that can otherwise be overwhelming in volume. With this work, we enable multiple stakeholders like journalists, first responders, researchers, policy makers, and the public to see and understand what happened in world events, using social media.
11/27/12
"Digital Humanities at Columbia" by Dennis Y. Tenen (Columbia University, English & Comparative Literature)
Abstract:
Dennis Tenen works in the field of computational culture studies, both in the sense of the critical study of computational culture and in the sense of applying computational approaches to the study of culture. His talk will introduce the audience to the emerging discipline of digital humanities, discuss its problems and premises, and finally outline several local projects aimed at transforming the institutions that shape our teaching and research.
Spring 2013
02/08/13
"Watching the Watchers: How the Wall Street Journal uses Big Data to Investigate Digital Privacy" by Julia Angwin (Technology Journalist at Wall Street Journal) and Jeremy Singer-Vine (reporter at Wall Street Journal)
Abstract:
Data is political. That is why the Wall Street Journal's data investigations team likes to collect its own data, rather than rely on data collected by others. Journal investigative reporters Julia Angwin and Jeremy Singer-Vine talk about what it takes to turn computing power into compelling narrative stories for a broad audience.
02/22/13
"Heterogeneity Meets Rarity: Mining Multi-Faceted Diamond" by Jingrui He (Stevens Institute of Technology)
Abstract:
Many real-world problems exhibit both heterogeneity and rarity. Take insider threat detection from various social contexts as an example. While the target malicious insiders may only be a very small portion of the entire population (i.e., rarity), each person can be characterized by rich features, such as social friendship, emails, instant messages, etc. (i.e., feature heterogeneity). Moreover, different types of insiders, though correlated, may exhibit different statistical characteristics (i.e., task heterogeneity). For such problems, how can we quickly identify an example from a new rare category? How can we leverage both feature heterogeneity and task heterogeneity to maximally boost the learning performance? In this talk, I will present our recent work on addressing these two challenges. For the challenge of rarity, I will introduce rare category analysis, e.g., how to detect the rare examples with the help of a labeling oracle. For the challenge of heterogeneity, I will present a graph-based approach taking into consideration both feature heterogeneity and task heterogeneity. I will also talk about how these techniques can be used in applications such as insider threat detection.
03/01/13
"Language as influence(d): Power and memorability" by Lillian Lee (Cornell Univ.)
Abstract:
What effect does language have on people, and what effect do people have on language? You might say in response, "Who are you to discuss these problems?" and you would be right to do so; these are Major Questions that science has been tackling for many years. But as a field, I think natural language processing and computational linguistics have much to contribute to the conversation, and I hope to encourage the community to further address these issues. To this end, I'll describe two efforts I've been involved in. The first project provides evidence that in group discussions, power differentials between participants are subtly revealed by how much one individual immediately echoes the linguistic style of the person they are responding to. We consider multiple types of power: status differences (which are relatively static), and dependence (a more "situational" relationship). Using a precise probabilistic formulation of the notion of linguistic coordination, we study how conversational behavior can reveal power relationships in two very different settings: discussions among Wikipedians and arguments before the U.S. Supreme Court. Our second project is motivated by the question of what information achieves widespread public awareness. We consider whether, and how, the way in which the information is phrased --- the choice of words and sentence structure --- can affect this process. We introduce an experimental paradigm that seeks to separate contextual from language effects, using movie quotes as our test case. We find that there are significant differences between memorable and non-memorable quotes in several key dimensions, even after controlling for situational and contextual factors. One example is lexical distinctiveness: in aggregate, memorable quotes use less common word choices (as measured by statistical language models), but at the same time are built upon a scaffolding of common syntactic patterns. Joint work with Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jon Kleinberg, and Bo Pang.
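As a rough sketch of the lexical distinctiveness measurement described above (an illustration of the idea, not the authors' exact models), one can score a quote by its average per-word log-probability under a unigram language model estimated from a background corpus; the tiny "corpus" below is only a stand-in:

```python
import math
from collections import Counter

# Toy background corpus standing in for a large collection of ordinary text.
background = "the of and a to in is it that he she was for on are with".split() * 100

counts = Counter(background)
total = sum(counts.values())
vocab = len(counts)

def unigram_logprob(word: str) -> float:
    # Add-one smoothing so unseen words still get a (small) probability.
    return math.log((counts[word] + 1) / (total + vocab))

def distinctiveness(quote: str) -> float:
    """Average per-word log-probability; lower means less common word choices."""
    words = quote.lower().split()
    return sum(unigram_logprob(w) for w in words) / len(words)

for q in ["you had me at hello", "frankly my dear I don't give a damn"]:
    print(q, round(distinctiveness(q), 2))
```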
03/29/13
"The Promise of Crowdsourcing for Natural Language Processing and Other Data Sciences" by Chris Callison-Burch (Johns Hopkins Univ.)
Abstract:
Crowdsourcing is a new tool for data scientists that allows us to collect data and annotations on a large scale and at low cost. This offers new possibilities for research in economics, linguistics and other social sciences, as well as for computer vision, natural language processing (NLP) and other machine learning applications. Prof. Callison-Burch will discuss how he uses crowdsourcing to create speech and language data for NLP. He will detail a number of his own recent experiments using Amazon Mechanical Turk for NLP, including:
- Building quality control models to achieve professional translation quality from non-professional translators
- Taking a census of the language skills of 4000 Turkers from more than 100 countries
- Collecting sufficient volumes of data to train statistical translation models that beat state-of-the-art translation systems.
He will also present some of his preliminary studies into collecting political science data. He will discuss the general challenges of crowdsourcing, including quality control, conveying complex tasks to lay users, and professional vs. non-professional annotation, as well as its advantages, including scalability and access to a worldwide workforce with diverse language skills. Based on his own experience, he will attempt to give general guidance about when crowdsourcing works and when it doesn't, and how to customize annotation schemes to be more appropriate for the Mechanical Turk crowdsourcing platform.
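As a toy illustration of redundancy-based quality control of the kind mentioned above (our own sketch, not Prof. Callison-Burch's models), one simple heuristic is to collect several translations of the same sentence and prefer the candidate that agrees most, on average, with the others:

```python
from collections import Counter

def word_overlap(a: str, b: str) -> float:
    """Dice-style word overlap between two candidate translations."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((wa & wb).values())
    return 2 * shared / (sum(wa.values()) + sum(wb.values()))

def pick_consensus(candidates):
    """Return the candidate with the highest average overlap with the others."""
    def avg_overlap(c):
        others = [o for o in candidates if o is not c]
        return sum(word_overlap(c, o) for o in others) / len(others)
    return max(candidates, key=avg_overlap)

candidates = [
    "the cat sat on the mat",
    "the cat is sitting on the mat",
    "cat mat sit",                     # low-effort submission disagrees with everyone
]
print(pick_consensus(candidates))      # -> "the cat sat on the mat"
```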
04/12/13
“Text Normalization” by Richard Sproat (Google)
Abstract:
The education page for the Columbia IGERT program lists this speaker series as taking place "every other Friday, 3:00-4:30pm beginning February 8th, 2013". Any competent speaker of English reading this phrase aloud would read "3:00-4:30pm" as something like "three to four thirty pee em", "8th" as the ordinal number "eighth", and "2013" as "twenty thirteen". Yet none of the written expressions are found in any dictionary; speakers must translate these "non-standard words" into ordinary words as part of their process of converting between text and speech.
Text normalization is the problem of building computational algorithms that mimic this process -- for example as part of a text-to-speech synthesizer. Since there are many classes of "non-standard words" -- besides dates and times, there are currency amounts, measures, abbreviations, etc. -- building a wide-coverage text normalization system for a language is labor intensive. For some languages, such as Russian with its complex inflectional morphology, the process can be especially difficult. In this talk he will outline the problem, and give specific examples of why it is hard, and why large parts of text normalization systems are still constructed by hand, rather than trained using machine-learning algorithms. Nevertheless, there are some areas of the problem where machine-learning has made inroads, and he will discuss some of his own research along those lines.
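A minimal sketch of the idea, covering only two classes of non-standard words (ordinals and four-digit years) with toy hand-written rules; real text normalization systems handle many more classes, contexts, and languages:

```python
import re

# Toy rules for reading small numbers, ordinals, and years as words.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}

def two_digits(n: int) -> str:
    """Read a number 0-99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def ordinal(n: int) -> str:
    """Read a small cardinal as an ordinal ('8' -> 'eighth')."""
    words = two_digits(n).split()
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        words[-1] = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):            # twenty -> twentieth
        words[-1] = last[:-1] + "ieth"
    else:
        words[-1] = last + "th"
    return " ".join(words)

def year(n: int) -> str:
    """Read a four-digit year ('2013' -> 'twenty thirteen'); common cases only."""
    hi, lo = divmod(n, 100)
    if lo == 0:
        return two_digits(hi) + " hundred"
    return two_digits(hi) + " " + ("oh " + ONES[lo] if lo < 10 else two_digits(lo))

def normalize(text: str) -> str:
    text = re.sub(r"\b(\d{1,2})(st|nd|rd|th)\b",
                  lambda m: ordinal(int(m.group(1))), text)
    text = re.sub(r"\b(1[5-9]\d\d|20\d\d)\b",
                  lambda m: year(int(m.group(1))), text)
    return text

print(normalize("beginning February 8th, 2013"))
# beginning February eighth, twenty thirteen
```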
04/19/13
"Three Problems in Social Computing" by Sep Kamvar (MIT)
Abstract:
Social computing has come to refer to three things: the development of algorithms that use personal and social features, the mining of social data, and the building of systems that interact with networks of people. These three branches tend to have quite different technical challenges. I'll discuss one problem from each area to highlight both the differences and the overlap between the three branches, and I'll discuss the challenges that this new field poses to research in numerical linear algebra, machine learning, human-computer interaction, and programming languages.
04/22/13
"Opportunities from Social Media Data for Public Health" by Mark Dredze (Johns Hopkins University)
Abstract:
Twitter and other social media sites contain a wealth of information about populations and have been used to track sentiment towards products, measure political attitudes, and study social linguistics. In this talk, we investigate the potential for Twitter and social media to impact public health research. Broadly, we explore a range of applications for which social media may hold relevant data, including disease surveillance, public safety, and drug usage patterns. To uncover these trends, we develop new statistical models that can reveal trends and patterns of interest to public health from vast quantities of data. Our results suggest that social media has broad applicability for public health research.
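As a toy illustration of the kind of surveillance signal described above (our own sketch, not the speaker's statistical models), one could count keyword-matched tweets per week and treat the weekly volume as a crude disease-trend indicator:

```python
from collections import Counter
from datetime import datetime

# Illustrative keyword list; real systems use far richer models of language.
FLU_TERMS = ("flu", "influenza", "fever", "sore throat")

def weekly_flu_counts(tweets):
    """tweets: iterable of (timestamp_iso, text) pairs; returns counts by ISO week."""
    counts = Counter()
    for ts, text in tweets:
        if any(term in text.lower() for term in FLU_TERMS):
            week = datetime.fromisoformat(ts).isocalendar()[:2]  # (year, week)
            counts[week] += 1
    return dict(sorted(counts.items()))

tweets = [
    ("2013-01-07T09:00:00", "Home with the flu again :("),
    ("2013-01-08T12:30:00", "sore throat and fever, staying in bed"),
    ("2013-01-15T18:00:00", "Great concert last night!"),
]
print(weekly_flu_counts(tweets))  # e.g. {(2013, 2): 2}
```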
Spring 2014
1/31/14
Reverse engineering the neural mechanisms involved in speech processing by Nima Mesgarani (Assistant Professor of Electrical Engineering at Columbia)
Abstract:
The brain empowers humans and other animals with remarkable abilities to navigate their acoustic environment in highly degraded conditions. This seemingly trivial task for humans has proven extremely difficult to model and implement in machines. One crucial limiting factor has been the need for a deep interaction between two very different disciplines: neuroscience and engineering. In this talk, I will present results of an interdisciplinary research effort to address the following fundamental questions: 1) How does the brain represent and process speech in normal and noisy conditions? 2) How could we model and implement these processes algorithmically? 3) Could we build an interface to directly read speech signals from the brain? I will present results of recent experiments in which electrodes were surgically implanted in the auditory cortex of epilepsy patients, revealing an unprecedented view of the neural activity in the human brain. These findings have inspired novel speech processing algorithms that have been used by DARPA and other agencies. This integrated research approach can lead to a better scientific understanding of the brain, and to a new generation of brain-machine interfaces that may eventually allow communication by people who have lost their ability to speak. These results have appeared in journals such as Science and Nature.
2/14/14
Modeling Social Data by Jake Hofman (Researcher at Microsoft Research in New York City)
Abstract:
In this talk I will provide an overview of several recent projects in modeling social data, ranging from Web search and browsing activity to social media content. First, I will discuss the utility of search activity for predicting collective behavior, specifically future sales of music, movies, and video games. Next, I will present work that pairs browsing histories for a large, representative panel of individuals with user-level demographic data to study variation in Web activity among different demographic groups. I will conclude with an empirical study of information diffusion that investigates the structure of billions of diffusion events on Twitter.
2/21/14
Deep Learning and the Representation of Natural Data by Yann LeCun (Director of AI Research at Facebook)
Abstract:
The combined emergence of very large datasets, powerful parallel computers, and new machine learning methods has enabled the deployment of highly accurate computer perception systems, and is opening the door to a wide deployment of AI systems. A key component in systems that can understand natural data is a module that turns the raw data into a suitable internal representation. But designing and building such a module, often called a feature extractor, requires a considerable amount of engineering effort and domain expertise.
The main objective of 'Deep Learning' is to come up with learning methods that can automatically produce good representations of data from labeled or unlabeled samples. Deep learning allows us to construct systems that are trained end to end, from raw inputs to ultimate output. Instead of having a separate feature extractor and predictor, deep architectures have multiple stages in which the data is represented hierarchically: features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input.
The convolutional network model (ConvNet) is a particular type of deep architecture that is somewhat inspired by biology, and consists of multiple stages of filter banks, interspersed with non-linear operations and spatial pooling. ConvNets have become the record holders for a wide variety of benchmarks and competitions, including object detection, localization, and recognition in images, semantic image segmentation and labeling (2D and 3D), acoustic modeling for speech recognition, drug design, handwriting recognition, biological image segmentation, etc.
The most recent speech recognition and image understanding systems deployed by Facebook, Google, IBM, Microsoft, Baidu, NEC and others use deep learning, and many use convolutional networks. Such systems use very large and very deep ConvNets with billions of connections, trained using backpropagation with stochastic gradient descent and heavy regularization. But many new applications require the use of unsupervised feature learning methods. A number of methods based on sparse auto-encoders will be presented.
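The architecture and training regime described above can be sketched in a few lines of PyTorch (an illustrative toy, orders of magnitude smaller than the deployed systems mentioned in the abstract): stages of filter banks, non-linearities, and spatial pooling, trained end to end with backpropagation and stochastic gradient descent:

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Two ConvNet stages followed by a linear classifier (sizes are illustrative)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filter bank
            nn.ReLU(),                                     # non-linearity
            nn.MaxPool2d(2),                               # spatial pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # for 32x32 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                 # increasingly abstract features
        return self.classifier(h.flatten(1))

# One end-to-end training step: backpropagation + SGD, weight decay as regularization.
model = TinyConvNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```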
Several applications will be shown through videos and live demos, including a category-level object recognition system that can be trained on the fly, a system that can label every pixel in an image with the category of the object it belongs to (scene parsing), a pedestrian detector, and object localization and detection systems that rank first on the ImageNet Large Scale Visual Recognition Challenge data. Specialized hardware architectures that run these systems in real time will also be described.
3/28/14
When Enough is Enough: Location Tracking, Mosaic Theory and Machine Learning by Steve Bellovin (Professor of Computer Science at Columbia), Sebastian Zimmeck (lawyer and Ph.D. candidate in Computer Science at Columbia) & Tony Jebara (Associate Professor of Computer Science at Columbia)
Abstract:
Should police be required to get a warrant before tracking a person’s location? It’s an open legal question. On the one hand, movement in public is, of course, public; on the other hand, there is little doubt that the totality of a person’s movements can reveal intimate details of someone’s life. Scholars who prefer the latter viewpoint (which is known as the “mosaic theory”) say that prolonged tracking is a search within the meaning of the Fourth Amendment; accordingly, a warrant should be required. However, opponents have raised a number of objections, including the difficulty of drawing the line between permissible short-term surveillance and impermissible warrantless long-term monitoring. Indeed, Supreme Court Justice Scalia has observed that “it remains unexplained why a 4-week investigation is ‘surely’ too long . . . . What of a 2-day monitoring . . . .?” In our study, we use computer science to answer this legal question. In particular, we show how machine learning can be combined with k-anonymity and other privacy metrics and how the pairing can be applied to the Fourth Amendment. Based on our concept and the results of human mobility studies, we believe that on average more than a week of GPS location tracking without a warrant violates the Fourth Amendment.
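As a toy illustration of how location data and a k-anonymity-style metric can be combined (a sketch of the general idea, not the authors' actual method), one can generalize each person's GPS trace to a coarse grid and ask how many people in the population share the same generalized trace; small anonymity sets suggest the trace is identifying:

```python
from collections import Counter

def generalize(trace, cell=0.01):
    """Snap a sequence of (lat, lon) points to a coarse grid."""
    return tuple((round(lat / cell), round(lon / cell)) for lat, lon in trace)

def anonymity_set_sizes(traces, cell=0.01):
    """Map each person to the number of people sharing their generalized trace."""
    generalized = {pid: generalize(t, cell) for pid, t in traces.items()}
    counts = Counter(generalized.values())
    return {pid: counts[g] for pid, g in generalized.items()}

# Illustrative traces: A and B move together, C follows a unique path.
traces = {
    "A": [(40.7128, -74.0060), (40.7306, -73.9352)],
    "B": [(40.7129, -74.0061), (40.7305, -73.9353)],
    "C": [(40.8075, -73.9626), (40.7527, -73.9772)],
}
print(anonymity_set_sizes(traces))  # {'A': 2, 'B': 2, 'C': 1}
```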
Spring 2015
1/30/15
The Making and Knowing Project: Historians in the Laboratory by Pamela Smith (Columbia University, History)
Abstract: Pre-modern craft knowledge is one of the foundations of modern science, but we have very little insight into the largely oral culture of the craft workshop. The Making and Knowing Project brings together scholars in the humanities, natural sciences, and digital studies to reconstruct the technical procedures contained in a rare sixteenth-century written document from a craft workshop, in which an anonymous French-speaking practitioner took the unusual step of setting down on paper his techniques for a number of processes that we would now classify as part of the fine arts, of craft, and of technology. One aim of The Making and Knowing Project is to produce an open-access digital edition and English translation of this intriguing text. The digital edition is only one dimension of the project, however, for the process by which this critical edition will be produced is as important as its product. Research for the edition forms an experiment in both pedagogy and humanistic research. It will involve Columbia students working alongside academic and museum-based historians of art and of science in collaboration with experienced makers and digital scholars to reconstruct the technical recipes contained in the manuscript. Their findings will be used to understand and annotate the digital edition, and their experiences will foster the sharing of expertise across disciplines as well as the engagement of students with the material culture of the past.
2/6/15
Computational Healthcare - Healthcare in the Era of Big Data by Shahram Ebadollahi (IBM Research)
Abstract: The healthcare industry is at an interesting juncture. On one hand, data and knowledge are being generated and are becoming increasingly accessible at very large volumes. On the other hand, there have been great advances in what are referred to as big data technologies. The confluence of these trends has the potential to enable dramatic advances in the broad area of healthcare across payers, providers, and pharmaceutical companies, leading to better health and well-being for people. In this talk, I will take the audience through some of the novel technologies and methodologies and will provide examples where the use of such advanced technologies is already making a difference across the different constituents of the healthcare eco-system.
2/27/15
From Data to Discovery: Data-Driven Approach to Facilitating Chronic Disease Self-Management by Lena Mamykina (Columbia University, Biomedical Informatics)
Abstract: With the growing prevalence of chronic diseases, more individuals need to proactively engage in self-management of their health. For many chronic conditions, such as asthma, hypertension, and diabetes, self-monitoring has long been an integral and critical component of self-management. Novel technologies provide an unprecedented opportunity to capture and monitor data related to health and wellness. Yet, despite the general enthusiasm for data-enabled discovery in healthcare in general, and in health self-management in particular, there remains considerable skepticism regarding the ability of individuals and their providers to make sense of the data collected through self-monitoring and to translate these data into improvements in self-management behaviors. Recent research has shown that emerging wearable self-monitoring technologies are falling short of inspiring long-term adoption and are often abandoned after only six months of use. In this talk I will discuss results of several studies that suggest potential reasons for the abandonment of self-monitoring technologies, and outline directions for future research in data-driven technologies for facilitating health self-management.
3/6/15
Hate Speech Detection by Joel Tetreault; TempEval and "real world" date/time challenges by Amanda Stent; Insights from Big Data: Interaction, Design, and Innovation by Alex Jaimes (Yahoo Labs)
Abstract: Hate speech can be defined as any abusive language directed towards specific minority groups with the intention to demean. While several countries actually protect this type of language under the right to free speech, many internet providers prohibit the use of such language on their properties under their terms of service. The reason for this is that such language makes internet forums and comment sections unwelcoming and thus stunts discussion. In this talk, we describe preliminary work on detecting hate speech and malicious language on the internet. Specifically, we discuss issues with defining hate speech and its impact on annotation and evaluation, and then describe a statistical classifier for detecting hate speech in the comments sections of proprietary news and finance web articles.
TempEval is a series of shared tasks aimed at processing temporal information in text. Unusually for an NLP task, the top systems at TempEval typically include both statistical and rule-based systems. Companies need to process temporal information in text in order to, e.g., choose relevant documents to present to users, cluster and simplify document sets to reduce information overload, and present infographics like timelines of financial events. In this talk I will present some work my team and I have been doing to process temporal information in text and highlight issues that arise when moving from carefully curated shared task data to noisy "real world" data.
In recent years, our ability to process large amounts of data has increased significantly, creating many opportunities for innovation. Having large quantities of data, however, does not necessarily turn into actionable insights that make a difference for users in consumer applications. In this talk I will give a quick overview of some ways in which “big data” can be used in industry, with a particular focus on human-centered approaches to innovation. In particular, I will discuss how the combination of qualitative and quantitative methods can be of benefit, giving examples around social media and giving an overview of some of the areas of research I am currently focusing on at Yahoo!. Within this context, I will outline a blueprint for a research framework as it applies to innovation, and discuss specific technical approaches within that framework. I will argue for the importance of taking a human-centered view and highlight what I consider the most fundamental problems in computer science today from that perspective.
3/27/15
Challenges and opportunities in statistical neural data analysis by Liam Paninski (Columbia University, Statistics)
Abstract: Systems and circuit-level neuroscience has entered a golden age: with modern fast computers, machine learning methods, and large-scale multineuronal recording and high-resolution imaging techniques, we can analyze neural activity at scales that were impossible even five years ago. One can now argue that the major bottlenecks in systems neuroscience no longer lie just in collecting data from large neural populations, but rather in understanding this data. I'll discuss several cases where basic neuroscience problems can be usefully recast in statistical language; examples include inference of network connectivity and low-dimensional dynamical structure from multineuronal spiking data.
4/3/15
Information Extraction Over Large Volumes of Text: Efficiency Challenges and A Key Public Health Application by Luis Gravano (Columbia SEAS, Computer Science)
Abstract: Information extraction systems identify and extract intrinsically structured data that is embedded in natural-language text documents, hence enabling next-generation web search, expressive SQL-style querying and data mining over the extracted data, and much more. Unfortunately, information extraction is a time-consuming process, often involving complex text analysis, so exhaustively processing all documents in a large text database - or on the web or social media - could be prohibitively expensive. In this talk, I will discuss ways in which we can improve the efficiency - and hence the scalability - of the information extraction process. I will also discuss an application of information extraction to an important problem in public health. Specifically, I will describe our ongoing collaboration with the New York City Department of Health and Mental Hygiene to detect foodborne disease outbreaks in New York City restaurants through the analysis of social media documents.
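One simple way to picture the efficiency issue (an illustrative sketch, not Prof. Gravano's techniques): run a cheap keyword filter over the collection first and send only promising documents to the expensive extraction pipeline:

```python
import re

# Illustrative cheap filters; a real system would learn or tune these.
SYMPTOM_WORDS = re.compile(r"\b(food poisoning|vomit(?:ing)?|sick|nausea)\b", re.I)
RESTAURANT = re.compile(r"\brestaurant\b", re.I)

def cheap_filter(doc: str) -> bool:
    """Fast pre-filter: keep only documents that mention symptoms and a restaurant."""
    return bool(SYMPTOM_WORDS.search(doc) and RESTAURANT.search(doc))

def expensive_extract(doc: str):
    """Stand-in for a full extraction pipeline (parsing, entity and relation extraction)."""
    return {"text": doc}

docs = [
    "Got food poisoning after eating at a restaurant downtown last night.",
    "Great weather in the city today!",
]
extracted = [expensive_extract(d) for d in docs if cheap_filter(d)]
print(len(extracted), "of", len(docs), "documents sent to the extractor")
```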
4/17/15
Computational Social Science: Exciting Progress and Future Challenges by Duncan Watts (Microsoft Research)
Abstract: The past 15 years have witnessed a remarkable increase in both the scale and scope of social and behavioral data available to researchers, leading some to herald the emergence of a new field: “computational social science.” Against these exciting developments stands a stubborn fact: that in spite of many thousands of published papers, there has been surprisingly little progress on the “big” questions that motivated the field in the first place—questions concerning systemic risk in financial systems, problem solving in complex organizations, and the dynamics of epidemics or social movements, among others. In this talk I highlight some examples of research that would not have been possible just a handful of years ago and that illustrate the promise of CSS. At the same time, they illustrate its limitations. I then conclude with some thoughts on how CSS can bridge the gap between its current state and its potential.
5/1/15
Patterns of Large-Scale Attention by Mor Naaman (Cornell Tech)
Abstract: Complaints about information overload date back to medieval times, but only recently has the competition over our attention become so fierce. At the same time, researchers now have new opportunities to capture and model the attention we collectively pay, and to use this data to generate new insights and applications. I will give two examples of mining attention from different domains. First, I use reading depth data for online media to show attention patterns in online media and how they depend on factors like device, referral source, and even features of the text. Second, I use geo-tagged social media data to show how people pay attention to different hyper-local locations, how this attention is spread differently depending on service and device, and what new systems can be enabled by this information.
