CS 86998/EE 6898/EECS 6898: Topics - Information Processing: From Data to Solutions, Fall 2013

Time: Friday, 1:10-3:00pm
Place: CSB 453 (CS Conference Room)


Shih-fu Chang (Office Hours: TBD) sfchang_AT_ee.columbia.edu, 212-854-6894

Noemie Elhadad (Office Hours: TBD) noemie_AT_dbmi.columbia.edu

Teaching Assistant: Jessica Ouyang (Office Hours: TBD) ouyangj_AT_cs.columbia.edu



This course is designed for participants in the NSF IGERT program "From Data to Solutions".  Students in the seminar may be IGERT Trainees, IGERT affiliates, or other students having the permission of one of the instructors.  The course will consist of a series of presentations by faculty and staff at Columbia and CUNY who will describe interesting problems involving very large amounts of data (text, audio, image, video) that require interdisciplinary collaboration with faculty and students in Computer Science, Electrical Engineering, Statistics, Psychology, Biomedical Informatics, Business and Journalism.  Students taking the course will complete short reading assignments for each class, turn in one-page reports on each of the presentations, and prepare a final longer report on one of the problems presented as a final project.  Actual experimental implementations will be welcome, but not mandatory. Some proposed projects may be selected and invited to continue in the following semester or summer under the supervision of the instructors or other participating faculty or researchers from industry.  There are no prerequisites for the course and no exams; however, students who are members of the IGERT: From Data to Solutions project (Trainees and Affiliates) will have preference in enrollment.  This is a required course for IGERT Trainees.


Students will be expected to complete all reading assignments before the class for which they are assigned.  Students will prepare short reports on each of the presentations.  These must be submitted in CourseWorks before the following class. Each student will prepare a longer report outlining an approach to one of the interdisciplinary problems described in the presentations.  There will be no midterm or final exam.  Grades will be based on class participation, weekly short reports, and the final report.

A guide to weekly reporting can be found here.

An example can be found here.

Information on final reports and presentations can be found here and here.


Class participation: 30%

Short Reports: 30%

Final Report: 40%

Academic Integrity

Copying or paraphrasing someone's work (code included), or permitting your own work to be copied or paraphrased, even if only in part, is not allowed, and will result in an automatic grade of 0 for the entire assignment or exam in which the copying or paraphrasing was done. Your grade should reflect your own work. If you believe you are going to have trouble completing an assignment, please talk to the instructor or TA in advance of the due date.


Required readings are available online from links in the syllabus below.


There is no report due for Week 13.

The final reports are due on Wednesday 11 December at 1pm. Final presentations are also on Wednesday 11 December at 1pm in the EE conference room, Mudd 1306.



Date Topic Readings Presenters
Week 1 (9/6)

Title: Biomedical engineering and informatics applications in the intensive care unit

Description: Discussion of the increasing and essential role of biomedical engineering and biomedical informatics in intensive care medicine. The talk will span medical devices for patient monitoring, device integration and data collection, data analysis, and data visualization to facilitate medical decision making. Students should come to appreciate the tremendous unmet need for engineers in healthcare and the potential impact they could have on improving the lives of our sickest patients.


Hemphill11, Claassen13, Cohen10

Michael Schmidt is an Assistant Professor of Clinical Neuropsychology in Neurology at Columbia University College of Physicians and Surgeons. Dr. Schmidt received his undergraduate degree in psychology from Michigan State University and his doctorate in Neuropsychology from the City University of New York. Dr. Schmidt completed a post-doctoral research fellowship in the Division of Critical Care Neurology at Columbia University that led to his current position in 2005. In 2009, Dr. Schmidt received a 3-year CTSA K12 career development award from the Columbia University Irving Institute for clinical and translational research. He completed a Master's degree in Biostatistics: Patient-Oriented Research from the Columbia School of Public Health in 2011. Dr. Schmidt is the Director of the Neuro-ICU Neuromonitoring and Informatics program and the Columbia University Undergraduate Research Internship in Neurology and Neurosurgery. Dr. Schmidt's interests concentrate on personalized medicine in the Neuro-ICU, including generation of patient-specific physiological targets and early detection of secondary complications related to critical brain injuries through real-time analysis of neurophysiological monitoring data, the use of clinical informatics to support patient management decisions within the intensive care unit, and identifying modifiable factors that drive health outcomes following critical brain injuries. His research as a co-investigator to determine patient status utilizing multimodal neuromonitoring data from critical brain injury patients is supported by the Dana Foundation.
Week 2 (9/13)

Speaker: Steve Lohr

Title: The Age of Big Data

Description: Big Data is a vague term, used loosely, if often, these days. But put simply, the catchall phrase means three things. First, it is a bundle of technologies. Second, it is a potential revolution in measurement. And third, it is a point of view, or philosophy, about how decisions will be -- and perhaps should be -- made in the future. This talk will elaborate on those three themes. It will also describe the historical context for the technologies and mindset that now fly under the banner of Big Data, and touch on the promise and pitfalls of this approach to decision making.

Slides and script.

Speaker: Mark Hansen

Title: Database and/as narrative

John Tukey wrote that the clever data analyst need only "listen to what his data had to tell him." In this talk, I will present a series of art projects that pull stories from data. "Before Us Is the Salesman's House" is a recent work commissioned by eBay as part of the Zero1 Festival in San Jose, CA. Through it, Jer Thorp and I examine how to literally "read" one data set through another. "Exit," developed for the Fondation Cartier pour l'art contemporain, Paris, builds on curator and cultural theorist Paul Virilio's notion that what most defines humanity today are our patterns of migration. The installation visualizes the global movement of people, both forced and voluntary and due to various factors (whether political, economic, or environmental), through a series of six panoramic narratives displayed over the course of 42 minutes. Finally, I will describe "Shuffle," a performance created for the New York Public Library's Centennial celebration. The piece is a site-specific mash-up of three texts performed by the Elevator Repair Service over the last decade: The Great Gatsby, The Sound and the Fury, and The Sun Also Rises, simultaneously.

Halevy09, The Fourth Paradigm (Foreword, pp xi-xv; Jim Gray on eScience, pp xvii-xxxi), Lohr13

Steve Lohr reports on technology, business, and economics. He was a foreign correspondent for the Times for a decade and served brief stints as an editor, before covering technology, starting in the early 1990s. In 2013, he was part of the team awarded the Pulitzer Prize for Explanatory Reporting "for its penetrating look into business practices by Apple and other technology companies that illustrates the darker side of a changing global economy for workers and consumers." He has written for magazines including The New York Times Magazine, The Atlantic Monthly, and The Washington Monthly. He is the author of a history of computer programming, "Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists and Iconoclasts -- The Programmers Who Created the Software Revolution" (Basic Books, 2001; paperback, 2002).

Mark Hansen is a professor in the Columbia University Graduate School of Journalism.

Week 3 (9/20)

Speaker: John Paisley

Title: Variational Inference and Big Data

Description: This talk presents stochastic variational inference, a scalable algorithm for approximating posterior distributions. Stochastic variational inference lets one apply complex Bayesian models to massive data sets. This technique applies to a large class of probabilistic models and outperforms traditional batch variational inference, which can only handle small data sets. Stochastic inference is a simple modification to the batch approach, so a significant part of the discussion will focus on reviewing this traditional batch inference method.


Speaker: Daniel Hsu

Title: Machine learning and privacy

Many important applications of machine learning crucially rely on sensitive information collected about individuals (e.g., shopping habits, medical records, financial histories). The failures of conventional anonymization techniques have caused public embarrassment, and therefore indicate that privacy should be a first-order concern in the design of machine learning methods. This talk will give an overview of some recent research along these lines.


Hoffman13, Dwork10

John Paisley is an assistant professor of electrical engineering at Columbia University. He received his Ph.D. in electrical engineering from Duke University in 2010 and did post-docs in the computer science departments at Princeton University and UC Berkeley. He is interested in machine learning, particularly probabilistic models and inference techniques, Bayesian nonparametrics, dictionary learning and topic modeling.

Daniel Hsu is an assistant professor in the Department of Computer Science at Columbia University. Previously, he was a postdoc at Microsoft Research New England from 2011 to 2013; before that, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from the Department of Computer Science and Engineering at UC San Diego, where he was advised by Sanjoy Dasgupta. He received his B.S. in Computer Science and Engineering in 2004 from the Department of Electrical Engineering and Computer Sciences at UC Berkeley. His research interests are in algorithmic statistics and machine learning.

Week 4 (9/27)

Title: Identifying Deception from Speech

Abstract: There has been considerable interest in recent years in automatic methods of detecting deception to supplement current human and polygraph approaches, especially using new sources of information. Evidence of deception appears in many dimensions: biometric information, body gesture, facial expression, written words, and speech characteristics. Our focus is on detecting deception from acoustic/prosodic and lexical cues in speech. We are collecting large corpora of deceptive and non-deceptive speech to study how speakers vary their productions when lying and telling the truth. Our machine learning experiments predicting deception achieve performance which compares favorably with the performance of human judges on the same data and task. We find that personality factors may be a key factor in successful human judgments, based on our perception studies, and hypothesize that these may also play an important role in the individual differences we find in production.

Hirschberg05, Enos06

Julia Hirschberg (Engineering)
Week 5 (10/4)

Title: Transforming the Impossible to the Natural

Abstract: Reading science fiction over the past one hundred years, one sees many seemingly impossible machines and services, which are now not only widely available, but have become accepted as natural. In this talk, I will share examples that show how technologies developed in research labs have impacted real life user experiences. For example, body gesture, speech, natural user intent understanding, and other new usage scenarios have all recently impacted how users utilize computing. Looking forward, I see exciting opportunities for research to further extend what is considered natural when using computers. What's natural in computing at the end of the 21st century will be drastically different than what we find common today.


Chen13, Wang12, Weng13, Zheng, Zhang13

Hsiao-Wuen Hon (Microsoft Research Asia)
Week 6 (10/11)

Title: The Changing Landscape for Research in Education

Abstract: The talk will review new opportunities for research in education created by the growth of learning technologies and the data they generate. Particular examples will be drawn from applications developed at the Teachers College EdLab. The talk will conclude with a discussion of the evolving infrastructure for gathering and managing data on educational applications being developed for use by investigators on campus and beyond.


Malhotra13 (extended abstract), Natriello12 (optional)

Gary Natriello is a Professor of Sociology and Education at Teachers College and Director of the Gottesman Libraries and the EdLab. His research interests include the education of at-risk youth, school organizations, performance evaluation, and online learning.
Week 7 (10/18)

Title: Para Empirical Data and Visualization

Abstract: I will be showing work in progress from a recent project at GSAPP: the Advanced Data Visualization Project. Our work responds to the fact that as the world faces an ever-growing deluge of data, the demands for new and innovative ways of thinking become increasingly important. Developing clear, sophisticated, and accessible visualizations for existing and future data sets is a vital part of understanding and leading an increasingly data-centric world. Hosted at GSAPP, the ADVP brings together an interdisciplinary group from within Columbia (The Library, The Journalism School, the Mind, Brain, Behavior Institute) and beyond to encourage diverse thinking around all facets of data visualization.

SIDL06

Laura Kurgan is an Associate Professor of Architecture, Director of Visual Studies, and Director of the Spatial Information Design Lab (SIDL) at the Graduate School of Architecture, Planning and Preservation at Columbia University. She is the author of Close Up at a Distance: Mapping, Technology and Politics (Zone Books, 2013). Kurgan's work has been exhibited internationally, including at the Museum of Modern Art in New York, the Cartier Foundation in Paris, the Venice Architecture Biennial, MACBA in Barcelona, the ZKM in Karlsruhe, and the Whitney Museum of American Art. In 2012 Kurgan was named a Game Changer by Metropolis Magazine, and in 2009, she was awarded a United States Artists Rockefeller Fellowship.
Week 8 (10/25)

Title: Detecting Contrary Meaning in Online Conversations

Abstract: The challenge we address in our research is the automatic detection of contrary meaning in conversational data. Contrary meaning can be implicit, such as the use of sarcasm where the authors state the opposite of what they actually mean (their actual attitudes or beliefs), or explicit such as the presence of overtly expressed conflicting statements/beliefs. In this talk, I will present the motivation and challenges for detecting contrary meaning in conversational data, as well as our proposed solutions for addressing this problem. Our approach for sarcasm detection uses machine learning to classify sarcastic vs. non-sarcastic utterances using a combination of lexical and contextual features. In addition, I will describe several crowdsourcing experiments we are currently conducting to collect large-scale annotated data of sarcastic messages. For detection of conflicting statements/beliefs, I will present our approach, which frames the problem as a 2-way Textual Entailment problem.

Gonzalez11, deMarneffe08

Smaranda Muresan is a Research Scientist at the Center for Computational Learning Systems at Columbia University. Her research interests are at the intersection of natural language processing and machine learning. Her research is focused on grammar induction, computational semantics, language in social media, and applications to computational social science and health informatics. Her work is funded primarily by DARPA and NSF. She received her PhD in Computer Science from Columbia University in 2006. From September 2006 to August 2008 she was a Postdoctoral Researcher at the Institute for Advanced Computer Studies, University of Maryland College Park, working on machine translation. Before joining CCLS in September 2013, she was an Assistant Professor in the Department of Library and Information Science and a Graduate Faculty in the Department of Computer Science at Rutgers University. At Rutgers, she was the co-founder and co-director of the Laboratory for the Study of Applied Language Technology and Society.
Week 9 (11/1)

Title: Health Care Coordination and a Multi-Agent Systems "Turing Challenge"

Abstract: I recently argued that Turing, were he alive now, would conjecture differently than he did in 1950, and I suggested a new "Turing challenge" question, "Is it imaginable that a computer (agent) team member could behave, over the long term and in uncertain, dynamic environments, in such a way that people on the team will not notice it is not human?" In the last several decades, the field of multi-agent systems has developed a vast array of techniques for cooperation and collaboration as well as for agents to handle adversarial or strategic situations. Even so, current generation agents are unlikely to meet this new challenge except in very simple situations. Meeting the challenge requires new algorithms and novel plan representations. This talk will explore the implications of this new "Turing question" in the context of my group's recent work on developing intelligent agents able to work on a team with health care providers and patients to improve care coordination. Our goal is to enable systems to support a diverse, evolving team in formulating, monitoring and revising a shared "care plan" that operates on multiple time scales in uncertain environments. The coordination of care for children with complex conditions, which is a compelling societal need, is presented as a model environment in which to develop and assess such systems. The talk will focus in particular on challenges of interruption management, information sharing, and crowdsourcing for health literacy.

Amir13, Grosz12

Barbara Grosz is Higgins Professor of Natural Sciences in the School of Engineering and Applied Sciences at Harvard University. From 2001-2011, she served as dean of science and then dean of the Radcliffe Institute for Advanced Study at Harvard. Grosz is known for her seminal contributions to the fields of natural-language processing and multi-agent systems. She developed some of the earliest computer dialogue systems and established the research field of computational modeling of discourse. Her work on models of collaboration helped establish that field and provides the framework for several collaborative multi-agent and human-computer interface systems. Grosz is a member of the National Academy of Engineering, the American Philosophical Society, and the American Academy of Arts and Sciences and a fellow of the Association for the Advancement of Artificial Intelligence (AAAI), the Association for Computing Machinery, and the American Association for the Advancement of Science. In 2009, she received the ACM/AAAI Allen Newell Award for "fundamental contributions to research in natural language processing and in multi-agent systems, for her leadership in the field of artificial intelligence, and for her role in the establishment and leadership of interdisciplinary institutions." She served as president of the AAAI from 1993-1995 and on the Boards of IJCAI (Chair 1989-91) and IFAAMAS.
Week 10 (11/8)

Title: Leveraging social networks for toxicovigilance

Abstract: We'll begin with a review of prior efforts for social network message analysis, for biosurveillance, disaster response, depression, and now, digital drug detection. We'll discuss some challenges and shortcomings of recent efforts - from properly identifying subjects and content, to location, to properly inferring meaning - and how we're attempting to address them. Then we'll deep dive into our use of TFIDF and context-free grammars and how it's giving us useful results thus far.
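For readers unfamiliar with TF-IDF, the following is a minimal sketch of one common textbook variant (term frequency times log inverse document frequency). It is not the speakers' pipeline, which the abstract does not describe; the function name `tf_idf`, the whitespace tokenization, and the sample messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Weight = (term count / document length) * log(N / document frequency),
    one common textbook variant of TF-IDF.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample messages, not data from the talk.
messages = [
    "feeling sick after the party",
    "great party last night",
    "feeling great today",
]
scores = tf_idf(messages)
```

In this sketch, a term that appears in only one message (such as "sick") receives a higher weight than a term shared across messages (such as "party"), and a term present in every message would receive a weight of zero.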

Slides, slides


Nicholas Genes, MD, PhD, is Assistant Professor in the Department of Emergency Medicine at the Mount Sinai School of Medicine in New York City. Dr. Genes graduated from Brown University, received his MD and PhD from the University of Massachusetts Medical School, and completed Emergency Medicine residency training, chief residency, and a fellowship in informatics at the Mount Sinai School of Medicine. Dr. Genes has studied ways to enhance clinical documentation and patient flow through the emergency department, and developed tools for evaluating the utility of health information exchanges (HIE). He is also involved in studies of EHR usability, HIE as a quality improvement tool, and HIE notifications. He has written on the novel use of HIE for biosurveillance, to document emergency department crowding during the Spring 2009 H1N1 outbreak in New York City, and has since focused his research on the use of social media for improving public health and emergency preparedness.

Michael Chary is an MD/PhD student at Mount Sinai School of Medicine. He completed his PhD in computational neuroscience and electrophysiology in the laboratory of Dr. Ehud Kaplan studying how cocaine influences the representation of information in the mesolimbic system. He studied computer science, neuroscience, and biochemistry at New York University. Michael conducts clinical research in emergency medicine and neurosurgery, as well as basic research into the mechanisms of deep brain stimulation and the computational complexity of language, using the tools for multivariate analysis in large data sets he developed in his dissertation.

Week 11 (11/15)

Title: Causal Inference from Complex Observational Data

Abstract: One of the key problems we face with the accumulation of massive datasets (such as from electronic health records, financial markets, and social networks) is the transformation of data to actionable knowledge. In order to use the information gained from analyzing these data to intervene to, say, treat patients or create new fiscal policies, we need to know that the relationships we have inferred are causal. Further, we need to know the time over which the relationship takes place, and what other factors are needed for the cause to be effective in order to intervene. This talk will discuss the challenges inherent in inference from observational data, recent work addressing these, and the current limits of causal inference.


Kleinberg11

Samantha Kleinberg is an Assistant Professor of Computer Science at Stevens Institute of Technology. She received her PhD in Computer Science from New York University in 2010 and was a Computing Innovation Fellow at Columbia University in the Department of Biomedical Informatics from 2010-2012. Her research centers on developing methods for analyzing large-scale, complex, time-series data. In particular, her work develops methods for finding causes and automatically generating explanations for events, facilitating decision-making using massive datasets. She is the author of Causality, Probability, and Time (Cambridge University Press, 2012), and PI of an R01 from the National Library of Medicine.
Week 12 (11/22)

Title: Data-Enabled New Paradigm for Civil Infrastructure Management

Abstract: Elaborate infrastructure systems lie at the heart of the U.S. economy and are an essential part of the lives of all of us. The vast majority of these systems are over 40 years old. Many are approaching one hundred. They are not only deteriorating from age, but are also increasingly vulnerable to natural or man-made threats. In the context of ongoing budgetary conflict and crises, a new paradigm is required to meet the challenges of intelligent infrastructure management.

This talk explores the emerging applications of sensor-based technology to monitor such systems. By monitoring structural integrity and providing objective, quantitative data, sensor-based monitoring technology has the potential to identify and prioritize those systems most in need of repair. Limited resources can then be most effectively directed. The major challenges that limit wider use of structural health monitoring will be discussed, along with recent efforts to address these challenges by developing innovative sensors meeting unique demands and advanced data mining tools for reliable structural health diagnosis.


Carden04

Maria Feng is Renwick Professor at the Department of Civil Engineering and Engineering Mechanics, Columbia University. Her research is on the forefront of multidisciplinary science and engineering in sensors, structural health monitoring, intelligent structures and system control for civil infrastructure and military applications. She has made a number of original contributions to the state-of-the-art in both academic research and engineering practice through the development of a number of novel sensors and algorithms for damage detection and assessment based on sensor data. Professor Feng's achievements have been recognized by her election as a Fellow of the American Society of Civil Engineers (ASCE) and numerous national and international awards, including the CAREER Award by National Science Foundation, the Collingwood Prize by ASCE, the Alfred Noble Prize awarded jointly by ASCE, ASME, IEEE, AIMMPE, and WSE, the Walter L. Huber Civil Engineering Research Prize by ASCE, and best paper awards by a number of international professional societies. She was recognized as a Top Researcher in Wearable Sensors by the MIT Technology Review.
Week 13 (12/6)

Title: Introduction to University Tech Transfer

Abstract: CTV manages more than 300 invention disclosures from faculty, 70 license deals and 15 new start-ups each year, leveraging approximately 45 multi-disciplinary, full-time staff across Columbia's two campuses. CTV currently has over 1200 patent assets available for licensing, across research fields such as bio, IT, cleantech, medical devices, nanotechnology, and material science. Revenues generated then flow back into Columbia, to be shared with the inventor, the inventor's lab, the department, the school, and the University overall. Columbia has a particular focus on start-up companies. Over the years, CTV has launched over 140 companies based on Columbia's technologies, over 90 of which are still active today. Of these 140+ companies, 35 were venture-backed, and 20+ have been sold or gone public.


Title: Patents 101

Abstract: As the Columbia technology transfer experience demonstrates, patents can provide a successful vehicle for generating research funding for university inventions. Unfortunately, many seemingly harmless events occurring after the conception of an invention can jeopardize the right to file for a patent. This presentation will provide a patent law primer focusing on what you need to know to preserve patent protection for your inventions.


Sample patent

Orin Herskowitz is the VP of Intellectual Property and Tech Transfer for Columbia University, as well as Executive Director of Columbia Technology Ventures (CTV). Orin received his BA from Yale and his MBA from the Wharton School of Business. Prior to joining Columbia, Orin spent 7 years at the Boston Consulting Group's New York office as a strategy consultant, and was previously an entrepreneur and a consultant to start-ups.

Jeff Sears serves as Chief Patent Counsel and an Associate General Counsel for Columbia. His practice encompasses all aspects of patent law, including prosecution, portfolio management, strategic counseling, licensing and post-licensing compliance, litigation, and legislative and regulatory patent matters. Jeff has practiced patent law for more than a dozen years. He joined Columbia in 2005, after having spent five years in private practice. Jeff holds an S.B. in physics from MIT, an M.A. and Ph.D. in physics from SUNY Stony Brook, and a J.D. from NYU. He is admitted to practice law in New York and before the U.S. Patent and Trademark Office.

Study Days (12/11) Final reports and presentations (Mudd 1306)