David J. Malan
Thursday April 25, 2013
CS50 is Harvard College’s introductory course for majors and non-majors alike, enrollment in which both rose and fell along with the dotcoms. Although enrollment peaked in 1996 at 386 students, it had settled by 2002 in the neighborhood of 100. We set out in 2007 to combat that trend by tackling two problems. We hypothesized that the course suffered from one of perception and one of design. Although the course was well-regarded on campus, the consensus was nonetheless to beware. And the course’s own syllabus may very well have been dated in the eyes of students who had begun to carry the very latest in hardware and software. Not only, then, did we proceed to revamp every one of CS50’s problem sets, we brought its syllabus more in line with technological trends already familiar to students. And we altered the tone of the course to appeal to “those less comfortable” with computing, cultivating a new culture based on shared experience. But we took care to preserve the course’s rigor and underlying fundamentals, lest we do our own students a disservice.
Our new approach appears to be working. Between 2006 and 2007, enrollment in CS50 more than doubled from 132 to 282 (+114%). Between 2007 and 2008, enrollment increased another 17% to 330, though even more striking was that year’s 48% increase in female enrollment. In 2010, enrollment surpassed 1996’s peak. And in 2012, the course became the second-largest course on campus, with an enrollment of 714.
We present in this talk what we have done and why we have done it. And we offer a look at CS50x, Harvard College’s first course to be offered on an even larger scale via edX.
David J. Malan is a Senior Lecturer on Computer Science in Harvard's School of Engineering and Applied Sciences. He received his A.B., S.M., and Ph.D. in Computer Science from Harvard in 1999, 2004, and 2007, respectively. His research in graduate school focused primarily on cybersecurity and computer forensics. His more recent publications focus on instructional technologies and pedagogy. He now teaches the College's introductory computer science course, Computer Science 50, otherwise known as CS50, which is available as OpenCourseWare at cs50.tv
as well as via edx.org
DISTINGUISHED LECTURE SERIES
Expanding Capabilities for Functional Encryption
Allison Bishop Lewko
Monday, February 25, 2013
ABSTRACT: Functional Encryption represents a new vision for designing cryptosystems that simultaneously achieve flexible data sharing and strong security. Using a functional encryption scheme, a data owner can specify an access policy that will be enforced by the encryption scheme itself, allowing only authorized users to decrypt. In this talk, I will present functional encryption schemes supporting expressive access policies while retaining provable security guarantees against powerful adversaries. In particular, I will describe a new decentralized system enabling data encryption across different trust domains.
BIOGRAPHY: Allison Bishop Lewko is a postdoctoral researcher at Microsoft Research New England. She completed her Ph.D. in computer science at The University of Texas at Austin in May 2012. She received her bachelor's degree in mathematics from Princeton University and a master's in mathematics from the University of Cambridge. Much of her research is focused on developing advanced cryptosystems with strong security guarantees, though she is also active in the areas of complexity theory, distributed computing, harmonic analysis, and combinatorics.
FACULTY CANDIDATE SEMINARS
Human-Powered Data Management
Monday, February 11, 2013
Fully automated algorithms are inadequate for a number of data analysis tasks, especially those involving images, video, or text. Thus, there is a need to combine human computation (or crowdsourcing), together with traditional computation, in order to improve the process of understanding and analyzing data. My thesis addresses several topics in the general area of human-powered data management. I design algorithms and systems for combining human and traditional computation for: (a) data processing, e.g., using humans to help sort, cluster, or clean data; (b) data extraction, e.g., having humans help create structured data from information in unstructured web pages; and (c) data gathering, i.e., asking humans to provide data that they know about or can locate, but that would be difficult to gather automatically. My focus in all of these areas is to find solutions that expend as few resources as possible (e.g., time waiting, human effort, or money spent), while still providing high quality results.
In this talk, I will first present a broad perspective of our research on human-powered data management, and I will describe some systems and applications that have motivated our research. I will then present details of one of the problems we have addressed: filtering large data sets with the aid of humans. Finally I will argue that human-powered data management is an area in its infancy, by describing a number of open problems I intend to address in my future research program.
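To make the filtering problem concrete, here is a toy, fixed-budget baseline (not the speaker's algorithm, and with invented names throughout): ask a fixed number of workers per item and keep majority-approved items, while tracking cost as votes issued. Adaptive schemes of the kind the talk describes instead stop asking as soon as an item's answer is clear, spending fewer votes per item.

```python
import random

def crowd_filter(item_ids, ask_human, votes_per_item=5, threshold=0.5):
    """Keep items whose fraction of "yes" votes clears the threshold.

    A fixed-budget baseline: every item always costs votes_per_item
    votes, whether or not the answer was already obvious.
    """
    kept, cost = [], 0
    for item in item_ids:
        yes = sum(1 for _ in range(votes_per_item) if ask_human(item))
        cost += votes_per_item  # one unit of pay per vote issued
        if yes / votes_per_item > threshold:
            kept.append(item)
    return kept, cost

# Simulated crowd: workers answer correctly 80% of the time.
random.seed(7)
truth = {i: (i % 2 == 0) for i in range(20)}  # even ids are "good"
noisy_worker = lambda i: truth[i] if random.random() < 0.8 else not truth[i]
kept, cost = crowd_filter(list(range(20)), noisy_worker)
```

With five votes per item at 80% worker accuracy, a majority vote is correct about 94% of the time; the research question is how to approach that quality while paying for far fewer than five votes per item.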
BIOGRAPHY: Aditya Parameswaran is a Ph.D. student in the InfoLab at Stanford University, advised by Prof. Hector Garcia-Molina. He is broadly interested in data management, with research results in human computation, information extraction, and recommendation systems. Aditya is a recipient of the Key Scientific Challenges Award from Yahoo! Research (2010), two best-of-conference citations (VLDB 2010 and KDD 2012), and the Terry Groswith graduate fellowship at Stanford University.
Provable Bounds in Machine Learning
Thursday, February 14, 2013
Machine learning is a vibrant field with many rich techniques.
However, these approaches are often heuristic: we cannot prove good bounds on either their performance or their running time except in limited settings. This talk will focus on designing algorithms whose performance can be analyzed rigorously. As an example of this project, I will describe my work on the nonnegative matrix factorization problem, which has important applications throughout machine learning, statistics and optimization. As is often the case, this problem is NP-hard when considered in full generality. However, we introduce a sub-case called “separable” nonnegative matrix factorization that we believe is the right notion in various contexts. We give a polynomial time algorithm for this problem, and leverage this algorithm to efficiently learn the topics in the popular Latent Dirichlet Allocation model (and others too). This is an auspicious example where theory can lead to inherently new algorithms that have highly-practical performance on real data sets.
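As a sketch of what separability buys algorithmically, here is the successive-projection idea commonly associated with separable NMF: under separability, greedily picking the row of largest norm and projecting the remaining rows off it recovers the "anchor" rows that generate the rest. This is a simplified textbook variant for illustration, not necessarily the algorithm from the talk.

```python
def successive_projections(rows, k):
    """Pick k anchor row indices from a (separable) nonnegative matrix.

    Greedy sketch: take the row of largest squared norm, project all
    rows onto its orthogonal complement, and repeat.
    """
    R = [list(r) for r in rows]
    anchors = []
    for _ in range(k):
        norms = [sum(x * x for x in r) for r in R]
        j = max(range(len(R)), key=norms.__getitem__)
        anchors.append(j)
        a, na = R[j], norms[j]
        if na == 0:
            break
        for r in R:  # r <- r - proj_a(r)
            c = sum(x * y for x, y in zip(r, a)) / na
            for i in range(len(r)):
                r[i] -= c * a[i]
    return anchors

# Rows 0 and 1 are anchors; rows 2 and 3 are nonnegative mixtures of them.
M = [[4.0, 0.0], [0.0, 3.0], [2.0, 1.0], [1.0, 1.0]]
```

Each iteration costs a pass over the matrix, so the whole procedure runs in polynomial time, which is exactly the contrast with the NP-hardness of the general problem.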
I will also briefly describe some of my other work on optimization, including vertex sparsification and lower bounds for linear programs via information theory.
BIOGRAPHY: Ankur Moitra is an NSF CI Fellow at the Institute for Advanced Study, and also a senior postdoc in the computer science department at Princeton University. He completed his PhD and MS at MIT in 2011 and 2009 respectively, where he was advised by Tom Leighton. He received a George M. Sprowls Award (best thesis) and a William A. Martin Award (best thesis) for his doctoral and master’s dissertations. He has worked in numerous areas of algorithms, including approximation algorithms, metric embeddings, combinatorics and smoothed analysis, but lately has been working at the intersection of algorithms, statistics and machine learning.
Unraveling the heterogeneity and diversity of regulatory elements in the human genome
Wednesday, February 20, 2013
In 2003, the Human Genome Project marked a major scientific milestone by releasing the first consensus DNA sequence of a complete human genome. The ENCODE (Encyclopedia of DNA elements) project was launched to pick up where the Human Genome Project left off, with the goal of systematically deciphering the function of every base (letter) in the genome. Over the last 4 years, ENCODE has generated extensive genome-wide functional genomic data measuring the cellular activity of thousands of cellular moieties in a variety of normal and diseased cellular contexts. In this talk, I will describe novel computational and machine learning approaches that I developed for integrative analysis of ENCODE data in order to unravel the functional heterogeneity of regulatory elements in the human genome and their implications in human disease.
I will begin with a gentle introduction to the diversity and scale of ENCODE data and a brief overview of robust, statistical methods that I developed for automated detection of DNA binding sites of regulatory proteins from massive collections of noisy, experimental data.
Regulatory proteins can perform multiple functions by interacting with and co-binding DNA with different combinations of other regulatory proteins. I developed a novel discriminative machine learning formulation based on regularized rule-based ensembles that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant item-sets of co-binding events at an unprecedented level of detail. I discovered a large number of novel pairwise and higher-order interactions, several of which were experimentally validated. I found extensive evidence that regulatory proteins could switch co-binding partners at different sets of regulatory domains within a single cell-type and across different cell-types, thereby affecting patterns of other chemical modifications to DNA and regulating different functional categories of target genes.
Finally, I will present a novel approach that we developed to exploit ENCODE data to significantly improve interpretation of human disease studies. Massive case-control studies involving comparative analysis of DNA sequences of diseased and healthy individuals have been reasonably successful at identifying genomic variants (mutations) associated with various human diseases. However, understanding the functional impact of these mutations has been very challenging. Using functional elements discovered from ENCODE data, we were able to identify and prioritize functional variants, provide a functional annotation for up to 81% of all publicly available disease-associated variants and generate new hypotheses by integrating multiple sources of data.
Together, these efforts take us one step closer to learning unified models of regulatory mechanisms in humans in order to improve our system-level understanding of cellular processes and complex diseases.
BIOGRAPHY: Anshul Kundaje is a Research Scientist in the Computer Science Department at the Massachusetts Institute of Technology and the Broad Institute of MIT and Harvard. His primary research interest is large-scale computational genomics. He specializes in developing statistical and machine learning methods for integrative analysis of heterogeneous, high-throughput functional genomics data to decipher functional elements, learn dynamic models of gene regulation and improve interpretation of disease studies. He completed his PhD in Computer Science at Columbia University (2003-2008), under the guidance of Dr. Christina Leslie where he developed novel machine learning methods based on Boosting algorithms to learn models of gene regulation in yeast. He conducted postdoctoral research (2008-2012) in the Computer Science department at Stanford University, mentored by Profs. Serafim Batzoglou and Arend Sidow. He served as the lead data-coordinator and primary integrative analyst with the human ENCODE (Encyclopedia of DNA Elements) consortium where he developed novel computational approaches for large-scale, integrative analysis of diverse functional genomics data to decipher the complexity, interactions and regulatory potential of functional elements in humans. Currently at MIT, he works with Prof. Manolis Kellis’s group as one of the primary computational analysts of the NIH Roadmap Epigenomics Project to understand the variation of functional genomic signals across individuals, cell-types and organisms. He is now keen to start his own research program and to collaborate extensively with computational and experimental groups to explore novel, large-scale machine learning approaches to improve our system-level understanding of cellular regulatory mechanisms, human health and disease.
SCALING DEEP LEARNING TO 10,000 CORES AND BEYOND
Quoc V. Le
Monday, March 4, 2013
Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and natural language processing. However, these methods have been fundamentally limited by our computational abilities, and have typically been applied only to small problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a large model on a cluster of 16,000 CPU cores (2,000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature.
Such a network, applied at this scale, is able to learn abstract concepts in a much more general manner than previously demonstrated. Specifically, we find that by training on 10 million unlabeled images, the network produces features that are very selective for high-level concepts such as human faces and cats. Using these features, we also obtain breakthrough performance gains on several large-scale computer vision tasks.
Thanks to its scalability and insensitivity to modalities, our framework is also used successfully to achieve performance leaps in other domains, such as speech recognition and natural language understanding.
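The data-parallel training style described above can be caricatured in a few lines. The sketch below is a toy, single-machine analogue of asynchronous SGD with shared parameters; the least-squares task, hyperparameters, and worker count are invented for illustration, and this is in no way the actual 16,000-core system.

```python
import random
import threading

random.seed(1)

# Toy task: recover TRUE_W by least squares from noiseless samples.
TRUE_W = [2.0, -3.0]
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(4000)]
data = [(x, TRUE_W[0] * x[0] + TRUE_W[1] * x[1]) for x in xs]

w = [0.0, 0.0]  # parameters shared by all workers
LR = 0.05

def worker(shard):
    # Each worker reads the shared weights, computes a gradient on its
    # own data shard, and writes the update back without any locking;
    # the weights it read may therefore be slightly stale by the time
    # it writes -- the hallmark of asynchronous SGD.
    for x, y in shard:
        err = w[0] * x[0] + w[1] * x[1] - y
        for i in range(2):
            w[i] -= LR * err * x[i]

threads = [threading.Thread(target=worker, args=(data[k::4],)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the unsynchronized, stale updates, the shared weights still converge close to the true values, which is the empirical observation that makes this style of training viable at scale.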
BIOGRAPHY: Quoc V. Le is a PhD student at Stanford and a software engineer at Google. At Stanford and Google, Quoc works on large-scale brain simulation using unsupervised feature learning and deep learning. His recent work in deep learning and big data, which yields state-of-the-art performance on many pattern recognition tasks, was widely distributed and discussed on various technology blogs and news sites. Quoc obtained his undergraduate degree with First Class Honours as a Distinguished Scholar at the Australian National University. As an undergraduate, he worked on kernel methods, ranking, structured output prediction and optimization. His undergraduate honours thesis won a best paper award at ECML 2007. Quoc has also been a researcher at National ICT Australia, Microsoft Research and the Max Planck Institute for Biological Cybernetics.
New Systems, Algorithms, and Data Structures for High Availability
Monday, March 11, 2013
Users of Internet services are increasingly intolerant of delays and outages, while demanding an increasingly rich and consistent online experience. A single screenshot of a website that is down or misbehaving can spread through the news like wildfire. Yet protecting against such failures — whether due to misconfigurations, bugs, or even malice — is notoriously expensive. At the same time, the pressure to increase availability pushes system designers to cut corners, sometimes with disastrous consequences.
In this talk, I will describe new systems and theory for delivering high availability to users, inspired by real-world debacles, some of which are unknown to the public. A unifying approach in my research is to integrate theoretical innovations into the final stages of system design, giving robust guarantees that are also practical. First, I will describe a system that allows any Internet service to tolerate arbitrary faults scalably on read-mostly workloads, as well as new algorithms to scale this fault tolerance to a million nodes on general workloads. Then, I will describe an indexing technique used by database systems to increase transaction concurrency, and theoretically explain why it failed for one company, leading to the design of new data structures that are being incorporated into algorithms textbooks. I will also briefly discuss my other work on high availability, such as a system for saving energy on enterprise desktops, and my ongoing work on a system for rich analysis of globally distributed, real-time data.
BIOGRAPHY: Siddhartha Sen is a computer science PhD candidate at Princeton University and a junior research scientist at New York University. He designs and builds distributed systems that are provably scalable and robust, by tackling theoretical questions that arise during system creation. He received his BS degrees in computer science and mathematics and his MEng degree in computer science from MIT. He spent three years at Microsoft building a network load balancer for Windows Server, holding several patents for this work. Siddhartha received the first Google Fellowship in Fault-Tolerant Computing in 2009 and the best student paper award at PODC in 2012.
Interacting with Small Devices in Big Ways
Wednesday, March 13, 2013
ABSTRACT: Despite their small size, mobile devices are able to perform tasks of creation, information and communication with unprecedented ease. However, diminutive screens and buttons mar the user experience, and otherwise prevent us from realizing the full potential of computing on the go. In this talk, I will first discuss strategies I have pursued to expand and enrich interaction. For example, fingers have many “modes”: they do not just poke, as contemporary touchscreen interaction would suggest, but also scratch, flick, knock, rub, and grasp, to name a few. I will then highlight an emergent shift in computing: from mobile devices we carry to using everyday surfaces for interaction, including tables, walls, furniture and even our skin, bringing computational power ever closer to users. This evolution brings significant new challenges in sensing and interaction design. For example, the human body is not only incredibly irregular and dynamic, but also comes in more than six billion different models. However, along with these challenges also come exciting new opportunities for more powerful, intuitive and intimate computing experiences.
BIOGRAPHY: Chris Harrison is a Ph.D. candidate in Human-Computer Interaction at Carnegie Mellon University. He broadly investigates novel sensing technologies and interaction techniques, especially those that empower people to interact with “small devices in big ways.” Harrison was recently named as one of the top 30 scientists under 30 by Forbes, a top 35 innovator under 35 by MIT Technology Review, and one of six innovators to watch in 2013 by Smithsonian. He has been awarded fellowships by Google, Microsoft Research and Qualcomm, and has won several best paper awards. When not in the lab, Chris can be found welding sculptures, blowing glass, and visiting remote corners of the globe.
Online Learning: Theory and Algorithms
University of Pennsylvania
Monday, March 25, 2013
Online learning has emerged as an important topic in machine learning with implications in various other subfields of computer science and game theory. Although online learning has been steadily gaining popularity, there is a lack of generic tools to characterize learnability and optimal rates for general online learning problems. In this talk I will first provide sequential complexity tools that characterize online learning rates for general online learning problems. The goal of this first part of the talk can be seen as developing an analog of VC theory for online learning.
The main goal of the second part of the talk is to convert the theory developed in the first part into usable algorithms. Specifically, I will provide a general recipe for deriving efficient online learning algorithms for various online learning problems, starting from the sequential complexity tools of the first part. Most existing online learning algorithms can be seen as arising from this recipe, and several new, efficient algorithms can be derived from it as well. In particular, we will see how the recipe yields new algorithms for online matrix prediction, online link classification, and new projection-free online convex optimization.
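To ground the discussion in the simplest instance, here is projected online gradient descent on an interval — the canonical no-regret algorithm that general recipes of this kind recover as a special case. The losses and constants below are invented for the example.

```python
import math

def ogd(grads, radius):
    """Projected online gradient descent on [-radius, radius].

    Plays x_t, observes the gradient of the t-th convex loss at x_t,
    and takes a projected step with the standard 1/sqrt(t) step size,
    which yields O(sqrt(T)) regret against any fixed point.
    """
    x, played = 0.0, []
    for t, g in enumerate(grads, start=1):
        played.append(x)
        eta = radius / math.sqrt(t)
        x = min(max(x - eta * g(x), -radius), radius)
    return played

# Adversary: squared losses (x - z_t)^2 with targets alternating
# between 1.0 and 0.5, so the best fixed action in hindsight is 0.75.
targets = [1.0 if t % 2 == 0 else 0.5 for t in range(200)]
plays = ogd([lambda x, z=z: 2 * (x - z) for z in targets], radius=1.0)
```

Even against the alternating adversary, the iterates settle near the best fixed action, illustrating the regret guarantee that the sequential complexity machinery generalizes far beyond this one-dimensional case.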
BIOGRAPHY: Karthik Sridharan is currently a postdoctoral research scholar at the University of Pennsylvania. He received his PhD in computer science from the Toyota Technological Institute at Chicago in 2012. His research interests include machine learning, optimization, empirical process theory and game theory.
Inferring network architecture in biological systems: understanding temporal processes that control immune cell function
Broad Institute of Harvard & MIT
Wednesday, April 3, 2013
Complex, interacting systems are omnipresent in the world – from our social networks to the molecular circuits that guide the behavior of our cells. Charting out the connectivity of such systems and understanding their function has proven an extremely important, yet often daunting task. In this talk I will describe my work on modeling networks of molecules that control complex temporal processes in the immune system and explain how we used the resulting models to gain insight into biological organizational principles and to identify key regulators of autoimmunity.
Our modeling strategy relies on the integration of genome-scale datasets of various types, including temporal activity of individual components, probabilities for physical interactions between components, and causal relationships (i.e., perturbing component “A” affects the activity of “B”). The model inference is done iteratively such that putative key regulators are automatically chosen for perturbation experiments, which then serve for validating, refining and annotating the model. Formulating the ensuing data as network structures allows us to use graph-theoretic optimization algorithms as a modeling tool. For instance, we have developed an algorithm that combines a global optimization objective (directed Steiner trees) with a local one (shortest paths), proven its approximation bounds, and adapted its implementation for handling large-scale and real-world biological data.
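The shortest-path component of that objective can be sketched generically: connect a perturbed regulator to each affected gene by a cheapest path and take the union of edges as a candidate subnetwork. This Dijkstra-based proxy is only an illustration of the flavor of the approach, not the authors' combined Steiner-tree algorithm, and the toy network and edge costs below are made up.

```python
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, cost), ...]}; returns shortest
    # distances from source and predecessor pointers.
    dist, prev = {source: 0.0}, {}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def shortest_path_subnetwork(graph, source, terminals):
    # Union of shortest paths from a perturbed regulator (source)
    # to each affected gene (terminal): a cheap local proxy for
    # the global Steiner-tree objective.
    _, prev = dijkstra(graph, source)
    edges = set()
    for t in terminals:
        node = t
        while node in prev:
            edges.add((prev[node], node))
            node = prev[node]
    return edges

# Invented toy network: edge costs stand in for confidence-weighted
# interactions (low cost = high-confidence physical interaction).
net = {"A": [("B", 1.0), ("C", 4.0)],
       "B": [("C", 1.0), ("D", 5.0)],
       "C": [("D", 1.0)]}
sub = shortest_path_subnetwork(net, "A", ["D"])
```

The appeal of combining this local criterion with a global one, as in the talk, is that shortest paths alone can miss shared intermediate hubs that a tree-structured objective would reward.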
Using this strategy, we investigated the differentiation of naïve T cells into autoimmune-inducing Th17 T helper cells, which, despite enormous clinical importance, remain poorly understood. We computationally derived and then experimentally validated a temporal model of the regulatory network that controls the differentiation process. The network consists of two self-reinforcing, but mutually antagonistic, modules that control the Th17 phenotype and collectively achieve appropriate balance between Th17 cells and other competing T cell lineages. Specifically, we identified several molecular circuits that are crucial to blocking the pathogenesis of Th17-dependent autoimmunity by shifting the balance toward the differentiation of immunosuppressive T cells. Overall, our study identified 41 regulatory molecules that affect Th17 differentiation, characterized their interactions, and highlighted novel drug targets for a host of autoimmune diseases.
BIOGRAPHY: Nir Yosef is a postdoctoral associate at the Broad Institute and at the Center for Neurologic Diseases, Brigham and Women’s Hospital, Harvard Medical School (since 2010). Working with Aviv Regev (Broad) and Vijay K. Kuchroo (Harvard), Yosef is developing computational methods for analyzing diverse high-throughput data sets, with the goal of better understanding gene regulation in mammalian cells. His main interest is in dynamic regulatory circuits, focusing on the immune system as a model for both long-term processes (differentiation) and short-term responses (exposure to pathogens). Yosef earned his Ph.D. in Computer Science from the Tel Aviv University in Israel.
Teaching People and Machines How to Enhance Images
University of California, Berkeley
Thursday, April 4, 2013
ABSTRACT: Procedural tasks such as following a recipe or editing an image are very common. They require a person to execute a sequence of operations (e.g. chop onions, or sharpen the image). People commonly use step-by-step tutorials to learn these tasks. We focus on the domain of photo manipulations and have developed tools and techniques to help people learn, compare and automate photo manipulation procedures. We present a demonstration-based system for automatically generating succinct step-by-step visual tutorials of photo manipulations. Our tutorials are quick to generate. Moreover, users are faster and make fewer errors with our tutorials compared to book and video tutorials. We also demonstrate a new interface that allows learners to browse, analyze and compare large collections (i.e. thousands) of photo manipulation tutorials based on their command-level structure. Finally, we present a framework for generating content-adaptive macros (programs) that can transfer complex photo manipulation procedures to new target images. Together these tools allow people to learn, compare and automate procedural knowledge.
Database Cracking and the Path Towards Auto-tuning Database Kernels
Monday, April 8, 2013
Today, businesses and sciences create more data than we can store and move, let alone analyze. A fundamental problem with big data is that data management systems require extensive tuning (indexing) and installation steps; by the time we finish tuning, more data have already arrived. To make things worse, tuning a database system requires knowledge of what to tune for, i.e., we need to know the kind of queries we will be posing. However, in several modern big data applications, we are often in need of exploring new data quickly, searching for interesting patterns without knowing a priori exactly what we are looking for.
Database cracking completely removes the need for index tuning in database systems. With database cracking, indices are built incrementally, adaptively and on demand; each query is treated as advice on how data should be stored. With each incoming query, data is reorganized on the fly as part of query processing, while future queries exploit and continuously enhance this knowledge. Autonomously, adaptively and without any external human administration, the database system quickly adapts to a new workload and reaches optimal performance when the workload stabilizes.
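The core mechanism can be sketched in a few dozen lines: each range query partially partitions the column around its bounds (as in one step of quicksort) and records the learned boundaries in a "cracker index", so later queries scan ever smaller pieces. This is a minimal illustrative version, not MonetDB's actual implementation.

```python
import bisect

class CrackedColumn:
    """A column that physically reorganizes itself a little per query."""

    def __init__(self, values):
        self.values = list(values)
        # Cracker index: sorted (pivot, idx) pairs meaning that all
        # values before position idx are < pivot, the rest are >= pivot.
        self.index = []

    def _partition(self, lo, hi, pivot):
        # In-place two-way partition of values[lo:hi] around pivot,
        # returning the first position holding a value >= pivot.
        i = lo
        for j in range(lo, hi):
            if self.values[j] < pivot:
                self.values[i], self.values[j] = self.values[j], self.values[i]
                i += 1
        return i

    def _crack(self, pivot):
        # Reuse previously learned bounds to partition only the one
        # piece of the column that can contain the pivot.
        pivots = [p for p, _ in self.index]
        k = bisect.bisect_left(pivots, pivot)
        if k < len(pivots) and pivots[k] == pivot:
            return self.index[k][1]  # already cracked on this pivot
        lo = self.index[k - 1][1] if k > 0 else 0
        hi = self.index[k][1] if k < len(self.index) else len(self.values)
        pos = self._partition(lo, hi, pivot)
        self.index.insert(k, (pivot, pos))
        return pos

    def range_query(self, low, high):
        # Answer low <= v < high, cracking the column as a side effect.
        a = self._crack(low)
        b = self._crack(high)
        return self.values[a:b]

col = CrackedColumn([13, 4, 7, 1, 9, 2, 11, 8])
first = col.range_query(5, 10)   # cracks on 5 and 10
second = col.range_query(2, 8)   # touches only already-narrowed pieces
```

Note that neither query sorts the column: each one does just enough reorganization to answer itself, and the accumulated boundaries make every later query cheaper.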
Stratos Idreos holds a tenure-track senior researcher position at CWI, the Dutch national research center for mathematics and computer science. The main focus of his research is on designing adaptive database architectures, tailored for big data exploration. He also works on stream processing, distributed query processing and scientific databases. Stratos obtained his Ph.D. from CWI and the University of Amsterdam. In the past, he has also been with the Technical University of Crete, Greece, and held research internship positions with Microsoft Research, Redmond, USA, with EPFL, Switzerland, and with IBM Almaden, USA. Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation Award for his thesis on database cracking, and the 2011 ERCIM Cor Baayen Award as “most promising European young researcher in computer science and applied mathematics”. In 2010 he received the IBM zEnterprise System Recognition Award from IBM Research, and in 2011 he also won the VLDB 2011 Challenges and Visions best paper award.
Fast learning algorithms for discovering the hidden structure in data
Wednesday, April 10, 2013
ABSTRACT: A major challenge in machine learning is to reliably and automatically discover hidden structure in data with minimal human intervention. For instance, one may be interested in understanding the stratification of a population into subgroups, the thematic make-up of a collection of documents, or the dynamical process governing a complex time series. Many of the core statistical estimation problems for these applications are, in general, provably intractable for both computational and statistical reasons; and therefore progress is made by shifting the focus to realistic instances that rule out the intractable cases. In this talk, I’ll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, Hidden Markov models, Latent Dirichlet Allocation, Probabilistic Context Free Grammars, and several more. The key idea is to exploit the structure of low-order correlations that is present in high-dimensional data. The scope of the new approach extends beyond the purview of previous algorithms; and it leads to both new theoretical guarantees for unsupervised machine learning, as well as fast and practical algorithms for large-scale data analysis.
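A tiny worked instance of "low-order correlations reveal hidden structure": for an equal-weight mixture of two unit-variance Gaussians, the first two moments alone determine both component means, since E[x] = (μ₁+μ₂)/2 and E[x²] = 1 + (μ₁²+μ₂²)/2. The sketch below, with invented parameters, recovers them from samples; the spectral methods for the richer models named in the talk extend this idea to third-order moments.

```python
import math
import random

def moment_estimate(samples):
    # Recover the two component means of a 50/50 mixture of
    # N(mu1, 1) and N(mu2, 1) from the first two empirical moments.
    n = len(samples)
    m1 = sum(samples) / n                 # estimates (mu1 + mu2) / 2
    m2 = sum(x * x for x in samples) / n  # estimates 1 + (mu1^2 + mu2^2) / 2
    s = 2 * m1                            # mu1 + mu2
    p = (s * s - 2 * (m2 - 1)) / 2        # mu1 * mu2
    disc = max(s * s - 4 * p, 0.0)        # guard against sampling noise
    r = math.sqrt(disc)
    # mu1, mu2 are the roots of t^2 - s*t + p = 0.
    return sorted([(s - r) / 2, (s + r) / 2])

random.seed(0)
# Draw samples from a 50/50 mixture of N(-2, 1) and N(3, 1).
data = [random.gauss(-2 if random.random() < 0.5 else 3, 1.0)
        for _ in range(200_000)]
mu_lo, mu_hi = moment_estimate(data)
```

The estimates concentrate around the true means (-2 and 3) as the sample size grows; the hard part in general models is showing that such moment equations remain solvable, stably, in high dimensions.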
BIOGRAPHY: Daniel Hsu is a postdoc at Microsoft Research New England. Previously, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta; and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.