Three join Columbia's Computer Science Department
This year, the Computer Science department welcomed three new faculty.
Alexandr Andoni | Advancing the algorithmic foundations of massive data
The impact of huge data sets is hard to overstate: they have advanced almost every scientific and technological field, from machine learning and personalized medicine to speech recognition and translation.
The flip side of the data revolution is that massive data has rendered many standard algorithms computationally expensive, necessitating a wholesale rethinking of even the most basic computational methods. To take one example: nearest neighbor search, the classic method since the 1970s for finding similar objects, is straightforward for small data sets: compare every object to every other and compute the similarity of each pair. But as the number of objects grows into the billions, computing time grows quadratically, making the task prohibitively expensive, at least by traditional expectations.
Alexandr Andoni, a theoretical computer scientist focused on developing algorithmic foundations for massive data, sees the need to reframe the issue: "The question today is not 'what can we solve in polynomial time?' but 'what is possible in time proportional to data size, or even less?' With all this data, what is the best we can do given the resources? Fortunately, vast improvements are possible in both theory and practice once we settle for approximate answers."
Rather than searching an entire data set for the single nearest neighbor, a search goes much faster if objects are pre-grouped according to some shared attribute, making it easy to zero in on just the small subset of objects most likely to contain the most similar neighbor. The new challenge then becomes: what attribute makes such a pre-grouping maximally efficient? The speed-up thus gained reverberates across a wide range of computational methods, since nearest neighbor search is ubiquitous and serves as a primitive in higher-level algorithms, particularly in machine learning.
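A minimal sketch of the contrast, with a simple random-projection bucketing scheme standing in for the carefully designed hash families used in practice (all names, sizes, and parameters below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100_000, 64))   # the data set
query = rng.normal(size=64)

# Brute force: compare the query against every point. Cost grows linearly
# per query, and quadratically if every pair must be compared.
def nearest_brute_force(points, query):
    dists = np.linalg.norm(points - query, axis=1)
    return int(np.argmin(dists))

# Pre-grouping: hash each point into a bucket using a random projection,
# then search only the bucket the query falls into. This captures the spirit
# of locality-sensitive hashing; a real index uses many tables and a proper
# hash family, and returns an *approximate* nearest neighbor.
def build_buckets(points, w=4.0, seed=1):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=points.shape[1])      # random direction
    b = rng.uniform(0, w)                     # random offset
    keys = np.floor((points @ a + b) / w).astype(int)
    buckets = {}
    for idx, key in enumerate(keys):
        buckets.setdefault(key, []).append(idx)
    return a, b, w, buckets

def nearest_in_bucket(points, query, index):
    a, b, w, buckets = index
    key = int(np.floor((query @ a + b) / w))
    candidates = list(buckets.get(key, range(len(points))))  # fall back to full scan
    dists = np.linalg.norm(points[candidates] - query, axis=1)
    return candidates[int(np.argmin(dists))]

index = build_buckets(points)
print(nearest_brute_force(points, query), nearest_in_bucket(points, query, index))
```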
More generally, in the same spirit of relying on (approximate) attributes to speed up operations, Andoni has developed a theory of sketching that represents complex objects by smaller, simpler "sketches" that capture the main structure and essential properties of the original objects yet require less (sublinear) space and time to compute. For many tasks, such as estimating the similarity of a pair of objects, a sketch may work just as well as the fully realized object. While the broader community is relaxing strict formulations, in large part out of necessity, Andoni is carrying the idea further and is at the forefront of those inventing new primitives and new data structures that explicitly incorporate sketches.
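One concrete instance of the idea, shown only to illustrate the concept (a generic random-projection sketch, not Andoni's specific construction):

```python
import numpy as np

def make_sketcher(dim, sketch_dim=16, seed=0):
    """Return a function mapping a `dim`-dimensional vector to a short sketch."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(sketch_dim, dim)) / np.sqrt(sketch_dim)
    return lambda x: proj @ x

sketch = make_sketcher(dim=10_000)

x = np.random.default_rng(1).normal(size=10_000)
y = x + 0.1 * np.random.default_rng(2).normal(size=10_000)

# The distance between the sketches coarsely approximates the distance between
# the originals, but each sketch stores 16 numbers instead of 10,000.
print(np.linalg.norm(x - y), np.linalg.norm(sketch(x) - sketch(y)))
```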
In early work applying a sketch primitive (Locality Sensitive Hashing) to nearest neighbor search, Andoni, working with Piotr Indyk, was able in 2006, for the most basic Euclidean distances, to improve on a seminal 1998 algorithm widely used for classification. The Communications of the ACM (2008, vol. 51) later hailed the new primitive as a breakthrough technology that allowed researchers to revisit decades-old problems and solve them faster. Few expected more progress to be possible. Yet when Andoni revisited the problem again (in papers published in 2014 and 2015), he unexpectedly made more headway, this time by using data-dependent, rather than random, hash functions, an idea not previously thought useful for improving such algorithms.
More seems to be possible by continually re-examining classic, well-studied problems with fresh perspectives. It's one reason Andoni is returning to academia after four years at Microsoft Research Silicon Valley, which closed in 2014. One point in Columbia's favor was the Data Science Institute (DSI), where exceptional researchers from across the university and diverse disciplines study different aspects of data.
Says Kathy McKeown, director of the DSI: "We're thrilled to have Alex join the Data Science Institute. His research is directly relevant to big data applications where the ability to provide computationally efficient solutions depends on the development of new algorithms."
Andoni also looks for inspiration from students. "You feel the new energy. Students are excited, and that excitement and enthusiasm is invigorating. It leads you to think about even things you've already checked off, to believe there might be new ways of doing things. You want to try again."
B.S., Computer Science and Engineering, and B.S., Mathematics, Massachusetts Institute of Technology, 2004; M.Eng., Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2005; Ph.D., Computer Science, Massachusetts Institute of Technology, 2009; Researcher, Microsoft Research Silicon Valley, 2010-2014; Visiting scientist, Simons Institute for the Theory of Computing, 2014
Suman Jana | Protecting security and privacy in an age of perceptual computing
"I like to break things. To open a thing and see its inner workings and really understand how it works . . . and then break it in some clever way that maybe no one else thought of and in the worst possible way that doesn't seem possible, but is."
It's a hacker mindset, and Suman Jana is a hacker—of a sort. Though he admits enjoying the thrill of destruction and subversion, Jana wants to make systems more secure, and companies have hired him to find security flaws in their systems so those flaws can be fixed, not exploited. This semester he joins the Computer Science department to more broadly research issues related to security and privacy. It's a wide-open field.
Two years ago, for his thesis, he looked hard at the security risks inherent in perceptual computing, where devices equipped with cameras, microphones, and sensors perceive the world around them so they can operate and interact more intelligently: lights that dim when a person leaves the room, games that react to a player's throwing motion, doors that unlock when they recognize their owner.
It all comes at a cost, of course, especially in terms of privacy and security.
"Features don't come for free; they require incredible amounts of data. And that brings risks. The same data that tells the thermostat no one is home might also be telling a would-be burglar," says Jana.
What data is being collected isn't always known, even by the device manufacturers, who, in pursuit of features, default to collecting as much data as they can. This data is handed off to gaming, health, home monitoring, and other apps: not all are trusted; all are possible hacking targets.
There is no opting out. The inexorable trend is toward more perception in devices and more data collection, with privacy and security secondary considerations. For this reason, Jana sees the need for built-in privacy protections. A paper he co-authored, A Scanner Darkly, shows how privacy protection might work in an age of perceptual computing.
"Can we disguise some data? Should a camera for detecting hand gestures also read labels on prescription medicines inadvertently left in the camera's view? Would an app work just as well if it detected approximate contours of the hand? If so, we can pass on lower-resolution data so the prescription label isn't readable."
Jana's opinion is that users should decide what data apps are able to see. His DARKLY platform—named after a dystopian novel by Philip K. Dick—inserts a privacy protection layer that intercepts the data apps receive and displays it in a console so users can see and, if they want, limit how much data is passed on to the app. The platform, which integrates with the popular computer vision library OpenCV, is designed to make it easy for companies to implement and requires no changes to apps. The DARKLY paper, called revolutionary, won the 2014 PET Award for Outstanding Research in Privacy Enhancing Technologies.
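A toy illustration of passing on less data (this is not DARKLY's code; it simply shows, using OpenCV, how a privacy layer might hand an untrusted app a blurred frame or object contours instead of raw pixels):

```python
import cv2

def privacy_filter(frame, level):
    """Return a degraded view of `frame` according to a user-chosen privacy level.
    Illustrative only: the real DARKLY system interposes between OpenCV and apps."""
    if level == "raw":
        return frame
    if level == "blur":
        # Heavy blur: gestures and coarse motion survive, printed text does not.
        return cv2.GaussianBlur(frame, (51, 51), 0)
    if level == "contours":
        # Only object outlines: enough for something like hand-gesture detection.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        return cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)
    raise ValueError(f"unknown privacy level: {level}")

# Example: degrade one camera frame before an app ever sees it.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    safe_frame = privacy_filter(frame, "blur")
cap.release()
```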
Less data for more privacy
Making it easy to build in safety and privacy mechanisms is critical. Manufacturers have little incentive to construct privacy protections; in any case, determining what data is sensitive is not easy. A single data point by itself (a random security photo of a passerby, for instance) might seem harmless, but combined with another data point or aggregated over time (similar photos over several weeks), it reveals patterns and personal behaviors. The challenge in preventing security leaks is first finding them amid a deluge of data; a single image of a prescription label or credit card might be hidden within entire sequences of images.
Perceptual computing is rife with other such vulnerabilities made possible by devices that see, hear, and sense what goes on in their immediate environment; but the landscape of vulnerabilities is even larger than it would appear since perceptual computing, while creating new vulnerabilities, inherits all the old ones associated with any software, namely buggy code. While Jana gets a bigger space to explore, for the rest of us, it spells potential privacy disaster.
Preventing such an outcome will come from enlisting help from other technology experts. "For finding images with hidden personal information, we need classifiers. Machine learners over the years have learned how to train classifiers to recognize spam, recommend movies, and target ads. Why not train classifiers to find security risks? I'm fortunate to be working within Columbia's Data Science Institute alongside machine learners who can build such classifiers." For guarding against buggy code, Jana imagines adapting program analysis, an existing technology for automatically finding software bugs, so it specifically searches out those bugs that concern security and privacy.
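A minimal sketch of what such a classifier might look like (the features, labels, and threshold here are placeholders invented for illustration; a real detector would need far richer features and real training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row of hand-crafted features per image
# (e.g., amount of fine text-like detail, OCR confidence, card-shaped regions),
# labeled 1 if the image exposed something sensitive, 0 otherwise.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
labels = (features[:, 1] > 0.5).astype(int)   # stand-in labels for the sketch

clf = LogisticRegression().fit(features, labels)

def flag_sensitive(image_features, threshold=0.8):
    """Flag an image for review if the model thinks it likely leaks private data."""
    risk = clf.predict_proba([image_features])[0, 1]
    return risk >= threshold

print(flag_sensitive([0.0, 1.2, -0.3]))
```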
Technology alone, however, isn't the answer. Companies are unlikely to fix privacy problems unless pressured by the public, and Jana sees his role encompassing the policy arena, where he will work to propose and enact workable regulations and legislation to protect data and security.
At least for perceptual computing, Jana says there is an opening to do something about privacy risks. "The field is still relatively new, and we have the chance to build in security from the beginning and make life better so people can trust these devices and use them."
Jana earned his PhD in 2014 from the University of Texas, where he was supported by a Google PhD Fellowship.
Eugene Wu | Fast, accurate enough for the human in the loop: Visualizing and interacting with big data sets
For exploring complex data sets, nothing matches the power of interactive visualizations that let people directly manipulate data and arrange it in new ways. Unfortunately, that level of interactivity is not yet possible for massive data sets.
"Computing power has grown, data sets have grown, what hasn't kept pace is the ability to visualize and interact with all this data in a way that's easy and intuitive for people to understand," says Eugene Wu, who recently received his PhD from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), where he was a member of the database group.
Speed is one important component of visualizing data, but there are others, such as the ease with which interactive visualizations can be created and the ability to help users understand what the results actually say. For his PhD thesis, Wu tackled the latter problem by developing a visualization tool that automatically generates explanations for anomalies in a user's visualization. This matters because while visualizations are very good at showing what is happening in the data, they are not good at explaining why. A visualization might show that company expenses shot up 400% in a single month, and an analyst would naturally want to understand what types of expenditures are responsible. However, the monthly statistic is often computed from thousands or millions of input data points, and identifying a simple description of the exact subset causing the spike (e.g., California shops overspent their budgets) requires laborious, error-prone effort.
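A toy version of the kind of question his tool automates (the data, column names, and scoring rule are invented for illustration; Wu's actual system is far more general):

```python
import pandas as pd

# Hypothetical expense records behind a monthly total that spiked.
expenses = pd.DataFrame({
    "region":   ["CA",   "CA",     "NY",   "TX",     "CA",     "NY"],
    "category": ["rent", "travel", "rent", "travel", "travel", "rent"],
    "amount":   [1000,   9000,     1200,   800,      8500,     1100],
})

# Which single predicate contributes most to the total? A crude proxy for
# "explain the spike": rank predicates by the share of the aggregate that
# would disappear if their rows were removed.
total = expenses["amount"].sum()
candidates = []
for col in ["region", "category"]:
    for value, group_sum in expenses.groupby(col)["amount"].sum().items():
        candidates.append((f"{col} = {value}", group_sum / total))

for predicate, share in sorted(candidates, key=lambda c: -c[1])[:3]:
    print(f"{predicate}: {share:.0%} of the total")
```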
Now starting at Columbia, Wu is broadening the scope of his research and is among the first to look at the challenging problems at the overlap between databases and how people want to interact with and visualize the data in those databases. Today's visualization systems take an all-or-nothing approach: "You either get performance for small data sets using a small set of fixed interactions, or you get full expressiveness with SQL and queries but you have to wait and give up interactivity."
Part of the problem is that the database and the visualization communities have traditionally been separate, with the database side focusing on efficient query processing and accuracy, and the visualization community focusing on usability and interactions. Says Wu, "If you look at visualizations from a database perspective, a lot of it looks like database operations. In both cases, you're computing sums, you're computing common aggregates. We can remove many of the perceived differences between databases and visualization systems." Wu wants to bridge the two sides to operate more closely together so both consider first the expectations and requirements of the human in the loop.
For instance, what does database accuracy mean when a human analyst can't differentiate 3.4 from 3.45 in a scatterplot? A slight relaxation of accuracy requirements—unnoticeable to users—would conserve resources while speeding up query operations. In understanding the boundary between what a human can perceive and what amounts to wasted computations, Wu hopes to develop models of human perception that are both faithful to studies in the Human Computer Interaction and Psychology literatures, and applicable to database and visualization system performance.
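A simple illustration of that trade-off (uniform sampling here stands in for the more principled approximation schemes such a system would actually use):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=3.4, size=10_000_000)   # a large column to aggregate

exact = values.mean()

# A 1% sample answers the same question far more cheaply, and the error is
# typically far smaller than anything a viewer could distinguish in a scatterplot.
sample = values[rng.integers(0, values.size, size=100_000)]
approx = sample.mean()

print(f"exact={exact:.4f}  approx={approx:.4f}  error={abs(exact - approx):.4f}")
```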
On the visualization side, less attention has been paid to the programming languages (like JavaScript) used to construct the visualizations; consequently, visualizations are hard to write and debug, and even harder to scale. A similar situation once prevailed in the database world, where application developers wrote complex and brittle code to fetch data from their databases; but the invention of SQL, a high-level, declarative language, made it easier for developers to express relationships within the data without worrying about the underlying data representations, paving the way toward today's ubiquitous use of data.
For Wu, the natural progression is to extend the declarative approach to interactive visualizations. With colleagues at Berkeley and University of Washington, Wu is designing a declarative visualization language to provide a set of logical operations and mappings that would free programmers from implementation details so they can logically state what they want while letting the database figure out the best way to do it.
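To make the contrast concrete, here is a hypothetical, heavily simplified declarative spec and the query it might compile to (the spec format and compiler are invented for illustration and are not the language Wu and his collaborators are designing):

```python
# A declarative spec states *what* to show, not how to fetch or render it.
spec = {
    "data": "expenses",
    "mark": "bar",
    "x": {"field": "month"},
    "y": {"field": "amount", "aggregate": "sum"},
}

def compile_to_sql(spec):
    """Translate the toy spec into the aggregation query the database will optimize."""
    x = spec["x"]["field"]
    y = spec["y"]["field"]
    agg = spec["y"].get("aggregate", "sum").upper()
    return (f"SELECT {x}, {agg}({y}) AS {y} "
            f"FROM {spec['data']} GROUP BY {x}")

print(compile_to_sql(spec))
# SELECT month, SUM(amount) AS amount FROM expenses GROUP BY month
```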
A declarative language for visualization would have additional benefits. "Once you have a high-level language capable of expressing analyses, all of these analysis tools, such as the explanatory analysis from my thesis, are in a sense baked into whatever you build; they come for free. There will be less need for individuals to write their own ad hoc analysis programs."
As interactions become portable and sharable, they can be copied and pasted from one interactive visualization to another for someone else to modify. And it becomes easier to build tools, which fits with Wu's focus on making data accessible and understandable to all users.
"When a diverse group of people look at the same data, the questions you get are more interesting than if just other computer scientists or business people are asking questions." One of the attractions for Wu in coming to Columbia is the chance to work within the Data Science Institute and collaborate with researchers from across the university, all sharing ideas on new ways to investigate data. "Columbia has a huge range of leaders in nearly every discipline from Journalism, to Bioinformatics to Government studies. Our use of data is ultimately driven by the applications built on top, and I'm excited about working on research that can help improve and benefit from the depth and breath of research at the university."
B.S., Electrical Engineering and Computer Science, UC Berkeley, 2007; M.S. and Ph.D., Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2011 and 2014 respectively
- Linda Crane
Posted Fall 2015