Eugene Wu | Fast, accurate enough for the human in the loop: Visualizing and interacting with big data sets
For exploring complex data sets, nothing matches the power of interactive visualizations that let people directly manipulate data and arrange it in new ways. Unfortunately, that level of interactivity is not yet possible for massive data sets.
"Computing power has grown, data sets have grown, what hasn't kept pace is the ability to visualize and interact with all this data in a way that's easy and intuitive for people to understand," says Eugene Wu, who recently received his PhD from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), where he was a member of the database group.
Speed is one important component for visualizing data, but there are others, such as the ease with which interactive visualizations can be created and the ability to help understand what the results actually say. For his PhD thesis, Wu tackled the last of these problems by developing a visualization tool that automatically generates explanations for anomalies in a user's visualization. This is important because while visualizations are very good at showing what's happening in the data, they are not good at explaining why. A visualization might show that company expenses shot up 400% in a single month, and an analyst would naturally want to understand what types of expenditures are responsible. However, the monthly statistic is often computed from thousands or millions of input data points, and identifying a simple description of the exact subset causing the spike (e.g., California shops overspent their budgets) requires laborious, error-prone effort.
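To make the expense example concrete, here is a minimal sketch of the general idea of explaining a spike by finding the subset of inputs responsible for it. It is not Wu's tool or algorithm: the records, the attribute names, and the `rank_explanations` helper are all hypothetical, and the scoring is a deliberately naive one that removes each candidate subset and measures how much of the spike disappears.

```python
# Hypothetical expense records: (month, state, category, amount).
EXPENSES = [
    ("Jan", "CA", "supplies", 100.0),
    ("Jan", "NY", "supplies", 100.0),
    ("Jan", "CA", "travel", 50.0),
    ("Jan", "NY", "travel", 50.0),
    ("Feb", "CA", "supplies", 900.0),
    ("Feb", "NY", "supplies", 110.0),
    ("Feb", "CA", "travel", 140.0),
    ("Feb", "NY", "travel", 50.0),
]

# Attribute name -> position in the record tuple (assumed schema).
ATTRIBUTES = {"state": 1, "category": 2}


def monthly_total(rows, month):
    """The aggregate a chart would plot: total spending in one month."""
    return sum(r[3] for r in rows if r[0] == month)


def rank_explanations(rows, normal_month, spike_month):
    """Score each 'attribute = value' predicate by how much of the spike
    vanishes when the rows matching that predicate are removed."""
    base_gap = monthly_total(rows, spike_month) - monthly_total(rows, normal_month)
    scored = []
    for name, idx in ATTRIBUTES.items():
        for value in sorted({r[idx] for r in rows}):
            kept = [r for r in rows if r[idx] != value]
            gap = monthly_total(kept, spike_month) - monthly_total(kept, normal_month)
            scored.append((base_gap - gap, f"{name} = {value}"))
    # Largest reduction in the spike first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    for reduction, predicate in rank_explanations(EXPENSES, "Jan", "Feb"):
        print(f"{predicate:22s} removes {reduction:6.1f} of the increase")
```

In this toy data, removing the California rows eliminates nearly all of the February increase, so "state = CA" ranks first. With millions of rows and many attributes the number of candidate subsets explodes, which is why doing this by hand is laborious and error-prone.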
Now starting at Columbia, Wu is broadening the scope of his research and is among the first looking at the challenging problems in the overlap between databases and how people want to interact with and visualize the data in those databases. Visualization systems currently being built must take an all-or-nothing approach. "You either get performance for small data sets using a small set of fixed interactions, or you get full expressiveness with SQL and queries but you have to wait and give up interactivity."
Part of the problem is that the database and the visualization communities have traditionally been separate, with the database side focusing on efficient query processing and accuracy, and the visualization community focusing on usability and interactions. Says Wu, "If you look at visualizations from a database perspective, a lot of it looks like database operations. In both cases, you're computing sums, you're computing common aggregates. We can remove many of the perceived differences between databases and visualization systems." Wu wants to bring the two sides closer together so that both consider first the expectations and requirements of the human in the loop.
For instance, what does database accuracy mean when a human analyst can't differentiate 3.4 from 3.45 in a scatterplot? A slight relaxation of accuracy requirements, unnoticeable to users, would conserve resources while speeding up query operations. In understanding the boundary between what a human can perceive and what amounts to wasted computation, Wu hopes to develop models of human perception that are both faithful to studies in the Human-Computer Interaction and Psychology literatures, and applicable to database and visualization system performance.
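The 3.4 versus 3.45 question can be illustrated with a rough sketch that uses screen resolution as a crude stand-in for perception. This is an assumption made for illustration, not one of the perceptual models Wu describes; the function names and the axis sizes are hypothetical.

```python
def pixel_of(value, axis_min, axis_max, axis_pixels):
    """Map a data value to the pixel position where it would be drawn
    on a linear axis that is `axis_pixels` pixels long."""
    fraction = (value - axis_min) / (axis_max - axis_min)
    return round(fraction * (axis_pixels - 1))


def indistinguishable(a, b, axis_min, axis_max, axis_pixels):
    """True if two values land on the same pixel, so extra precision
    in the query result could not show up in the plot."""
    return (pixel_of(a, axis_min, axis_max, axis_pixels)
            == pixel_of(b, axis_min, axis_max, axis_pixels))


if __name__ == "__main__":
    # Wide axis range: both values fall on the same pixel.
    print(indistinguishable(3.4, 3.45, 0, 100, 400))
    # Narrow axis range: the same two values are several pixels apart.
    print(indistinguishable(3.4, 3.45, 0, 5, 400))
```

On a 400-pixel axis spanning 0 to 100, both values are drawn at the same pixel, so computing the second decimal place is wasted effort for that chart. Zoom the axis to 0 to 5 and the two values separate, which suggests the acceptable error depends on how the result is displayed rather than on the data alone.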
For Wu, the natural progression is to extend the declarative approach to interactive visualizations. With colleagues at Berkeley and University of Washington, Wu is designing a declarative visualization language to provide a set of logical operations and mappings that would free programmers from implementation details so they can logically state what they want while letting the database figure out the best way to do it.
A declarative language for visualization would have additional benefits. "Once you have a high-level language capable of expressing analyses, all of these analysis tools such as the explanatory analysis from my thesis is in a sense baked into whatever you build; it comes for free. There will be less need for individuals to write their own ad hoc analysis programs."
As interactions become portable and sharable, they can be copied and pasted from one interactive visualization to another for someone else to modify. And it becomes easier to build tools, which fits with Wu's focus on making data accessible and understandable to all users.
"When a diverse group of people look at the same data, the questions you get are more interesting than if just other computer scientists or business people are asking questions." One of the attractions for Wu in coming to Columbia is the chance to work within the Data Science Institute and collaborate with researchers from across the university, all sharing ideas on new ways to investigate data. "Columbia has a huge range of leaders in nearly every discipline from Journalism, to Bioinformatics to Government studies. Our use of data is ultimately driven by the applications built on top, and I'm excited about working on research that can help improve and benefit from the depth and breadth of research at the university."
- Linda Crane