Eugene Wu | Fast, accurate enough for the human in the loop: Visualizing and interacting with big data sets
For exploring complex data sets, nothing matches the power of interactive visualizations that let people directly manipulate data and arrange it in new ways. Unfortunately, that level of interactivity is not yet possible for massive data sets.
“Computing power has grown, data sets have grown, what hasn’t kept pace is the ability to visualize and interact with all this data in a way that’s easy and intuitive for people to understand,” says Eugene Wu, who recently received his PhD from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), where he was a member of the database group.
Speed is one important component of visualizing data, but there are others, such as the ease with which interactive visualizations can be created and the ability to help users understand what the results actually say. For his PhD thesis, Wu tackled the last of these by developing a visualization tool that automatically generates explanations for anomalies in a user’s visualization. This is important because while visualizations are very good at showing what’s happening in the data, they are not good at explaining why. A visualization might show that company expenses shot up 400% in a single month, and an analyst would naturally want to understand what types of expenditures are responsible.
However, the monthly statistic is often computed from thousands or millions of input data points, and identifying a simple description of the exact subset causing the spike (e.g., California shops overspent their budgets) requires laborious, error-prone effort.
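To make the idea concrete, here is a minimal sketch of what "explaining" a spike can mean. It is a brute-force illustration only, not the algorithm from Wu's thesis: it compares a baseline month with the spike month and ranks simple one-attribute descriptions (such as state = CA) by how much of the increase they account for. The function name `explain_spike`, the field names, and the sample figures are all hypothetical.

```python
from collections import defaultdict

def explain_spike(baseline_rows, spike_rows, attributes, measure="amount"):
    """Rank (attribute, value) predicates by their share of the increase.

    For every predicate, sum the measure over matching rows in the spike
    month and subtract the sum over matching rows in the baseline month.
    """
    delta = defaultdict(float)
    for sign, rows in ((-1, baseline_rows), (+1, spike_rows)):
        for row in rows:
            for attr in attributes:
                delta[(attr, row[attr])] += sign * row[measure]
    # Largest increase first: the top predicates are candidate explanations.
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)

# Hypothetical expense records for two consecutive months.
baseline = [
    {"state": "CA", "category": "travel", "amount": 100.0},
    {"state": "NY", "category": "travel", "amount": 100.0},
    {"state": "CA", "category": "supplies", "amount": 50.0},
    {"state": "NY", "category": "supplies", "amount": 50.0},
]
spike = [
    {"state": "CA", "category": "travel", "amount": 700.0},
    {"state": "NY", "category": "travel", "amount": 110.0},
    {"state": "CA", "category": "supplies", "amount": 340.0},
    {"state": "NY", "category": "supplies", "amount": 50.0},
]

ranked = explain_spike(baseline, spike, ["state", "category"])
for (attr, value), increase in ranked:
    print(f"{attr} = {value}: {increase:+.1f}")
```

In this made-up data the predicate state = CA accounts for most of the increase. With thousands or millions of rows and many attributes, the number of candidate descriptions grows quickly, which is why doing this by hand is laborious and error-prone.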
Now starting at Columbia, Wu is broadening the scope of his research and is among the first looking at the challenging problems in the overlap between databases and how people want to interact with and visualize the data in those databases. Visualization systems currently being built must take an all-or-nothing approach. “You either get performance for small data sets using a small set of fixed interactions, or you get full expressiveness with SQL and queries but you have to wait and give up interactivity.”
Part of the problem is that the database and the visualization communities have traditionally been separate, with the database side focusing on efficient query processing and accuracy, and the visualization community focusing on usability and interactions. Says Wu, “If you look at visualizations from a database perspective, a lot of it looks like database operations. In both cases, you’re computing sums, you’re computing common aggregates. We can remove many of the perceived differences between databases and visualization systems.” Wu wants to bridge the two sides so that they work more closely together and both consider first the expectations and requirements of the human in the loop.
For instance, what does database accuracy mean when a human analyst can’t differentiate 3.4 from 3.45 in a scatterplot? A slight relaxation of accuracy requirements, unnoticeable to users, would conserve resources while speeding up query operations. In understanding the boundary between what a human can perceive and what amounts to wasted computation, Wu hopes to develop models of human perception that are both faithful to studies in the human-computer interaction and psychology literature and applicable to database and visualization system performance.
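One very simple way to reason about this boundary is in terms of screen pixels. The sketch below is an illustration under stated assumptions, not a model of human perception: it assumes a linear axis from 0 to 100 drawn across 500 pixels (both numbers are made up) and treats two values as indistinguishable when they land on the same pixel. The function names are hypothetical.

```python
def to_pixel(value, axis_min, axis_max, pixels):
    """Map a data value to the nearest pixel position on a linear axis."""
    fraction = (value - axis_min) / (axis_max - axis_min)
    return round(fraction * (pixels - 1))

def indistinguishable(a, b, axis_min, axis_max, pixels):
    """True if both values would be drawn at the same pixel position."""
    return (to_pixel(a, axis_min, axis_max, pixels)
            == to_pixel(b, axis_min, axis_max, pixels))

def units_per_pixel(axis_min, axis_max, pixels):
    """Data units covered by one pixel step; finer precision is invisible."""
    return (axis_max - axis_min) / (pixels - 1)

# Assumed display: axis from 0 to 100, 500 pixels wide.
same = indistinguishable(3.4, 3.45, 0, 100, 500)
different = indistinguishable(3.4, 4.0, 0, 100, 500)
step = units_per_pixel(0, 100, 500)
print(same, different, round(step, 3))
```

Under these assumptions, a query answer only needs to be accurate to roughly one pixel step. Real perceptual limits depend on much more than pixel resolution, which is the gap the perception models described above are meant to fill.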
For Wu, the natural progression is to extend the declarative approach of database query languages such as SQL to interactive visualizations. With colleagues at Berkeley and the University of Washington, Wu is designing a declarative visualization language that provides a set of logical operations and mappings, freeing programmers from implementation details so they can state what they want and let the database figure out the best way to do it.
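As a rough illustration of the declarative idea, the sketch below describes a bar chart as plain data and translates it into a SQL aggregate query that a database can then execute however it sees fit. The spec format, the `compile_to_sql` function, and the `expenses` table are invented for this example and are not the language Wu and his colleagues are designing; the example uses SQLite only because it ships with Python.

```python
import sqlite3

# Hypothetical declarative spec: it states WHAT to show, not HOW to compute it.
spec = {
    "table": "expenses",
    "mark": "bar",
    "x": {"field": "month"},
    "y": {"field": "amount", "aggregate": "sum"},
}

ALLOWED_AGGREGATES = {"sum", "avg", "min", "max", "count"}

def compile_to_sql(spec):
    """Translate the spec into a GROUP BY query for the database to run."""
    table = spec["table"]
    x_field = spec["x"]["field"]
    y_field = spec["y"]["field"]
    aggregate = spec["y"]["aggregate"]
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {aggregate}")
    return (
        f'SELECT "{x_field}", {aggregate}("{y_field}") '
        f'FROM "{table}" GROUP BY "{x_field}" ORDER BY "{x_field}"'
    )

# Small in-memory database with made-up expense rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?)",
    [("2014-01", 100.0), ("2014-01", 200.0), ("2014-02", 1200.0)],
)

query = compile_to_sql(spec)
bars = conn.execute(query).fetchall()  # one (x, y) pair per bar
conn.close()
print(query)
print(bars)
```

Because the chart is described as data rather than as drawing code, the same description could in principle be optimized, analyzed, or reused by other tools without the programmer rewriting anything.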
A declarative language for visualization would have additional benefits. “Once you have a high-level language capable of expressing analyses, all of these analysis tools, such as the explanatory analysis from my thesis, are in a sense baked into whatever you build; it comes for free. There will be less need for individuals to write their own ad hoc analysis programs.”
As interactions become portable and sharable, they can be copied and pasted from one interactive visualization to another for someone else to modify. And it becomes easier to build tools, which fits with Wu’s focus on making data accessible and understandable to all users.
“When a diverse group of people look at the same data, the questions you get are more interesting than if just other computer scientists or business people are asking questions.” One of the attractions for Wu in coming to Columbia is the chance to work within the Data Science Institute and collaborate with researchers from across the university, all sharing ideas on new ways to investigate data. “Columbia has a huge range of leaders in nearly every discipline, from Journalism to Bioinformatics to Government studies. Our use of data is ultimately driven by the applications built on top, and I’m excited about working on research that can help improve and benefit from the depth and breadth of research at the university.”
B.S., Electrical Engineering and Computer Science, UC Berkeley, 2007; M.S. and Ph.D., Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2011 and 2014 respectively
– Linda Crane
Dean Boyce's statement on amicus brief filed by President Bollinger
President Bollinger announced that Columbia University along with many other academic institutions (sixteen, including all Ivy League universities) filed an amicus brief in the U.S. District Court for the Eastern District of New York challenging the Executive Order regarding immigrants from seven designated countries and refugees. Among other things, the brief asserts that “safety and security concerns can be addressed in a manner that is consistent with the values America has always stood for, including the free flow of ideas and people across borders and the welcoming of immigrants to our universities.”
This recent action provides a moment for us to collectively reflect on our community within Columbia Engineering and the importance of our commitment to maintaining an open and welcoming community for all students, faculty, researchers and administrative staff. As a School of Engineering and Applied Science, we are fortunate to attract students and faculty from diverse backgrounds, from across the country, and from around the world. It is a great benefit to be able to gather engineers and scientists of so many different perspectives and talents – all with a commitment to learning, a focus on pushing the frontiers of knowledge and discovery, and with a passion for translating our work to impact humanity.
I am proud of our community, and wish to take this opportunity to reinforce our collective commitment to maintaining an open and collegial environment. We are fortunate to have the privilege to learn from one another, and to study, work, and live together in such a dynamic and vibrant place as Columbia.
Mary C. Boyce
Dean of Engineering
Morris A. and Alma Schapiro Professor