Papers from the Wu Lab Accepted to VLDB 2023

Four papers from the Wu Lab were presented at the 49th International Conference on Very Large Data Bases (VLDB 2023). VLDB features research talks, tutorials, demonstrations, and workshops on issues in data management, database, and information systems research.

JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu

Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries…with only SQL?

We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
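To make the variance semi-ring mentioned in the abstract more concrete, here is a minimal Python sketch. It is not JoinBoost's code or its generated SQL. It assumes the common formulation in which each element is a triple (count, sum of Y, sum of Y squared), and the names `Var`, `factorized_aggregate`, and `materialized_aggregate` are hypothetical. The sketch shows that the statistics needed for a variance-based split criterion can be computed per join key and combined, without materializing the join.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Variance semi-ring element: (count, sum of Y, sum of Y^2)."""
    c: float
    s: float
    q: float

    def __add__(self, other):
        # Addition combines statistics of disjoint sets of rows.
        return Var(self.c + other.c, self.s + other.s, self.q + other.q)

    def __mul__(self, other):
        # Multiplication combines statistics across a join (cross product
        # of matching rows), where Y of a joined row is the sum of both sides.
        return Var(
            self.c * other.c,
            self.s * other.c + other.s * self.c,
            self.q * other.c + other.q * self.c + 2 * self.s * other.s,
        )


ZERO = Var(0, 0, 0)


def factorized_aggregate(r_rows, s_rows):
    """Aggregate over R join S without materializing the join.

    r_rows holds (key, y) pairs; s_rows holds (key, feature) pairs.
    """
    per_key_r = defaultdict(lambda: ZERO)
    for key, y in r_rows:
        per_key_r[key] = per_key_r[key] + Var(1, y, y * y)
    per_key_s = defaultdict(lambda: ZERO)
    for key, _ in s_rows:
        per_key_s[key] = per_key_s[key] + Var(1, 0, 0)  # S carries no Y
    total = ZERO
    for key, r_stat in per_key_r.items():
        if key in per_key_s:
            total = total + r_stat * per_key_s[key]
    return total


def materialized_aggregate(r_rows, s_rows):
    """Reference version: enumerate every joined row explicitly."""
    total = ZERO
    for key_r, y in r_rows:
        for key_s, _ in s_rows:
            if key_r == key_s:
                total = total + Var(1, y, y * y)
    return total


def variance(stat):
    """Variance of Y recovered from the aggregated triple."""
    mean = stat.s / stat.c
    return stat.q / stat.c - mean * mean
```

For example, with `r = [(1, 2), (1, 4), (2, 10)]` and `s = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]`, both functions return the same triple, but the factorized version never builds the five joined rows.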

Saibot: A Differentially Private Data Search Platform
Zezhou Huang, Jiaxiang Liu, Daniel Alabi, Raul Castro Fernandez, Eugene Wu

Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations (join or union compatible datasets) that, when used to augment the requester’s dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets.

We present Saibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50 to 90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
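The idea of computing statistics once and reusing them across candidate augmentations can be illustrated with a small Python sketch. This is not Saibot's Factorized Privacy Mechanism: the privatization (noise) step is deliberately omitted, only the union case with a single feature is shown, and the names `Stats`, `summarize`, and `fit` are hypothetical. It shows only that linear regression can be fit from additive sufficient statistics, so a union of datasets can be evaluated without touching the raw rows again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics for one-feature linear regression."""
    n: float
    sx: float
    sy: float
    sxx: float
    sxy: float

    def __add__(self, other):
        # Statistics of a union of datasets are the sums of the parts.
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def summarize(points):
    """Compute the statistics once for a dataset of (x, y) pairs."""
    return Stats(
        n=len(points),
        sx=sum(x for x, _ in points),
        sy=sum(y for _, y in points),
        sxx=sum(x * x for x, _ in points),
        sxy=sum(x * y for x, y in points),
    )


def fit(stats):
    """Least-squares slope and intercept from the statistics alone."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Hypothetical requester dataset and one union-compatible candidate.
requester = [(0, 1), (1, 3)]
candidate = [(2, 5), (3, 7)]

# Each dataset is summarized once; the summaries are then reused.
requester_stats = summarize(requester)
candidate_stats = summarize(candidate)
augmented_model = fit(requester_stats + candidate_stats)
```

In a private setting, noise would be added to each dataset's statistics a single time before any reuse; how that noise is calibrated is the subject of the paper and is not reproduced here.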

Pollock: A Data Loading Benchmark
Gerardo Vitagliano, Mazhar Hameed, Lan Jiang, Lucas Reisener, Eugene Wu, Felix Naumann

Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is csv. Yet, the plain-text and flexible nature of this format often makes such files difficult to parse and load correctly, requiring cumbersome data preparation steps.

We propose a benchmark to assess the robustness of systems in loading data from non-standard csv formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic “pollution” process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the csv format. To guide pollution, we have surveyed thousands of real-world, publicly available csv files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular csv parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
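As a rough illustration of what a non-standard csv dialect does to a loader, the following Python sketch writes the same rows with a different delimiter and then loads them twice: once with default settings and once with the correct dialect. It uses only the standard library `csv` module. It is not the Pollock pollution framework, and the helper names `write_csv` and `load_csv` are hypothetical.

```python
import csv
import io


def write_csv(rows, delimiter=",", quotechar='"'):
    """Serialize rows to csv text using the given dialect parameters."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter=delimiter, quotechar=quotechar, lineterminator="\n"
    )
    writer.writerows(rows)
    return buf.getvalue()


def load_csv(text, delimiter=",", quotechar='"'):
    """Parse csv text into a list of rows using the given dialect."""
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quotechar=quotechar
    )
    return list(reader)


rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]

# "Pollute" the file by switching the delimiter to a semicolon.
polluted = write_csv(rows, delimiter=";")

# A loader that assumes the standard dialect collapses each row into one cell.
naive = load_csv(polluted)

# A loader that knows (or detects) the dialect recovers the original rows.
aware = load_csv(polluted, delimiter=";")
```

Comparing `naive` and `aware` against `rows` gives a simple pass/fail check per file; a benchmark would presumably repeat this over many kinds of pollution and score systems more finely.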

ConnectorX: Accelerating Data Loading From Databases to Dataframes
Xiaoying Wang, Weiyuan Wu, Jinze Wu, Yizhou Chen, Nick Zrymiak, Changbo Qu, Lampros Flokas, George Chow, Jiannan Wang, Tianzheng Wang, Eugene Wu, Qingqing Zhou

Data is often stored in a database management system (DBMS) but dataframe libraries are widely used among data scientists. An important but challenging problem is how to bridge the gap between databases and dataframes. To solve this problem, we present ConnectorX, a client library that enables fast and memory-efficient data loading from various databases to different dataframes.

We first investigate why the loading process is slow and consumes a large amount of memory. Surprisingly, we find that the main overhead comes from the client side rather than query execution or data transfer. We integrate several existing and new techniques to reduce the overhead and carefully design the system architecture and interface to make ConnectorX easy to extend to various databases and dataframes. Moreover, we propose server-side result partitioning that can be adopted by DBMSs in order to better support exporting data to data science tools. We conduct extensive experiments to evaluate ConnectorX and compare it with popular libraries. The results show that ConnectorX significantly outperforms existing solutions. ConnectorX is open sourced at: https://github.com/sfu-db/connector-x.

Computer Security Pioneer Steve Bellovin First to Win Two USENIX Flame Awards

Bellovin shares his second lifetime award with Tufts’ Susan Landau and Georgetown’s Matt Blaze for their work on computer science, computer security, law, and public policy.

In Memoriam: Stephen H. Unger

Columbia Engineering mourns the passing of Stephen H. Unger, Professor Emeritus of Computer Science and Electrical Engineering at Columbia University. He passed away on July 4, 2023. Unger was 92 years old.

A pioneer in the fields of logic circuit design, software engineering, and technology policy, Unger worked at Bell Telephone Laboratories, where he developed software tools for the first electronic telephone switching system.

In 1961, he left Bell Labs to teach courses on technology and society at the Electrical Engineering Department at Columbia Engineering until his retirement in 2008. He was one of three tenured professors who joined the newly formed Computer Science Department in 1979, along with Theodore Bashkow from Electrical Engineering and Jonathan Gross from the Mathematical Statistics Department.

Together with Professor Emeritus Steven Nowick and Professor Charles A. Zukowski of Electrical Engineering, he founded the Computer Engineering program in 1993. The program is joint between the CS and EE departments and offers undergraduate and MS degrees. Unger also served as Department Chair of the program for several years.

A prolific researcher and writer, he is credited as one of the founders of the theory of asynchronous circuits. He authored the definitive early textbook Asynchronous Sequential Switching Circuits (1969) and The Essence of Logic Circuits (1989), which covers logic circuits’ fundamentals and applications.

His paper with M.C. Paull, “Minimizing the Number of States in Incompletely Specified Sequential Switching Functions,” addressed one of the most challenging early digital design optimization problems and produced a novel solution framework. This work was influential, opening the way to research on a host of advanced digital CAD (computer-aided design) problems.

Unger’s 1958 paper “A Computer Oriented Toward Spatial Problems” is one of the seminal early contributions to parallel computers. This foundational work first introduced the idea of using a spatial array of processors, all operating under the same instructions but on different data items. Such a SIMD (single-instruction multiple-data) style architecture is now a foundation of a large segment of the parallel computing industry.

Unger was a Fellow of the IEEE and the AAAS and received several awards for his contributions to the profession and society. In 1969, he helped found and later became president of the IEEE Society on Social Implications of Technology, which deals with the ethical and social issues related to technology. He also played a principal role in the development of the original IEEE Ethics Code and its 1990 revision, which provides guidelines for engineers to act responsibly and ethically in their profession.

Throughout his career, he was a respected and influential figure in the field of computer science and engineering ethics. Unger received many awards and honors for his work, such as the IEEE Centennial Medal, the IEEE USAB Distinguished Contributions to Engineering Professionalism Award, the IEEE Millennium Medal, and the Guggenheim Fellowship. Even in retirement, he continued to share his opinions on ethics and a variety of topics on his Ends and Means blog.

Unger earned a master’s degree and PhD in electrical engineering from the Massachusetts Institute of Technology. He received his electrical engineering degree from the Polytechnic Institute of Brooklyn (now the New York University Tandon School of Engineering) and graduated from the Brooklyn Technical High School.

Tributes From CS Faculty

Steven Nowick
Steve Unger taught me in his Computer Organization course at Columbia in 1986, when I was a non-degree special student, before going off for my PhD at Stanford. He was instrumental in hiring me as an assistant professor in the Columbia CS department in 1993.

At Columbia, we ran a joint research seminar for many years, engaging closely with each other’s students and exploring new research directions. I greatly enjoyed our interactions and his insights and creativity in approaching new problems. Even in areas he hadn’t worked on, he “cut to the core” quickly, with provocative questions and suggestions on new directions.

Steve was an inspiring mentor, colleague, and friend to me over many years. He made major contributions to research and education at Columbia. I valued our many years working together and was deeply influenced by his approach to research, teaching, and life. He will be missed.

John Kender
Steve had strongly held and often flamboyantly defended opinions. A few of them that I remember:

For many years, he was in charge of CS MS admissions, back when it could be done by one person unassisted. He was a zealous enforcer of the checklist of eight prerequisite CS courses, more than half of which were 4000-level courses required for the BS (for example, AI and PLT). He would admit students in deficit, but they would have to take those courses without MS credit. He also demanded that the MS degree require four 6000-level courses, as in the EE MS program. But because of CS manpower issues in those early days, it was cut back to three, then later two. Throughout, he insisted on defending a clear distinction between the BS and MS, until he was eventually assigned a different service responsibility.

He was a fierce opponent of the Columbia Video Network, concerned that it denied the importance of faculty-student contact and that it enabled students to cheat their way to an MS. This is back when CVN had only students from two industrial affiliates, IBM and Bell Labs, and when courses were literally taped and copies on VHS cassettes were priority-mailed offsite. I do not recall any vote of his in favor of any CVN enhancement, ever.

He gave a series of talks in the early to mid-1980s against the Reagan “Star Wars” Strategic Defense Initiative for a nationwide missile-defense system. He filled the largest CS classroom then, 535 Mudd (before it was split in half). I recall the intensity of his talks, audiences, and colorful examples. One in particular: “People say, if we can put a man on the moon, why can’t we build this? Well, the moon isn’t surrounded by decoys! The moon did not take evasive action!” The eventual defeat of the program was perhaps his clearest win.

More personally speaking, he lived in New Jersey in Englewood, the borough next to mine. I remember asking him why he hadn’t chosen Leonia instead, which at one point was home to five CS profs. He said he chose Englewood because it was the most racially integrated nearby borough and that he felt he should practice equality as much as preach it.

And there is the incident I remember that most clearly captured his style in a single sentence. He announced to the then-traditional “Hello Meeting,” where the entire department assembled in one room in early Fall to introduce themselves to one another: “Despite the total absence of rumors to the contrary, I have now remarried!”

Even if you disagreed with him–and often many did, some reflexively–he was informed and articulate enough to leave you thinking. He enjoyed his tenure and wasn’t shy about using it for what he perceived to be the public good of students and of society at large.

Donald Ferguson
I remember Professor Unger from my time in the PhD program at Columbia. Professor Unger was one of the foundation stones of the department. I always admired and appreciated his focus on technology’s social, political, and economic impacts. I vividly remember Professor Unger thoughtfully leading a department colloquium to discuss President Reagan’s Strategic Defense Initiative. His leadership nudged the world in a thoughtful direction.

Vishal Misra
Steve was a great guy – the first course I taught at Columbia was digital systems, and he was very generous to me with all his notes and time to help a nervous young assistant professor get by.

Salvatore Stolfo
He was a wonderful man with quirks that are embedded in my memories of him. Besides being a principled and honorable man, he was amazingly adept at peeling an apple at faculty meetings, producing a single unbroken peel that sometimes seemed longer than two feet. He smiled when he completed his task before eating the skinned apple. Little pleasures appealed to him.

Simha Sethumadhavan
He retired the year I started. But his ethics articles, which he continued to share with the faculty on the mailing list even after retirement, were original and insightful.

He also made foundational contributions to Computer Architecture. I recall that his 1958 paper was cited in the first or second edition of the standard graduate computer architecture textbook “Computer Architecture: A Quantitative Approach” by John Hennessy and David Patterson as the beginning of the “single instruction multiple data” execution paradigm.

The paradigm he proposed was used in supercomputers (CRAY) in the late 60s/early 70s, found its way into Intel processors in the mid-90s (Multimedia extensions – MMX) and then into image processing systems on phones in the early 2000s, and informed the “Single Instruction Multiple Thread” paradigm that powers all GPUs today.

I was unable to find the old edition to confirm the citation in the textbook, but a survey from 1999 “Managing Control Asynchrony on SIMD Machines-a Survey” by Nael Abu-Ghazaleh and Philip Wilsey says the following:

As a historical aside, SIMD machines were first suggested by Unger (Unger, 1958). The first machines to be designed were the SOLOMON (Slotnick et al., 1962) at Westinghouse and the Illiac at the University of Illinois, which also was the first SIMD machine to be built (Barnes et al., 1968).

Year: 2023