Four papers from Columbia at Crypto 2018

CS researchers worked with collaborators from numerous universities to explore and develop cryptographic tools. The papers were presented at the 38th Annual International Cryptology Conference.

Correcting Subverted Random Oracles
Authors: Alexander Russell (University of Connecticut), Qiang Tang (New Jersey Institute of Technology), Moti Yung (Columbia University), Hong-Sheng Zhou (Virginia Commonwealth University)

Many cryptographic operations rely on cryptographic hash functions: structure-less functions that behave like what is called a random oracle (a random-looking public function). The random oracle methodology has proven to be a powerful tool for designing and reasoning about cryptographic schemes, and often acts as an effective bridge between theory and practice.

These functions are used to build blockchains for cryptocurrencies like Bitcoin, and to perform “password hashing,” where a system stores a hashed version of each user’s password: when a user presents a password, it is hashed and compared to the previously stored value in order to identify the user. A good hash function garbles its input so the output looks random, and from the output it is hard to find the input, or any other input that maps to the same output.
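
To make the store-then-compare flow concrete, here is a minimal sketch in Python using PBKDF2 from the standard library. It is only an illustration of the idea above; a production system would follow its platform’s password-storage guidance.

```python
import hashlib
import os

def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2 applies the hash many times, deliberately slowing down
    # brute-force guessing of the stored value.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

# At registration: store only the salt and the hash, never the password.
salt = os.urandom(16)
stored = hash_password("correct horse battery staple", salt)

# At login: hash the presented password and compare with the stored value.
attempt = hash_password("correct horse battery staple", salt)
print(attempt == stored)  # True only if the right password was presented
```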

The paper is in the area of implementations of cryptographic functions that are subverted (i.e., implemented incorrectly or maliciously). “We deal with a hash function in which some entries of the function table are corrupted by the opponent – a bad guy who implemented it differently than the specification,” shared one of the researchers, Moti Yung of Columbia University. “The number of corruptions is small relative to the size of the function. What we explored is what the bad guys can do with such a function, and whether we can protect against them.”

First, the paper shows that devastating attacks are possible with this modification of the function: in the two applications above, the bad guys can control the blockchain, or allow an impersonator to log in on behalf of a user, respectively.

Then, to counter the attack, the researchers demonstrated how to manipulate the function (apply it a few times and combine the results in different ways) so that the resulting new function prevents such attacks.
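
The flavor of the fix can be sketched in a few lines. The combiner below, which XORs several shifted evaluations of the suspect function, is only an illustration of the “apply it a few times and combine” idea; it is not the paper’s exact construction, whose correctness proof is the hard part.

```python
import hashlib
import os
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(u ^ v for u, v in zip(a, b))

def suspect_hash(x: bytes) -> bytes:
    # Stand-in for the possibly subverted implementation.
    return hashlib.sha256(x).digest()

# Public random shifts, chosen after the suspect implementation is fixed.
SHIFTS = [os.urandom(32) for _ in range(3)]

def corrected_hash(x: bytes) -> bytes:
    # Query the suspect function at several shifted points and XOR the
    # answers, so that a few corrupted table entries are unlikely to
    # line up with any particular input.
    x = x.ljust(32, b"\0")[:32]  # normalize to 32 bytes for the XOR shift
    return reduce(xor, (suspect_hash(xor(x, r)) for r in SHIFTS))
```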

In this paper, the focus is on the basic problem of correcting faulty—or adversarially corrupted—random oracles, so that they can be confidently applied for such cryptographic purposes.

“The construction is not too complex,” said Yung, a computer science professor. “The proof that it works, on the other hand, required the development of new techniques that show that the manipulation produces a hash function that behaves like a random oracle in spite of the manipulation by the bad guy.”

Hardness of Non-Interactive Differential Privacy from One-Way Functions
Authors: Lucas Kowalczyk (Columbia University), Tal Malkin (Columbia University), Jonathan Ullman (Northeastern University), Daniel Wichs (Northeastern University)

How can one know whether the way they are using data protects the privacy of the individuals to whom the data belongs? What exactly does it mean to protect an individual’s privacy? Differential privacy is a mathematical definition of privacy that a data usage can be objectively proven to satisfy. Informally, if data is used in a differentially private manner, then no single individual’s data can have a significant influence on the output behavior.

For example, Apple uses differentially private data collection to gather information about its users to improve their experience, while also being able to give them a mathematical guarantee about how their information is used.
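
The most common textbook way to achieve such a guarantee is the Laplace mechanism, which answers counting queries with noise calibrated to one person’s maximum influence. Here is a minimal, illustrative sketch in Python (standard material, not code from the paper):

```python
import numpy as np

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query changes by at most 1 when one person's record
    # changes, so Laplace noise of scale 1/epsilon gives
    # epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 41, 52, 29, 63]  # hypothetical data
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```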

Differential privacy is a nice notion, but there is a question: is it always possible to use data in a differentially private manner? “Our paper is an impossibility result,” said Lucas Kowalczyk, a PhD student at Columbia University. “We exhibit a setting, an example of data and desired output behavior, for which we prove that differential privacy is not possible to achieve. The specific setting is constructed using tools from cryptography.”

Proofs of Work From Worst-Case Assumptions
Authors: Marshall Ball (Columbia University), Alon Rosen (IDC Herzliya), Manuel Sabin (UC Berkeley), Prashant Nalini Vasudevan (MIT)

A long-standing goal in the area is constructing NP-hard cryptography. This would involve proving a “win-win” theorem of the following form: either some cryptographic scheme is secure, or the satisfiability problem can be solved efficiently. The advantage of such a theorem is that if the scheme is broken, it yields efficient algorithms for a huge number of interesting problems.

Unfortunately, there hasn’t been much progress beyond uncovering certain technical barriers. In this paper the researchers prove win-wins of the following form: either the cryptographic scheme is secure, or there are faster (worst-case) algorithms than are currently known for satisfiability and other problems (violating certain conjectures in complexity theory).

“In particular, the cryptography we construct is something called a Proof of Work,” said Maynard Marshall Ball, a fourth-year PhD student. “Roughly, a Proof of Work can be seen as a collection of puzzles such that a random one requires a certain amount of computational ‘work’ to solve, while solutions can be checked with much less work.”

These objects appear to be inherent to cryptocurrencies, but perhaps a simpler and older application comes from spam prevention. To prevent spam, an email server can simply require a proof of work in order to receive an email. In this way, it is easy to solve a single puzzle and send a single email, but solving many puzzles to send many emails is infeasible.
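
A minimal hash-based puzzle of this kind, in the style of Hashcash rather than the paper’s worst-case-assumption construction, can be sketched as follows:

```python
import hashlib
from itertools import count

DIFFICULTY = 20  # required number of leading zero bits in the hash

def solve(challenge: bytes) -> int:
    # Work: grind through nonces until the hash has enough leading zeros.
    for nonce in count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
            return nonce

def verify(challenge: bytes, nonce: int) -> bool:
    # Verification: a single hash evaluation.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0
```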

A Simple Obfuscation Scheme for Pattern-Matching with Wildcards
Authors: Allison Bishop (Columbia University and IEX), Lucas Kowalczyk (Columbia University), Tal Malkin (Columbia University), Valerio Pastro (Columbia University and Yale University), Mariana Raykova (Yale University), Kevin Shi (Columbia University)

Suppose a software patch is released which addresses a vulnerability triggered by some bad inputs. The question is how to publicly release the patch so that all users can apply it, without revealing which bad inputs trigger the vulnerability – to ensure that an attacker will not take advantage of the vulnerability on unpatched systems. 

This can be addressed by what cryptographers call “program obfuscation”: producing a program that anyone can use, but whose internal workings, and any secrets it uses, no one can figure out. Obfuscation has many important applications, but it is difficult to achieve, and in many cases has been shown to be impossible. For the cases where obfuscation is possible, existing solutions typically use complex and inefficient techniques and rely on cryptographic assumptions that are not well understood.

The researchers discovered a surprisingly simple and efficient construction that can obfuscate programs for pattern-matching with wildcards, motivated, for example, by the software patching application above. What this means is that a program can be released that checks, for any given input, whether it matches some specific pattern (e.g., the second bit is 1, the third bit is 0, the sixth bit is 0, etc.).
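
In the clear, the functionality being obfuscated is easy to state: a pattern over {0, 1, *}, where * matches either bit. A plain, unobfuscated version of the check might look like the sketch below; the paper’s contribution is a way to release an equivalent program that hides the pattern itself.

```python
def matches(pattern: str, x: str) -> bool:
    # pattern is over {'0', '1', '*'}; '*' is a wildcard matching either bit.
    return len(pattern) == len(x) and all(
        p == '*' or p == b for p, b in zip(pattern, x)
    )

# e.g., "second bit is 1, third bit is 0, sixth bit is 0":
pattern = "*10**0"
print(matches(pattern, "110100"))  # True
print(matches(pattern, "100100"))  # False (second bit is 0)
```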

“Crucially, our construction allows you to publicly release such a program, while keeping the actual pattern secret,” said Tal Malkin, a computer science professor. “So people can use the program and the information is kept safe.”

How Can We Keep Genetic Data Safe?

In light of how easy it is to identify people based on their DNA, researchers suggest ways to protect genetic information.

Genetic information uploaded to a website is now used to help identify criminals. This technique, employed by law enforcement to solve the Golden State Killer case, took genetic material from the crime scene and compared it to publicly available genetic information on the third-party website GEDmatch.

Inspired by how the Golden State Killer was caught, researchers set out to see just how easy it is to identify individuals by searching databases and finding genetic matches through distant relatives. The paper, out today in Science, also proposes a way to protect genetic information.

“We want people to discover their genetic data,” said the paper’s lead author, Yaniv Erlich, a computer scientist at Columbia University and Chief Science Officer at MyHeritage, a genealogy and DNA testing company. “But we have to think about how to keep people safe and prevent issues.”

Commercially available genetic tests are increasingly popular, and users can opt to have their information used by genetic testing companies. Companies like 23andMe have used customers’ data for research to discover therapeutics and generate hypotheses for new medicines. People can also upload their genetic information to third-party websites, such as GEDmatch and DNA.Land, to find long-lost relatives.

In these scenarios the data is used for good, but what about the opposite? The situation can easily be flipped, which could prove harmful for those who work in covert operations (i.e., spies) and need their identities to remain secret.

Erlich shared that it takes roughly a day and a half to sift through a dataset of 1.28 million individuals and identify a third cousin; this works especially well for people of European descent in the United States. Then, based on sex, age, and area of residence, it is easy to narrow the list down to about 40 individuals. At that point, the information can be used as an investigative lead.

To alleviate the situation and protect people, the researchers propose that raw data be encrypted, so that only those with the right key can view and use it.
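
As a rough illustration of that idea (not the paper’s protocol, and leaving aside key management and access policy), here is a sketch using the symmetric encryption provided by the widely used Python cryptography package:

```python
from cryptography.fernet import Fernet

# The key would be held by the user or an authorized service,
# not by the website storing the data.
key = Fernet.generate_key()
f = Fernet(key)

raw_genotype = b"rs4680 AG\nrs1805007 CC"  # hypothetical raw data
token = f.encrypt(raw_genotype)            # what gets uploaded/stored

# Only a party holding the key can recover the data.
print(f.decrypt(token) == raw_genotype)  # True
```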

“Things are complicated but with the right strategy and policy we can mitigate the risks,” said Erlich.

New algorithms developed by CS researchers at FOCS 2018

Four papers were accepted to the Foundations of Computer Science (FOCS) symposium. CS researchers worked alongside colleagues from various organizations to develop the algorithms.

Learning Sums of Independent Random Variables with Sparse Collective Support

Authors: Anindya De (Northwestern University), Philip M. Long (Google), Rocco Servedio (Columbia University)

The paper is about a new algorithm for learning an unknown probability distribution given draws from the distribution.

A simple example of the problem that the paper considers can be illustrated with a penny tossing scenario: Suppose you have a huge jar of pennies, each of which may have a different probability of coming up heads. If you toss all the pennies in the jar, you’ll get some number of heads; how many times do you need to toss all the pennies before you can build a highly accurate model of how many pennies are likely to come up heads each time? Previous work answered this question, giving a highly efficient algorithm to learn this kind of probability distribution.

The current work studies a more difficult scenario, where there can be several different kinds of coins in the jar — for example, it may contain pennies, nickels and quarters. Each time you toss all the coins, you are told the total *value* of the coins that came up heads (but not how many of each type of coin came up heads).
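
A small simulation makes the learning problem concrete. The sketch below, with a hypothetical jar, only generates draws from such a distribution; the paper’s contribution is an algorithm (and matching hardness result) for learning the distribution from draws like these.

```python
import random

# Hypothetical jar: (coin value, probability of heads) for each coin.
jar = [(1, 0.3)] * 50 + [(5, 0.6)] * 20 + [(25, 0.1)] * 10

def draw(jar) -> int:
    # One observation: the total value of the coins that came up heads.
    return sum(value for value, p in jar if random.random() < p)

samples = [draw(jar) for _ in range(10_000)]
# The learning task: approximate the distribution of draw(jar) from
# `samples` alone, without knowing the individual coin biases.
```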

“Something that surprised me is that when we go from only one kind of coin to two kinds of coins, the problem doesn’t get any harder,” said Rocco Servedio, a researcher from Columbia University. “There are algorithms which are basically just as efficient to solve the two-coin problem as the one-coin problem. But we proved that when we go from two to three kinds of coins, the problem provably gets much harder to solve.”

Hölder Homeomorphisms and Approximate Nearest Neighbors

Authors: Alexandr Andoni (Columbia University), Assaf Naor (Princeton University), Aleksandar Nikolov (University of Toronto), Ilya Razenshteyn (Microsoft Research Redmond), Erik Waingarten (Columbia University)

This paper gives new algorithms for the approximate near neighbor (ANN) search problem in general normed spaces. The problem is a classic one in computational geometry and a standard way to model “similarity search.”

For example, Spotify may need to preprocess its dataset of songs so that new users can find the songs most similar to their favorites. While a lot of work has gone into designing good algorithms for ANN, these algorithms work only for specific metric spaces used to measure the distance between two points (such as Euclidean or Manhattan distance). This work is the first to give non-trivial algorithms for general normed spaces.
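
For intuition, here is the exact, brute-force version of the problem with the norm as a parameter (echoing the paper’s generality); the point of ANN data structures is to beat this linear scan, up to an approximation factor, after preprocessing. A sketch with hypothetical feature vectors:

```python
import numpy as np

def nearest_neighbor(dataset: np.ndarray, query: np.ndarray, ord=2) -> int:
    # Exact linear scan under a p-norm; ANN data structures aim to
    # answer (approximate) queries far faster after preprocessing.
    distances = np.linalg.norm(dataset - query, ord=ord, axis=1)
    return int(np.argmin(distances))

songs = np.random.rand(1000, 64)   # hypothetical song feature vectors
favorite = np.random.rand(64)
print(nearest_neighbor(songs, favorite, ord=1))  # Manhattan distance
```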

“One thing which surprised me was that even though the embedding appears weaker than established theorems that are commonly used, such as John’s theorem, the embedding is efficiently computable and gives more control over certain parameters,” said Erik Waingarten, an algorithms and computational complexity PhD student.

Parallel Graph Connectivity in Log Diameter Rounds

Authors: Alexandr Andoni (Columbia University), Zhao Song (Harvard University & UT-Austin), Clifford Stein (Columbia University), Zhengyu Wang (Harvard University), Peilin Zhong (Columbia University)

This paper is about designing fast algorithms for graph problems in parallel systems such as MapReduce. These systems have been widely successful in practice and invite the design of new algorithms tailored to them. While many classic “parallel algorithms” (PRAM algorithms) were designed in the 1980s and 1990s, they predate the modern massive computing clusters, had a different notion of parallelism in mind, and hence do not take full advantage of the new systems.

The typical example is the problem of checking connectivity in a graph: given N nodes together with some connecting edges (e.g., a friendship graph or road network), check whether there is a path between two given nodes. The classic PRAM algorithm solves this in “parallel time” that is logarithmic in N. While this is already much better than the sequential time of N (or more), the researchers considered the question of whether one can do even better in the new parallel systems à la MapReduce.

While obtaining a much better runtime in general seems out of reach at the moment (some conjecture it to be impossible), the researchers realized that they can obtain faster algorithms when the graph has a small diameter, for example, when any two connected nodes are linked by a path of at most 10 hops. Their algorithm obtains a parallel time that is logarithmic in the diameter. (Note that the diameter of a graph is often much smaller than N.)
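
The classic starting point for such algorithms can be sketched sequentially: in every round, each node adopts the smallest label in its neighborhood, which takes a number of rounds proportional to the diameter. The paper’s MapReduce-style algorithm is considerably more involved, compressing this to roughly logarithmic in the diameter. An illustrative sketch of the baseline:

```python
def connected_components(n: int, edges: list) -> list:
    # Each node starts with its own label; in every round, endpoints of
    # each edge adopt the smaller of their two labels. This simple
    # scheme needs about `diameter` rounds; the paper's parallel
    # algorithm needs only about log(diameter) rounds.
    label = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
# [0, 0, 0, 3, 3, 5]: nodes 0-2 connected, 3-4 connected, 5 isolated
```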

“Checking connectivity may be a basic problem, but its resolution is fundamental to understanding many other problems of graphs, such as shortest paths, clustering, and others,” said Alexandr Andoni, one of the authors. “We are currently exploring these extensions.”

Non-Malleable Codes for Small-Depth Circuits

Authors: Marshall Ball (Columbia University), Dana Dachman-Soled (University of Maryland), Siyao Guo (Northeastern University), Tal Malkin (Columbia University), Li-Yang Tan (Stanford University)

With this paper, the researchers constructed new non-malleable codes that improve efficiency over what was previously known.

“With non-malleable codes, any attempt to tamper with the encoding will either do nothing, or replace the information with something completely independent,” said Maynard Marshall Ball, a fourth-year PhD student. “That said, non-malleable codes cannot exist for arbitrary attackers.”
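
To see what this rules out, consider how malleable a naive encoding is: with a one-time pad, flipping a bit of the codeword flips the corresponding bit of the message, a highly correlated change. Non-malleable codes are designed so that tampering within the allowed attacker class (here, small-depth circuits) cannot produce such controlled modifications. A toy illustration of the problem (not from the paper):

```python
import os

def otp(message: bytes, pad: bytes) -> bytes:
    # XOR one-time pad; the same function encodes and decodes.
    return bytes(m ^ p for m, p in zip(message, pad))

pad = os.urandom(5)
codeword = otp(b"10000", pad)

# Tampering: flip the low bit of the first byte of the codeword...
tampered = bytes([codeword[0] ^ 1]) + codeword[1:]

# ...and the decoded message changes in a predictable, related way.
print(otp(tampered, pad))  # b'00000': the first character flipped
```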

The constructions were derived via a novel application of a powerful circuit lower bound technique (pseudorandom switching lemmas) to non-malleability. While non-malleability against circuit classes implies strong lower bounds for those same classes, it is not clear that the converse is true in general. This work shows that certain techniques for proving strong circuit lower bounds are indeed strong enough to yield non-malleability.