The Theory Group Wins Big at STOC 2026

CS researchers earned two Best Paper Awards and had 17 papers accepted to the 58th Annual ACM Symposium on Theory of Computing (STOC 2026).

Best Paper Award Winners
Boolean function monotonicity testing requires (almost) n^1/2 queries
Mark Chen, Xi Chen, Hao Cui, William Pires, Jonah Stockwell (Columbia University)

Abstract:
We show that for any constant c > 0, any (two-sided error) adaptive algorithm for testing monotonicity of Boolean functions must have query complexity Ω(n^{1/2 − c}). This improves the Ω(n^1/3) lower bound of Chen, Waingarten, and Xie (2017) and almost matches the Õ(√n) upper bound of Khot, Minzer, and Safra (2018).

Separating QMA from QCMA with a classical oracle
John Bostanci (Columbia University); Jonas Haferkamp (Saarland University); Chinmay Nirkhe (University of Washington); Mark Zhandry (NTT Research & Stanford University)

Abstract:
We construct a classical oracle proving that, in a relativized setting, the set of languages decidable by an efficient quantum verifier with a quantum witness (QMA) is strictly bigger than those decidable with access only to a classical witness (QCMA). The separating classical oracle we construct is for a decision problem we coin spectral Forrelation – the oracle describes two subsets of the boolean hypercube, and the computational task is to decide if there exists a quantum state whose standard basis measurement distribution is well supported on one subset while its Fourier basis measurement distribution is well supported on the other subset. This is equivalent to estimating the spectral norm of a “Forrelation” matrix between two sets that are accessible through membership queries.

Our lower bound derives from a simple observation that a query algorithm with a classical witness can be run multiple times to generate many samples from a distribution, while a quantum witness is a “use once” object. This observation allows us to reduce proving a QCMA lower bound to proving a sampling hardness result which does not simultaneously prove a QMA lower bound. To prove said sampling hardness result for QCMA, we observe that quantum access to the oracle can be compressed by expressing the problem in terms of bosons – a novel “second quantization” perspective on compressed oracle techniques, which may be of independent interest. Using this compressed perspective on the sampling problem, we prove the sampling hardness result, completing the proof.

Conference Papers
A Mysterious Connection Between Tolerant Junta Testing and Agnostically Learning Conjunctions
Xi Chen, Shyamal Patel, Rocco A. Servedio (Columbia University)

Abstract:
The main conceptual contribution of this paper is identifying a previously unnoticed connection between two central problems in computational learning theory and property testing: agnostically learning conjunctions and tolerantly testing juntas. Inspired by this connection, the main technical contribution is a pair of improved algorithms for these two problems.

In more detail:

We give a distribution-free algorithm for agnostically PAC learning conjunctions over {±1}ⁿ that runs in time 2^{Õ(n^1/3)}, for constant excess error ε. This improves on the fastest previously published algorithm, which runs in time 2^{Õ(n^1/2)} [KKMS08].
Building on the ideas in our agnostic conjunction learner and using significant additional technical ingredients, we give an adaptive tolerant testing algorithm for k-juntas that makes 2^{Õ(k^1/3)} queries, for constant “gap parameter” ε between the “near” and “far” cases. This improves on the best previous results, due to [ITW21, NP24], which make 2^Õ(√k) queries. Since there is a known 2^Ω̃(√k) lower bound for non-adaptive tolerant junta testers, our result shows that adaptive tolerant junta testing algorithms provably outperform non-adaptive ones.

Fourier Spectrum of Noisy Quantum Algorithms
Uma Girish (Columbia University)

Abstract:
Quantum computing promises exponential speedups for certain problems, yet fully universal quantum computers remain out of reach and near-term devices are inherently noisy. Motivated by this, we study noisy quantum algorithms and the landscape between BQP and BPP. We build on a powerful technique to differentiate quantum and classical algorithms called the level-ℓ Fourier growth (the sum of absolute values of Fourier coefficients of sets of size ℓ) and show that it can also be used to differentiate quantum algorithms based on the types of resources used. We show that noise acting on a quantum algorithm dampens its Fourier growth in ways intricately linked to the type of noise.

Concretely, we study noisy models of quantum computation where highly mixed states are prevalent, namely:

DQC_k algorithms, where k qubits are clean and the rest are maximally mixed.
1/2-BQP algorithms, where the initial state is maximally mixed, but the algorithm is given knowledge of the initial state at the end of the computation.

We establish upper bounds on the Fourier growth of DQC_k, 1/2-BQP, and BQP algorithms and leverage the differences between these bounds to derive oracle separations between these models. In particular, we show that 2-Forrelation and 3-Forrelation require N^Ω(1) queries in the DQC₁ and 1/2-BQP models, respectively. Our results are proved using a new matrix decomposition lemma that may be of independent interest.

Sparsifying Suprema of Gaussian Processes
Anindya De (University of Pennsylvania); Shivam Nadimpalli (MIT); Ryan O’Donnell (CMU); Rocco A. Servedio (Columbia University)

Abstract:
We give a dimension-independent sparsification result for suprema of centered Gaussian processes. Let T be any (possibly infinite) bounded set of vectors in Rⁿ, and let {X_t := t · g}_t∈T be the canonical Gaussian process on T, where g ∼ N(0, I_n). We show that there is an O_ε(1)-size subset S ⊂ T and a set of real values {c_s}_s∈S such that the random variable sup_s∈S{X_s + c_s} is an ε-approximator (in L¹) of the random variable sup_t∈T X_t. Notably, the size of the sparsifier S is completely independent of both |T| and the ambient dimension n.

We give two applications of this sparsification theorem:

A “Junta Theorem” for Norms: We show that given any norm ν(x) on Rⁿ, there is another norm ψ(x) depending only on the projection of x onto O_ε(1) directions, for which ψ(g) is a multiplicative (1 ± ε)-approximation of ν(g) with probability 1 − ε for g ∼ N(0, I_n).
Sparsification of Convex Sets: We show that any intersection of (possibly infinitely many) halfspaces in Rⁿ that are at distance r from the origin is ε-close (under N(0, I_n) ) to an intersection of only O_r,ε(1) halfspaces. This yields new polynomial-time agnostic learning and tolerant property testing algorithms for intersections of halfspaces.

Quantum circuit lower bounds in the magic hierarchy
Natalie Parham (Columbia University)

Abstract:
We introduce the magic hierarchy, a quantum circuit model that alternates between arbitrary-sized Clifford circuits and constant-depth circuits with two-qubit gates (QNC⁰). This model unifies existing circuit models, such as QAC⁰_f and models with adaptive intermediate measurements. Despite its generality, we are able to prove nontrivial lower bounds.

We prove new lower bounds in the first level of the hierarchy, showing that certain explicit quantum states cannot be approximately prepared by circuits consisting of a Clifford circuit followed by QNC⁰. These states include ground states of some topologically ordered Hamiltonians and nonstabilizer quantum codes. Our techniques exploit the rigid structure of stabilizer codes and introduce an infectiousness property: if even a single state in a high-distance code can be approximately prepared by one of these circuits, then the entire subspace must lie close to a perturbed stabilizer code. We also show that proving state preparation lower bounds beyond a certain level of the hierarchy would imply classical circuit lower bounds beyond the reach of current techniques in complexity theory.

More broadly, our techniques go beyond lightcone-based methods and highlight how the magic hierarchy provides a natural framework for connecting circuit complexity, condensed matter, and Hamiltonian complexity.

Finding Bugs in Short Proofs: The Metamathematics of Resolution Lower Bounds
Jiawei Li (University of Texas at Austin); Yuhao Li (Columbia University); Hanlin Ren (Institute for Advanced Study)

Abstract:
We study the refuter problems for proof complexity lower bounds. Suppose φ is a hard tautology that does not admit any length-s proof in some proof system P. In the corresponding refuter problem, we are given (query access to) a purported length-s proof π in P that claims to have proved φ, and our goal is to find an invalid derivation step within π. As suggested by witnessing theorems in bounded arithmetic, the computational complexity of these refuter problems is closely tied to the metamathematics of the underlying lower bounds.

We focus on refuter problems corresponding to lower bounds for resolution, which is arguably the single most studied system in proof complexity. As a warm-up, we show that many refuter problems for resolution width lower bounds are PLS-complete. To capture the complexity of refuter problems for resolution size lower bounds, we introduce a new class rwPHP(PLS) in decision-tree TFNP, which can be seen as a randomized version of PLS.

We show that the refuter problems for many resolution size lower bounds can be solved in
rwPHP(PLS), including the classic lower bound of Haken [TCS, 1985] for the pigeonhole principle. More generally, we identify a common proof technique that we call “random restriction + width lower bound”, and present strong evidence that resolution lower bounds proved by this technique typically have refuter problems in rwPHP(PLS).
We then show that the refuter problem for any resolution size lower bound is rwPHP(PLS)-hard, thereby demonstrating that the rwPHP(PLS) upper bound mentioned above is tight. Informally speaking, this means that “rwPHP(PLS)-reasoning” is necessary for proving all resolution size lower bounds.

Interpreted in bounded arithmetic, our results show that the theory T¹₂(α) + dwPHP(PV(α)) characterizes the “reasoning power” required to prove (the “easiest”) resolution size lower bounds. As a corollary, we obtain surprisingly efficient proofs of resolution lower bounds. In particular, we show that many resolution size lower bounds can be proved in low-width random resolution [Pudlák–Thapen, CCC’17].

Testing noisy low-degree polynomials for sparsity
Yiqiao Bao, Anindya De (University of Pennsylvania); Shivam Nadimpalli (MIT); Rocco A. Servedio (Columbia University); Nathan White (University of Pennsylvania)

Abstract:
We consider the problem of testing whether an unknown low-degree polynomial p over Rⁿ is sparse versus far from sparse, given access to noisy evaluations of the polynomial p at randomly chosen points. This is a natural property-testing version of various well-studied problems about learning low-degree sparse polynomials in the presence of noise, and is a generalization of the work of Chen, De, and Servedio [CDS20] on testing noisy linear functions for sparsity, to the more challenging setting of low-degree polynomials.

Our main result gives a precise characterization of when sparsity testing for low-degree polynomials can be carried out with constant sample complexity independent of dimension, along with a constant-sample algorithm for this problem in the parameter regime where this is possible.

In more detail, for any mean-zero variance-one finitely supported distribution X over the reals, any degree parameter d, and any sparsity parameters s and T ≥ s, we define a computable function Max-Sparsity-Gap_X,d(·), and:

For T ≥ Max-Sparsity-Gap_X,d(s), we give an
O_s,X,d(1)-sample algorithm for the problem of distinguishing whether a multilinear degree-d polynomial over Rⁿ is s-sparse versus
ε-far from T-sparse, given independent labeled examples
(x, p(x) + noise)_x∼X^⊗n.
Crucially, this sample complexity is completely independent of the ambient dimension n.
For T ≤ Max-Sparsity-Gap_X,d(s) − 1, we show that even in the absence of noise, any algorithm for distinguishing whether a multilinear degree-d polynomial is s-sparse versus ε-far from T-sparse, given independent labeled examples
(x, p(x))_x∼X^⊗n, must use Ω_X,d,s(log n)examples.

Our techniques employ a generalization of the results of Dinur, Friedgut, Kindler, and O’Donnell [DFKO07] on the Fourier tails of bounded functions over {±1}ⁿ to a broad range of finitely supported distributions, which may be of independent interest.

On the Computational Hardness of Transformers
Barna Saha, Yinzhan Xu, Christopher Ye (University of California, San Diego); Hantao Yu (Columbia University)

Abstract:
The transformer architecture has revolutionized modern AI across language, vision, and beyond. It consists of L layers of multi-head attention, where each layer runs H attention heads in parallel and feeds the combined output to the subsequent layer. In attention, each token within an input of length N is represented by an embedding vector of dimension m. Computationally, an attention mechanism primarily involves multiplying three N × m matrices, while applying a softmax operation to the intermediate product of the first two matrices.

A significant body of work has been devoted to analyzing the time complexity of attention, leading to several recent advances. On the other hand, known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout theoretical computer science under the guise of “direct sum” problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design.

Thus, a key challenge in understanding the computational hardness of transformers is to determine whether their computation can be performed more efficiently than LH independent evaluations of attention. In this paper, we resolve this question in the negative and give the first nontrivial computational lower bounds for multi-head multi-layer transformers.

Small embedding regime: When m = N^o(1), computing LH attention heads separately takes LHN^2+o(1) time. We establish that this is essentially optimal under the Strong Exponential Time Hypothesis (SETH).
Large embedding regime: When m = N, one can compute LH attention heads separately using LHN^ω+o(1) arithmetic operations (plus exponents), where ω is the matrix multiplication exponent. We establish that this is optimal by showing that LHN^ω−o(1)arithmetic operations are necessary when ω > 2.

Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.

Near-Optimal Directed Euclidean Spanners in High Dimensions
Rajesh Jayaram (Google Research, USA); Shyamal Patel, Cliff Stein (Columbia University); Erik Waingarten, Tian Zhang (University of Pennsylvania)

Abstract:
For any ε ∈ (0, 1), we give a randomized algorithm which, given n points in (R^d, &ell;_p) for p ∈ [1, 2], constructs a directed graph using Õ(n^2−Ω(ε)) edges in nearly-matching time, such that shortest path lengths approximate &ell;_p-distances up to a (1 + ε)-factor.

The graph uses non-metric Steiner nodes (known to be necessary) and improves upon the prior construction of Andoni and Zhang using Õ(n^2−Ω(ε²)) edges. We show that our construction is nearly optimal by proving that there exists a set of points in
R^d where any (1 + ε)-approximate directed Steiner spanner must use
Ω̃(n^2−O(ε)) edges.

As further applications, we show that our directed Steiner spanner gives faster algorithms for Wasserstein-q distances over (R^d, &ell;_p).

Approximate Orthogonal Vectors and Diameter via Regularity Lemma
Alexandr Andoni (Columbia University, USA); Shunhua Jiang (Hebrew University, Israel); Stepan Zharkov (Columbia University, USA)

Abstract:
We develop algorithms for the approximate Orthogonal Vectors (OV) and Diameter problems over the Hamming space. Prior work exhibited an intriguing sharp transition: for approximation factor c = 2, the algorithms are simple and run in Õ(nd) time; whereas already for c = 2 − δ, the best known approach has been to reduce the problems to nearest neighbor search, leading to solutions with runtimes of the form n^1+Ω(1).

Our algorithms solve (2 − δ)-approximate OV and Diameter with runtimes of n^1+O(δ) and n^1+O(√δ), respectively. The improvement also holds for the online (data structure) versions: online OV and Furthest Neighbor Search (FNS). This is the first direct improvement for approximate FNS in the Hamming space since [Goel, Indyk, Varadarajan 2001].

Our approach consists of two key steps:

First, we define a “heterogeneous” pseudo-random instance of the problems and prove a structural lemma showing that any such instance is solved by one of three simple algorithms.
Second, we develop a specialized regularity lemma that allows one to reduce any arbitrary dataset to such a pseudo-random instance.

Improved Bounds for Coin Flipping, Leader Election, and Random Selection
Eshan Chattopadhyay, Mohit Gurumukhani, Noam Ringach (Cornell University); Rocco A. Servedio (Columbia University)

Abstract:
Random selection is a fundamental task in fault-tolerant distributed computing where processors select a random outcome from some domain. Two special cases of this, leader election (where the processors designate a leader amongst themselves) and collective coin flipping (where the processors agree on a common random bit), have been especially widely studied.

We study these problems in the full-information model, where processors communicate via a single broadcast channel, have access to private randomness, and face a computationally unbounded adversary that controls some of the processors. Despite decades of study, key gaps remain in our understanding of the trade-offs between round complexity, communication per player in each round, and adversarial resilience.

We make progress by proving new lower bounds for coin flipping protocols and both new upper and lower bounds for leader election and random selection protocols. We first show that any k-round coin flipping protocol, where each of &ell; players sends 1 bit per round, can be biased by O(&ell; / log^(k)(&ell;)) bad players. We obtain the same lower bound (with an additional log^(k+1)(&ell;) factor in the numerator) for leader election as well.

This strengthens the previous best lower bounds [RSZ, SICOMP 2002], which ruled out coin flipping protocols resilient to O(&ell; / log^(2k−1)(&ell;)) bad players and leader election protocols resilient to O(&ell; / log^(2k+1)(&ell;)) bad players.

As a consequence, we establish that any protocol tolerating a linear fraction of corrupt players, while restricting player messages to 1 bit per round, must run for at least log^* &ell; − O(1) rounds, improving on the prior best lower bound of 1/2 log^* &ell; − log^*log^*&ell;.

We additionally show that the current best protocols that handle a linear number of corrupt players (from [RZ, JCSS 2001], [F, FOCS 1999]) are near optimal in terms of round complexity and communication per player in a round.

We next initiate the study of one-round random selection protocols where each player sends 1 bit in the round. For all m ≥ (log(&ell;))², we obtain an optimal one-round protocol: We construct a protocol that is resilient to O(&ell;/m) bad players, outputting m uniform random bits. We also show that any protocol that outputs m uniform random bits can be corrupted using O(&ell;/m) bad players.

As far as we are aware, this is the first provably optimal protocol for any task in the full-information model. As a consequence of our construction, we obtain a one-round leader election protocol resilient to &ell;/(log(&ell;))² bad players, improving on the previous best protocol from [RZ, JCSS 2001] that is resilient to only &ell;/(log(&ell;))³ bad players and requires players to send many bits.

When m = (log(&ell;))², our resilience parameter matches that of the best one-round coin flipping protocol by Ajtai and Linial, which only outputs one bit. To obtain our lower bound, we introduce and study multi-output influence, a natural extension of the notion of influence of Boolean functions to the multi-output setting.

An Optimal Algorithm for Stochastic Vertex Cover

Jan van den Brand (Georgia Tech); Inge Li Gørtz (DTU Compute); Chirag Pabbaraju (Stanford University); Debmalya Panigrahi (Duke University); Cliff Stein (Columbia University); Miltiadis Stouras, Ola Svensson (EPFL); Ali Vakilian (Virginia Tech)

Abstract:
The goal in the stochastic vertex cover problem is to obtain an approximately minimum vertex cover for a graph G^∗ that is realized by sampling each edge independently with some probability p ∈ (0, 1] in a base graph G = (V, E).

The algorithm is given the base graph G and the probability p as inputs, but its only access to the realized graph G^∗ is through queries on individual edges in G that reveal the existence (or not) of the queried edge in G^∗.

In this paper, we resolve the central open question for this problem: to find a (1 + ε)-approximate vertex cover using only O_ε(n/p) edge queries.

Prior to our work, there were two incomparable state-of-the-art results for this problem:

A (3/2 + ε)-approximation using O_ε(n/p) queries (Derakhshan, Durvasula, and Haghtalab, 2023).
A (1 + ε)-approximation using O_ε((n/p) · RS(n)) queries (Derakhshan, Saneian, and Xun, 2025), where RS(n) is known to be at least 2^{Ω(log n/log log n)} and could be as large as n/2^{Θ(log^* n)}.

Our improved upper bound of O_ε(n/p) matches the known lower bound of Ω(n/p) for any constant-factor approximation algorithm for this problem (Behnezhad, Blum, and Derakhshan, 2022).

A key tool in our result is a new concentration bound for the size of minimum vertex cover on random graphs, which might be of independent interest.

Learning Functions of Halfspaces
Josh Alman, Shyamal Patel, Rocco A. Servedio (Columbia University)

Abstract:
We give an algorithm that learns arbitrary Boolean functions of k arbitrary halfspaces over Rⁿ, in the challenging distribution-free Probably Approximately Correct (PAC) learning model, running in time 2^{√n · (log n)^O(k)}.

This is the first algorithm that can PAC learn even intersections of two halfspaces in time 2^o(n).

High Rate Efficient Local List Decoding from HDX
Yotam Dikstein (IAS); Max Hopkins (Princeton University); Toniann Pitassi (Columbia University); Russell Impagliazzo (University of California, San Diego)

Abstract:
We construct the first (locally computable, approximately) locally list decodable codes with rate, efficiency, and error tolerance approaching the information theoretic limit, a core regime of interest for the complexity-theoretic task of hardness amplification.

Our algorithms run in polylogarithmic time and sub-logarithmic depth, which together with classic constructions in the unique decoding (low-noise) regime leads to the resolution of several long-standing problems in coding and complexity theory:

Near-optimally input-preserving hardness amplification (and corresponding fast PRGs).
Constant-rate codes with log(N)-depth list decoding (RNC¹).
Complexity-preserving distance amplification.

Our codes are built on the powerful theory of (local-spectral) high-dimensional expanders (HDX).

At a technical level, we make two key contributions. First, we introduce a new framework for (polylog(N)-round) belief propagation on HDX that leverages a mix of local correction and global expansion to control error build-up while maintaining high rate.

Second, we introduce the notion of strongly explicit local routing on HDX: local algorithms that, given any two target vertices, output a random path between them in only polylogarithmic time (and, preferably, sub-logarithmic depth).

Constructing such schemes on certain coset HDX allows us to instantiate our otherwise combinatorial framework in polylogarithmic time and low depth, completing the result.

Improved Pseudorandom Codes from Permuted Puzzles
Miranda Christ (Columbia University); Noah Golowich (Microsoft Research); Sam Gunn (UC Berkeley); Ankur Moitra (Massachusetts Institute of Technology, USA); Daniel Wichs (Northeastern)

Abstract:
Watermarks are an essential tool for identifying AI-generated content. Recently, Christ and Gunn (CRYPTO ’24) introduced pseudorandom error-correcting codes (PRCs), which are equivalent to watermarks with strong robustness and quality guarantees.

A PRC is a pseudorandom encryption scheme whose decryption algorithm tolerates a high rate of errors. Pseudorandomness ensures quality preservation of the watermark, and error tolerance of decryption translates to the watermark’s ability to withstand modification of the content.

In the short time since the introduction of PRCs, several works (NeurIPS ’24, RANDOM ’25, STOC ’25) have proposed new constructions. Curiously, all of these constructions are vulnerable to quasipolynomial-time distinguishing attacks. Furthermore, all lack robustness to edits over a constant-sized alphabet, which is necessary for a meaningfully robust LLM watermark. Lastly, they lack robustness to adversaries who know the watermarking detection key.

Until now, it was not clear whether any of these properties was achievable individually, let alone together. We construct pseudorandom codes that achieve all of the above:

Plausible subexponential pseudorandomness security.
Robustness to worst-case edits over a binary alphabet.
Robustness against even computationally unbounded adversaries that have the detection key.

Pseudorandomness rests on a new assumption that we formalize, the permuted codes conjecture, which states that a distribution of permuted noisy codewords is pseudorandom.

We show that this conjecture is implied by the permuted puzzles conjecture used previously to construct doubly efficient private information retrieval. To give further evidence, we show that the conjecture holds against a broad class of simple distinguishers, including read-once branching programs.

Testing Distributions Against Bounded Distinguishers
Mark Bun, Rathin Desai (Boston University); Renato Ferreira Pinto Jr. (Columbia University)

Abstract:
Motivated by the challenge of testing distributions over continuous or high-dimensional domains, we study distribution testing with respect to bounded classes of distinguishers.

A representative task is to use samples from an unknown distribution P over a very large domain to decide between two cases:

P = P_ref for a fixed reference distribution P_ref.
There exists a distinguisher f in a bounded class F which witnesses the separation
|E_P[f] − E_{P_ref}[f]| > ε.

This is the task of identity testing with respect to fooling distance, a name inspired by the conceptual connection with pseudorandomness. (Formally, our model instantiates integral probability metrics from Boolean classes of bounded expressivity.)

We show that testing with respect to fooling distance is not only a natural computational problem that admits sample-efficient algorithms even in high-dimensional settings, but it also reveals and underlies connections between three seemingly unrelated areas of study:

Testable learning.
Verification of learning algorithms.
Testing of structured distributions (whose “A_k-testing” model our framework extends).

These connections yield new results for all of these models, including:

Testable proper learners using membership queries for halfspaces and decision trees.
A lower bound for testable PAC verification in terms of Rademacher complexity, and a distribution-free verification protocol for disjoint unions of k multidimensional rectangles.
Identity testers (with respect to total variation distance) for decision tree distributions and distributions with low-degree polynomial densities, over Boolean and continuous hypercube domains.

The Theory Group Wins Big at STOC 2026

Computer Science at Columbia University

Upcoming Events

In the News

Press Mentions