# Events

Nov 30

#### NLP Seminar - Weiyan Shi

10:00 AM to 11:00 AM

CS Conference Room

Weiyan Shi

Contact Emily Allaway or Fei-Tzin Lee for Zoom link and talk details.

Dec 01

#### Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering

1:00 PM to 2:00 PM

CSB 453

William Kuszmaul, MIT

Seminars are currently held in hybrid mode: we meet in person in CSB 453 and virtually over Zoom. The Zoom link and password are sent via email to the theoryread mailing list; general information about the list, including how to subscribe to it, can be found here.

Abstract:
The linear-probing hash table is one of the oldest and most widely used data structures in computer science. However, linear probing also famously comes with a major drawback: as soon as the hash table reaches a high memory utilization, elements within the hash table begin to cluster together, causing insertions to become slow. This phenomenon, now known as "primary clustering", was first captured by Donald Knuth in 1963; at a load factor of $1 - 1/x$, the expected time per insertion becomes $\Theta(x^2)$, rather than the more desirable $\Theta(x)$.

We show that there is more to the story than the classic analysis would seem to suggest. It turns out that small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions. If these design decisions are made correctly, then even a hash table that is continuously at a load factor $1 - \Theta(1/x)$ can achieve average insertion time $\tilde{O}(x)$. A key insight is that the tombstones left behind by deletions cause a surprisingly strong "anti-clustering" effect, and that when insertions and deletions are one-for-one, the anti-clustering effects of deletions actually overpower the clustering effects of insertions.
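The tombstone mechanism the abstract refers to can be sketched in a few lines. The class below is a generic, hypothetical implementation of linear probing with tombstone deletions (names and structure are my own, not the paper's); note how an insertion that passes a tombstone may reclaim it, which is the seed of the anti-clustering effect:

```python
# Minimal linear probing with tombstone deletions (illustrative sketch;
# the class and its names are my own, not code from the paper).

TOMBSTONE = object()  # sentinel marking a deleted slot
EMPTY = None

class LinearProbingTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [EMPTY] * capacity

    def _home(self, key):
        return hash(key) % self.capacity

    def insert(self, key):
        i = self._home(key)
        first_tombstone = -1
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is EMPTY:
                # Reclaim a tombstone passed along the way, if any:
                # deletions thereby shorten later probe sequences,
                # the "anti-clustering" effect the abstract describes.
                target = first_tombstone if first_tombstone >= 0 else i
                self.slots[target] = key
                return
            if slot is TOMBSTONE:
                if first_tombstone < 0:
                    first_tombstone = i
            elif slot == key:
                return  # already present
            i = (i + 1) % self.capacity
        if first_tombstone >= 0:
            self.slots[first_tombstone] = key
        else:
            raise RuntimeError("table full")

    def delete(self, key):
        i = self._home(key)
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot == key:
                self.slots[i] = TOMBSTONE  # leave a tombstone, shift nothing
                return True
            i = (i + 1) % self.capacity
        return False

    def contains(self, key):
        i = self._home(key)
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot == key:
                return True
            i = (i + 1) % self.capacity
        return False
```

Deleting by tombstone rather than by shifting keeps probe sequences intact for lookups while leaving slots that later insertions can reuse.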

We also present a new variant of linear probing, which we call "graveyard hashing", that completely eliminates primary clustering on any sequence of operations. If, when an operation is performed, the current load factor is $1 - 1/x$ for some $x$, then the expected cost of the operation is $O(x)$. One corollary is that, in the external-memory model with a data block size of $B$, graveyard hashing offers the following remarkable guarantee: at any load factor $1-1/x$ satisfying $x = o(B)$, graveyard hashing achieves $1 + o(1)$ expected block transfers per operation. Past external-memory hash tables have only been able to offer a $1+o(1)$ guarantee when the block size $B$ is at least $\Omega(x^2)$.
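The abstract does not spell out the mechanism behind graveyard hashing, but one hedged reading of the name is a rebuild step that pre-seeds evenly spaced tombstones so that future insertions always find a reusable slot nearby. The function below is an illustrative guess at such a rebuild; the `spacing` parameter and the rebuild schedule are my own, not the paper's construction:

```python
# Hedged sketch of a "graveyard" rebuild: rehash all live keys into a
# fresh array that has tombstones pre-seeded at regular intervals, so
# later insertions always find a reusable slot nearby. The spacing and
# schedule are illustrative guesses, not the construction from the paper.

TOMBSTONE = "<tombstone>"  # placeholder sentinel

def rebuild_with_graveyard(keys, capacity, spacing):
    slots = [None] * capacity
    for j in range(0, capacity, spacing):
        slots[j] = TOMBSTONE  # seed evenly spaced tombstones
    for key in keys:
        i = hash(key) % capacity
        # During the rebuild, seeded tombstones count as occupied:
        # they are reserved for insertions that arrive later.
        while slots[i] is not None:
            i = (i + 1) % capacity
        slots[i] = key
    return slots
```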

Based on joint work with Michael A. Bender and Bradley C. Kuszmaul.
https://arxiv.org/abs/2107.01250
To appear in FOCS 2021.

Dec 08

#### Properly learning decision trees in almost polynomial time

1:00 PM to 2:00 PM

CSB 453

Guy Blanc, Stanford University

Seminars are currently held in hybrid mode: we meet in person in CSB 453 and virtually over Zoom. The Zoom link and password are sent via email to the theoryread mailing list; general information about the list, including how to subscribe to it, can be found here.

Abstract:
We give an $n^{O(\log\log n)}$-time membership query algorithm for properly and agnostically learning decision trees under the uniform distribution over $\{-1,1\}^n$. Even in the realizable setting, the previous fastest runtime was $n^{O(\log n)}$, a consequence of a classic algorithm of Ehrenfeucht and Haussler.

Our algorithm shares similarities with practical heuristics for learning decision trees, which we augment with additional ideas to circumvent known lower bounds against these heuristics. To analyze our algorithm, we prove a new structural result for decision trees that strengthens a theorem of O'Donnell, Saks, Schramm, and Servedio. While the OSSS theorem says that every decision tree has an influential variable, we show how every decision tree can be "pruned" so that every variable in the resulting tree is influential.
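To make the notion of an "influential variable" concrete: the influence of coordinate $i$ on $f$ is the probability, over a uniform $x \in \{-1,1\}^n$, that flipping $x_i$ changes $f(x)$. The brute-force computation below uses this standard definition for illustration; it is not code from the paper:

```python
# Brute-force variable influence for a Boolean function on {-1,1}^n:
# Inf_i(f) = Pr_x[f(x) != f(x with coordinate i flipped)], x uniform.
# Standard definition shown for illustration; not code from the paper.

from itertools import product

def influence(f, n, i):
    flips = 0
    for x in product([-1, 1], repeat=n):
        y = list(x)
        y[i] = -y[i]  # flip the i-th coordinate
        if f(x) != f(tuple(y)):
            flips += 1
    return flips / 2 ** n

# Majority of three: flipping any coordinate changes the output exactly
# when the other two coordinates disagree, so each has influence 1/2.
def maj3(x):
    return 1 if sum(x) > 0 else -1
```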

Joint work with Jane Lange, Mingda Qiao, and Li-Yang Tan.
https://arxiv.org/abs/2109.00637
To appear in FOCS 2021.