Four Papers on Computer Science Theory Accepted to STOC 2019

The Symposium on Theory of Computing (STOC) covers research within theoretical computer science, such as algorithms and computation theory. This year, four papers from CS researchers and collaborators from various institutions made it into the conference.

Local Decodability of the Burrows-Wheeler Transform
Sandip Sinha Columbia University and Omri Weinstein Columbia University

The researchers were interested in the problem of compressing texts with local context, like texts in which there is some correlation between nearby characters. For example, the letter ‘q’ is almost always followed by ‘u’ in an English text.

It is a reasonable goal to design compression schemes that exploit local context to reduce the length of the string considerably. Indeed, the FM-Index and other such schemes, based on a transformation called the Burrows-Wheeler transform followed by Move-to-Front encoding, have been widely used in practice to compress DNA sequences etc. “I think it’s interesting that compression schemes have been known for nearly 20 years in the pattern-matching and bioinformatics community but there has not been satisfactory theoretical guarantees of the compression achieved by these algorithms,” said Sandip Sinha, a PhD student in the Theory Group.

Moreover, these schemes are inherently non-local – in order to extract a character or a short substring at a particular position of the original text, one needs to decode the entire string, which requires time proportional to the length of the original string. This is prohibitive in many applications. The team designed a data structure which matches almost exactly the space bound of such compression schemes, while also supporting highly efficient local decoding queries (alluded to above), as well as certain pattern-matching queries. In particular, they were able to design a succinct “locally-decodable” Move-to-Front (MTF) code, that reduces the decoding time per character (in the MTF encoding) from n to around log(n), where n is the length of the string. Shared Sinha, “We also show a lower bound showing that for a wide class of strings, one cannot hope to do much better using any data structure based on the above transform.”

“Hopefully our paper draws wider attention of the theoretical CS community to similar problems in these fields,” said Sinha. To that end, they have made a conscious effort to make the paper accessible across research domains. “I also think there is no significant mathematical knowledge required to understand the paper, beyond some basic notions in information theory.”


Fooling Polytopes
Ryan O’Donnell Carnegie Mellon University, Rocco A. Servedio Columbia University, Li-Yang Tan Stanford University

The paper is about “getting rid of the randomness in random sampling”.  

Suppose you are given a complicated shape on a blackboard and you need to estimate what fraction of the blackboard’s area is covered by the shape.  One efficient way to estimate this fraction is by doing random sampling: throw darts randomly at the blackboard and count the fraction of the darts that land inside the shape. If you throw a reasonable number of darts, and they land uniformly at random inside the blackboard, the fraction of darts that land inside the shape will be a good estimate of the actual fraction of the blackboard’s area that is contained inside the shape. (This is analogous to surveying a small random sample of voters to try and predict who will win an election.)  

“This kind of random sampling approach is very powerful,” said Rocco Servedio, professor and chair of the computer science department. “In fact, there is a sense in which every randomized computation can be viewed as doing this sort of random sampling.”  

It is a fundamental goal in theoretical computer science to understand whether randomness is really necessary to carry out computations efficiently.  The point of this paper is to show that for an important class of high-dimensional estimation problems of the sort described above, it is actually possible to come up with the desired estimates efficiently without using any randomness at all.

In this specific paper, the “blackboard” is a high-dimensional Boolean hypercube and the “shape on the blackboard” is a subset of the hypercube defined by a system of high-dimensional linear inequalities (such a subset is also known as a polytope).  Previous work had tried to prove this result but could only handle certain specialized types of linear inequalities. By developing some new tools in high dimensional geometry and probability, in this paper the researchers were able to get rid of those limitations and handle all systems of linear inequalities.


Static Data Structure Lower Bounds Imply Rigidity
Zeev Dvir Princeton University, Alexander Golovnev Harvard University, Omri Weinstein Columbia University

The paper shows an interesting connection between the task of proving time-space lower bounds on data structure problems (with linear queries), and the long-standing open problem of constructing “stable” (rigid) matrices — a matrix M whose rank remains very high unless a lot of entries are modified. Constructing rigid matrices is one of the major open problems in theoretical computer science since the late 1970s, with far-reaching consequences on circuit complexity.

The result shows a real barrier for proving lower bounds on data structures: If one can exhibit any “hard” data structure problem with linear queries (the canonical example being Range Counting queries: given n points in d dimensions, report the number of points in a given rectangle), then this problem can be essentially used to construct “stable” (rigid) matrices.

“This is a rather surprising ‘threshold’ result, since in slightly weaker models of data structures (with small space usage), we do in fact have very strong lower bounds on the query time,” said Omri Weinstein, an assistant professor of computer science. “Perhaps surprisingly, our work shows that anything beyond that is out of reach with current techniques.”


Testing Unateness Nearly Optimally
Xi Chen Columbia University, Erik Waingarten Columbia University

The paper is about testing unateness of Boolean functions on the hypercube.

For this paper the researchers set out to design highly efficient algorithms which, by evaluating very few random inputs of a Boolean function, can “test” whether the function is unate (meaning that every variable is either non-increasing or non-decreasing or is pretty non-unate).

Referring to a previous paper the researchers set out to create an algorithm which is optimal (up to poly-logarithmic factors), giving a lower bound on the complexity of these testing algorithms.

An example of a Boolean function which is unate is a halfspace, i.e., for some values w1, …, wn, θ ∈ ℝ, the function f : {0,1}n → {0,1} is given by f(x) = 1 if ∑ wi xi≥ θ and 0 otherwise. Here, every variable i ∈ [n] is either non-decreasing, when wi ≥ 0, or non-increasing, when wi ≤ 0.

“One may hope that such an optimal algorithm could be non-adaptive, in the sense that all evaluations could be done at once,” said Erik Waingarten, an algorithms and computational complexity PhD student. “These algorithms tend to be easier to analyze and have the added benefit of being parallelize-able.”

However, the algorithm they developed is crucially adaptive, and a surprising thing is that non-adaptive algorithms could never achieve optimal complexity. A highlight of the paper is a new analysis of a very simple binary search procedure on the hypercube.

“This procedure is the ‘obvious’ thing one would do for these kinds of algorithms, but analyzing it has been very difficult because of its adaptive nature,” said Waingarten. “For us, this is the crucial component of the algorithm.”

Research That Studied Data From Social Media, Automatic Summarization, and Spatial Relations Accepted to NAACL 2019

The annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL) is the preeminent event in the field of natural language processing. CS researchers in professor Julia Hirschberg’s group won a Best Paper award for a novel resource, SpatialNet, which provides a formal representation of how a language expresses spatial relations. Other accepted papers are detailed below.

Linguistic Analysis of Schizophrenia in Reddit Posts
Jonathan Zomick Hofstra University, Sarah Ita Levitan Columbia University, Mark Serper Hofstra University Mount Sinai School of Medicine

The paper was presented at the Sixth Annual Workshop on Computational Linguistics and Clinical Psychology, at NAACL.

The researchers identified and analyzed unique linguistic characteristics of Reddit posts written by users who claim to have received a diagnosis for schizophrenia. The findings were interpreted in the context of established schizophrenia symptoms and compared with results from previous research that has looked at schizophrenia and language on social media platforms.  

The results showed several differences in language usage between users with schizophrenia and a control group. For example, people with schizophrenia used less punctuation in their Reddit posts. Disorganized language use is a prominent and common symptom of schizophrenia.

A machine learning classifier was trained to automatically identify self-identified users with schizophrenia on Reddit, using linguistic cues.  

“We hope that this work contributes toward the ultimate goal of identifying high risk individuals,” said Sara Ita Levitan, a postdoctoral research scientist with the Spoken Language Processing Group. “Early intervention and diagnosis is important to improve overall treatment outcomes for schizophrenia.”


Fixed That for You: Generating Contrastive Claims with Semantic Edits
Christopher Hidey Columbia University and Kathy McKeown Columbia University

For many people, social media is a primary source of information and it can become a key venue for opinionated discussion. In order to evaluate and analyze these discussions, it is important to understand contrast or a difference in opinions.

As a step towards a better understanding of arguments, the researchers developed a method to automatically generate responses to internet comments containing differences in stance. They created a corpus from over one million contrastive claims mined from the social media site Reddit. In order to obtain training data for the models, they extracted pairs of comments containing the acronym FTFY (“fixed that for you”).  

For example, in a discussion over who should be the next President of the United States, one participant might state “Bernie Sanders for president” and another might state “Hillary Clinton for president. FTFY”

A neural network model was trained on the pairs to edit the original claim and produce a new claim with a different view. 

Claim : Bernie Sanders for president
New claim : Hillary Clinton for president.

“One aspect of this problem that was surprising was that the standard ‘sequence-to-sequence with attention’ baseline performed poorly, often just copying the output or selecting generic responses,” said Christopher Hidey, a fourth year PhD student. While generic response generation is a known problem in neural models, their custom model significantly outperformed this baseline in several metrics including novelty and overlap with human-generated responses.


A Robust Abstractive System for Cross-Lingual Summarization
Jessica Ouyang Columbia University, Boya Song Columbia University, and Kathy McKeown Columbia University

The researchers developed an automatic summarization system that specializes in producing English summaries for documents originally written in three low-resource languages – Somali, Swahili, and Tagalog.

There is little natural language processing work done in low-resource languages and machine translation systems for those languages are of lower quality than those for high-resource languages like French or German.

As a result, the translations are often disfluent and contain errors that make them difficult for a human to understand, much less for a summarization system to process.

An example of machine-translated document
originally written in Swahili :

Mange Kimambi ‘I pray for the parliamentary seat for Kinondoni
constituency for ticket of CCM. Not special seats’ Kinondoni
without drugs is possible I pray for the parliamentary seat for
Kinondoni constituency on the ticket of CCM. Yes, it’s not a
special seats, Khuini Kinondoni, what will I do for Kinondoni?
Tension is many I get but we must remember no good that is
available easily. Kinondoni without drugs is possible. As a friend,
fan or patriotism I urge you to grant your contribution to the
situation and propert. You can use Western Union or money to go
to Mange John Kimambi. Account of CRDB Bank is on blog. Reduce
my profile in my blog understand why I have decided to vie for
Kinondoni constituency. you will understand more.

A standard summarization system’s
output on the document :

Mange Kimambi, who pray for parliamentary seat for Kinondoni
constituency for ticket of CCM, is on blog, and not special seats’
Kinondoni without drugs.

The robust summarization system’s output
on the document :

Mange Kimambi, who pray for parliamentary seat for Kinondoni
constituency for ticket of CCM, comments on his plans to vie for
‘Kinondoni’ without drugs.

“We addressed this challenge by creating large collections of synthetic, errorful “translations” that mimic the output of low-quality machine translations,” said Jessica Ouyang, a seventh year PhD student. They paired the problematic text with high-quality, human-written summaries. The experiment showed that a neural network summarizer trained on this synthetic data was able to correct or elide translation errors and produce fluent English summaries. The error-correcting ability of the system extends to Arabic, a new language previously unseen by the system.


IMHO Fine-Tuning Improves Claim Detection
Tuhin Chakrabarty Columbia University, Christopher Hidey Columbia University, and Kathy McKeown Columbia University

Argument mining, or argumentation mining, is a research area within the natural language processing field. Argument mining is applied in many different genres including the qualitative assessment of social media content (e.g. Twitter, Facebook) – where it provides a powerful tool for policy-makers and researchers in social and political sciences – legal documents, product reviews, scientific articles, online debates, newspaper articles, and dialogical domains. One of the main tasks of argument mining is to detect a claim.

Sentences from each dataset and their nearest neighbor in the IMHO dataset

Claims are the central component of an argument. Detecting claims across different domains or data sets can often be challenging due to their varying conceptualization. The researchers set out to alleviate this problem by fine-tuning a language model. They created a corpus mined from Reddit that is composed of 5.5 million opinionated claims. These claims are self-labeled by their authors using the internet acronyms IMO/IMHO or “In My Humble Opinion”.

By fine-tuning the language on the IMHO dataset they were able to obtain a significant improvement on claim detection of the datasets. As these data sets include diverse domains such as social media and student essays, this improvement demonstrates the robustness of fine-tuning on this novel corpus.


The Answer is Language Model Fine-tuning
Tuhin Chakrabarty Columbia University and Smaranda Muresan Columbia University

Community Question Answering forums such as Yahoo! Answers and Quora are popular nowadays, as they represent effective means for communities to share information around particular topics. But the information often shared on these forums may be incorrect or misleading.

The paper presents the ColumbiaNLP submission for the SemEval-2019 Task 8: Fact-Checking in Community Question Answering Forums. The researchers show how fine-tuning a language model on a large unannotated corpus of old threads from the Qatar Living forum helps to classify question types (factual, opinion, socializing) and to judge the factuality of answers on the shared task labeled data from the same forum. Their system finished 4th and 2nd on Subtask A (question type classification) and B (answer factuality prediction), respectively, based on the official metric of accuracy.

Question classification
Factual : The question is asking for factual information,
which can be answered by checking various information
sources, and it is not ambiguous.
e.g. “What is Ooredoo customer service number?”
Opinion : The question asks for an opinion or advice,
not for a fact.
e.g. “Can anyone recommend a good Vet in Doha?”
Socializing : Not a real question, but intended for
socializing or for chatting. This can also mean expressing
an opinion or sharing some information, without really
asking anything of general interest.
e.g. “What was your first car?”

Answer classification
Factual – TRUE : The answer is True and can be proven
with an external resource.
Q : “I wanted to know if there were any specific shots and
vaccinations I should get before coming over [to Doha].”
A : “Yes there are; though it varies depending on which
country you come from. In the UK; the doctor has a list of
all countries and the vaccinations needed for each.”
Factual – FALSE : The answer gives a factual response, but
it is False, it is partially false or the responder is unsure about
Q : “Can I bring my pitbulls to Qatar?”
A : “Yes, you can bring it but be careful this kind of dog is
very dangerous.”
Non-Factual : When the answer does not provide factual
information to the question; it can be an opinion or an advice
that cannot be verified
e.g. “It’s better to buy a new one.”

“We show that fine-tuning a language model on a large unsupervised corpus from the same community forum helps us achieve better accuracy for question classification,” said Tuhin Chakrabarty, lead researcher of the paper. Most community question-answering forums have such unlabeled data, which can be used in the absence of large labeled training data.

For answer classification they show how to leverage information from previously answered questions on the thread through language model fine-tuning. Their experiments also show that modeling an answer individually is not the best idea for fact-verification and results are improved when considering the context of the question.

“Determining factuality of answers requires modeling of world knowledge or external evidence – the questions asked are often very noisy and require reformulation,” shared Chakrabarty. “As a future step we would want to incorporate external evidence from the internet in the factual answer classification problem.”


Identifying Therapist Conversational Actions Across Diverse Psychotherapeutic Approaches
Fei-Tzin Lee Columbia University, Derrick Hull Talkspace, Jacob Levine Talkspace, Bonnie Ray Talkspace and Kathleen McKeown Columbia University

The paper studied dialogue act classification in therapy transcripts. Dialogue act classification is a task in which the researchers attempt to determine the intention of the speaker at each point in a dialogue, classifying it into one of a fixed number of possible types. This provides a layer of abstraction away from what the speaker is literally saying, giving a higher-level view of the conversation. Ultimately, they hope this work can help analyze the dynamics of text-based therapy on a large scale.
 
Transcripts of therapy sessions were examined, focusing on the speech of the therapist, using a classification scheme developed for this purpose. On a sentence-by-sentence basis, they found which of the labels best matches the conversational “action” the sentence takes.
 
For example, if a therapist makes the statement, “It almost feels like if you could do something, anything would be better than…” This would be classified into the Reflection category, as it is rephrasing or restating the experience the client just described, but in a way that makes what they are feeling more explicit.
 
“One of the interesting results from this research came when we analyzed the performance of our best classifier across different styles of therapy,” said Fei-Tzin Lee, a third year PhD student. Certain styles were markedly easier to classify than others; this was not simply a case where the classifier performed better on therapeutic styles for which there was more data.
 
Generally, it seemed that therapy styles involving more complex sentence structure were more difficult to classify, although to fully understand the differences between styles further work would be necessary. Continued Lee, “Regardless of the reason, it was interesting to note that there are marked differences that are quantitatively measurable between different styles.”