About Me
My broad interests lie at the intersection of computational linguistics
and machine learning. I am interested in developing methods for reasoning
about compositional structures, such as parse trees, using formalisms
such as probabilistic grammars.
Much of my research has relied on capturing the syntax of natural language using probabilistic grammars -- grammars which originate in linguistics and formal language theory, and which have been augmented with a
probabilistic interpretation. I have worked with a variety of such formalisms, including probabilistic context-free grammars (PCFGs), latent-variable PCFGs,
dependency grammars, adaptor grammars, tree substitution grammars, and shift-reduce grammars. A toy example of a probabilistic grammar is sketched below.
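For concreteness, here is a minimal sketch of what such a grammar looks like (the grammar and the check below are invented for illustration and do not come from any of the papers listed on this page): a PCFG attaches a probability to each rewrite rule, and the probabilities of the rules sharing a left-hand side sum to one.

# A toy PCFG, invented purely for illustration: each left-hand side maps to a
# list of (right-hand side, probability) pairs, and the probabilities for a
# given left-hand side form a distribution.
TOY_PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.7), (("NN",), 0.3)],
    "VP": [(("VB", "NP"), 0.6), (("VB",), 0.4)],
}

def is_properly_normalized(grammar, tol=1e-9):
    """Check that every left-hand side defines a probability distribution."""
    return all(abs(sum(p for _, p in rules) - 1.0) < tol
               for rules in grammar.values())

print(is_properly_normalized(TOY_PCFG))  # True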
I work with Michael Collins.
I received my doctoral degree from Carnegie Mellon University, where I worked with my advisor, Noah Smith. Before that, I completed a bachelor's degree (mathematics
and computer science) and a master's degree (computer science) at Tel Aviv
University. I am currently supported by an NSF/CRA CI Fellowship.
Here is a link to my dissertation, titled Computational Learning of Probabilistic Grammars in the Unsupervised Setting.
News:
- 7/26/13 - A blog post about our upcoming ACL paper: manually generated text.
- 7/25/13 - A video of a talk I gave at NAACL about spectral learning of latent-variable PCFGs is available here from techtalks.tv. The paper is available here.
Publications [show all abstracts]
- Spectral Learning of Refinement HMMs, Karl Stratos, Alexander M. Rush, Shay B. Cohen and Michael Collins, CoNLL 2013 (accepted) [pdf] [abstract] [bibtex]
We derive a spectral algorithm for learning the parameters of a refinement HMM.
This method is simple, efficient, and can be applied to a wide range of supervised
sequence labeling tasks. Like other spectral methods, it avoids the problem of local optima and provides a consistent estimate of the parameters. Our experiments
on a phoneme recognition task show that when equipped with informative feature functions, it performs significantly better than a supervised HMM and competitively
with EM.
@inproceedings{stratos-13,
author = "K. Stratos and A. M. Rush and S. B. Cohen and M. Collins",
title = "Spectral Learning of Refinement {HMMs}",
booktitle = "Proceedings of {CoNLL}",
year = "2013"
}
- The Effect of Non-tightness on Bayesian Estimation of PCFGs, Shay B. Cohen and Mark Johnson, ACL 2013 (accepted) [pdf] [mathematica output] [abstract] [bibtex] [blog post]
Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the ``probabilities'' of the trees the grammar generates can
be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We
begin by presenting the notion of ``almost everywhere tight grammars'' and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs
to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct
under one of them, and provide MCMC samplers for the other two. We conclude with a
discussion of the impact of tightness empirically.
@inproceedings{cohen-13c,
author = "S. B. Cohen and M. Johnson",
title = "The Effect of Non-tightness on Bayesian Estimation of {PCFGs}",
booktitle = "Proceedings of {ACL}",
year = "2013"
}
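To make the non-tightness property in the abstract above concrete, here is a toy numerical sketch of my own (the grammar, the code, and the numbers are not taken from the paper): for the PCFG with rules S -> S S (probability p) and S -> a (probability 1 - p), the total probability mass Z of finite trees is the least non-negative solution of Z = p*Z^2 + (1 - p), which equals 1 for p <= 1/2 and (1 - p)/p < 1 otherwise.

# Toy sketch (not from the paper): total probability mass of finite trees for
# the PCFG  S -> S S  (prob. p)  |  S -> a  (prob. 1 - p), computed as the
# least fixed point of  Z = p * Z**2 + (1 - p), starting from Z = 0.
def tree_mass(p, iterations=10000):
    z = 0.0
    for _ in range(iterations):
        z = p * z * z + (1.0 - p)
    return z

for p in (0.4, 0.6, 0.75):
    print(p, tree_mass(p))  # ~1.0 (tight), ~0.667 and ~0.333 (non-tight)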
- Experiments with Spectral Learning of Latent-Variable PCFGs, Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster and Lyle Ungar, In NAACL 2013 [pdf] [abstract] [bibtex] [talk]
Latent-variable PCFGs (L-PCFGs) are a highly successful model for
natural language parsing. Recent work (Cohen et al., 2012) has introduced a
spectral algorithm for parameter estimation of L-PCFGs, which---unlike
the EM algorithm---is guaranteed to give consistent parameter estimates
(it has PAC-style guarantees of sample complexity). This paper
describes experiments using the spectral algorithm. We show that the algorithm
provides models with the same accuracy as EM, but is an order of
magnitude more efficient.
We describe a number of key steps used to obtain this
level of performance; these should be relevant to other work on the
application of spectral learning algorithms. We view our results as strong
empirical evidence for the viability of spectral methods as an
alternative to EM.
@inproceedings{cohen-13b,
author = "S. B. Cohen and K. Stratos and M. Collins and D. P. Foster and L. Ungar",
title = "Experiments with Spectral Learning of Latent-Variable {PCFGs}",
booktitle = "Proceedings of {NAACL}",
year = "2013"
}
- Approximate PCFG Parsing Using Tensor Decomposition, Shay B. Cohen, Giorgio Satta and Michael Collins, In NAACL 2013 (accepted) [pdf] [abstract] [bibtex]
We provide an approximation algorithm for PCFG parsing, which asymptotically improves time complexity with respect to the input grammar size, and prove upper bounds on the approximation quality.
We test our algorithm on two treebanks, and get significant improvements in parsing speed.
@inproceedings{cohen-13a,
author = "S. B. Cohen and G. Satta and M. Collins",
title = "Approximate PCFG Parsing Using Tensor Decomposition",
booktitle = "Proceedings of {NAACL}",
year = "2013"
}
- Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs, Shay B. Cohen and Michael Collins, In NIPS 2012 [pdf] [abstract] [bibtex]
We describe an approach to speed-up inference with latent-variable PCFGs, which have been shown to
be highly effective for natural language parsing. Our approach is based on a tensor formulation recently
introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition
algorithm well-known in the multilinear algebra literature.
We also describe an error bound for this approximation, which gives guarantees showing
that if the underlying tensors are well approximated, then the probability distribution
over trees will also be well approximated.
Empirical evaluation on real-world natural
language parsing data demonstrates a significant speed-up at minimal cost for parsing performance.
@inproceedings{cohen-12c,
author = "S. B. Cohen and M. Collins",
title = "Tensor Decomposition for Fast Parsing with Latent-Variable {PCFGs}",
booktitle = "Proceedings of NIPS",
year = "2012"
}
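The speed-ups in the two tensor papers above rest on a basic linear-algebra observation, which the following sketch illustrates in my own notation (random factor matrices stand in for actual grammar parameters; nothing here is the papers' algorithm): if a tensor has a rank-r decomposition, contracting it with two vectors costs O(r*m) rather than O(m^3).

import numpy as np

# Sketch only: random factors stand in for (approximations of) rule tensors.
m, r = 32, 8
U, V, W = (np.random.randn(m, r) for _ in range(3))

# Dense tensor with entries T[i, j, l] = sum_k U[i, k] * V[j, k] * W[l, k].
T = np.einsum("ik,jk,lk->ijl", U, V, W)

b, c = np.random.randn(m), np.random.randn(m)

dense_result   = np.einsum("ijl,j,l->i", T, b, c)  # O(m^3) contraction
lowrank_result = U @ ((V.T @ b) * (W.T @ c))       # O(r * m) contraction

print(np.allclose(dense_result, lowrank_result))   # True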
- Elimination of Spurious Ambiguity in Transition-Based Dependency Parsing, Shay B. Cohen, Carlos Gómez-Rodríguez and Giorgio Satta, In arXiv (1206.6735), 2012 [pdf] [abstract] [bibtex]
We present a novel technique to remove spurious ambiguity from transition systems for dependency parsing. Our technique chooses a canonical sequence of transition operations (computation) for a given dependency tree. Our technique can be applied to a large class of bottom-up transition systems, including for instance Nivre (2004) and Attardi (2006).
@techreport{cohen-12b,
author = "S. B. Cohen and C. G{\'o}mez-Rodr{\'\i}guez and G. Satta",
title = "Elimination of Spurious Ambiguity in Transition-Based Dependency Parsing",
year = "2012",
eprint = "arXiv:1206.6735",
url = "http://arxiv.org/pdf/1206.6735v1"
}
- Spectral Learning of Latent-Variable PCFGs, Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster and Lyle Ungar, In ACL 2012 [pdf] [longer version, stronger model] [abstract] [bibtex]
We introduce a spectral learning algorithm for latent-variable PCFGs
(Petrov et al., 2006; Matsuzaki et al., 2005). Under a separability (singular
value) condition, we prove that the method provides consistent parameter
estimates. Our result rests on three theorems: the first gives
a tensor form of the inside-outside algorithm for PCFGs; the second
shows that the required tensors can be estimated directly from
training examples where hidden-variable values are missing; the third
gives a PAC-style convergence bound for the estimation method.
@inproceedings{cohen-12a,
author = "S. B. Cohen and K. Stratos and M. Collins and D. P. Foster and L. Ungar",
title = "Spectral Learning of Latent-Variable {PCFGs}",
booktitle = "Proceedings of ACL",
year = "2012"
}
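For readers curious about the "tensor form of the inside-outside algorithm" mentioned above, here is a hedged sketch in my own simplified notation (it omits root and emission terms and is not the paper's exact formulation): with m latent states per nonterminal, the inside quantities for a span can be collected into a vector, and each binary-rule update is a tensor contracted with two inside vectors.

% Simplified sketch, not the paper's exact notation.  For a span (i,j) headed by
% nonterminal a, collect the inside probabilities over latent states h = 1,...,m
% into a vector \alpha^{a,i,j} \in \mathbb{R}^m; each binary rule a -> b c has an
% associated tensor T^{a \to b c} \in \mathbb{R}^{m \times m \times m}.
\[
  \alpha^{a,i,j}_{h}
  \;=\;
  \sum_{a \to b\,c} \; \sum_{k=i}^{j-1} \; \sum_{h',h''=1}^{m}
  T^{a \to b c}_{h,h',h''} \,
  \alpha^{b,i,k}_{h'} \,
  \alpha^{c,k+1,j}_{h''}
\]
% Writing T^{a \to b c}(y,z) for the vector with entries \sum_{h',h''} T_{h,h',h''} y_{h'} z_{h''},
% the update reads  \alpha^{a,i,j} = \sum_{a \to b c} \sum_{k} T^{a \to b c}(\alpha^{b,i,k}, \alpha^{c,k+1,j}).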
- Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning, Shay B. Cohen and Noah A. Smith, Computational Linguistics (2012) [pdf] [abstract] [bibtex]
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk.
@article{cohen-12d,
author = "S. B. Cohen and N. A. Smith",
title = "Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning",
journal = "Computational Linguistics",
volume = "38",
number = "3",
pages = "479--526",
year = "2012"
}
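As a hedged illustration of the log-loss objective in the abstract above (my own toy setup, covering only the unconstrained supervised case, where minimizing the empirical risk reduces to relative-frequency estimation):

import math
from collections import Counter

# Toy "treebank" invented for illustration: each tree is a list of (lhs, rhs)
# rule occurrences.
toy_treebank = [
    [("S", ("NP", "VP")), ("NP", ("NN",)), ("VP", ("VB", "NN"))],
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("VB",))],
]

def relative_frequency_estimate(treebank):
    """Rule probabilities minimizing the empirical log-loss (supervised case)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def empirical_log_loss(treebank, probs):
    """Average negative log-probability of the trees under the grammar."""
    return sum(-math.log(probs[rule]) for tree in treebank for rule in tree) / len(treebank)

probs = relative_frequency_estimate(toy_treebank)
print(probs[("NP", ("NN",))])               # 0.5
print(empirical_log_loss(toy_treebank, probs))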
- Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance, Shay B. Cohen, Dipanjan Das and Noah A. Smith, In EMNLP 2011 [pdf] [abstract] [bibtex]
We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available,
using annotated data from a set of one or more helper languages.
Our approach is based on a model that locally mixes between supervised
models from the helper languages.
Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable.
We obtain state-of-the-art performance for two tasks of
structure prediction: unsupervised part-of-speech tagging and unsupervised dependency
parsing.
@inproceedings{cohen-11b,
author = "S. B. Cohen and D. Das and N. A. Smith",
title = "Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance",
booktitle = "Proceedings of EMNLP",
year = "2011"
}
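A minimal sketch of the "local mixture over helper languages" idea described above, with invented numbers (the model in the paper is richer; this only illustrates the mixing step):

# Toy sketch with invented numbers: mixing tag distributions predicted by
# supervised models trained on helper languages.  The mixture coefficients are
# the kind of quantity a learner in this setting would estimate from unlabeled
# target-language data.
helper_predictions = {
    "helper1": {"NOUN": 0.7, "VERB": 0.3},
    "helper2": {"NOUN": 0.4, "VERB": 0.6},
}
mixture_coefficients = {"helper1": 0.25, "helper2": 0.75}

tags = ["NOUN", "VERB"]
mixed = {t: sum(mixture_coefficients[h] * helper_predictions[h][t]
                for h in helper_predictions)
         for t in tags}
print(mixed)  # {'NOUN': 0.475, 'VERB': 0.525}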
- Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing, Shay B. Cohen, Carlos Gómez-Rodríguez and Giorgio Satta, In EMNLP 2011 [pdf] [abstract] [bibtex]
We describe a generative model for non-projective dependency parsing
based on a simplified version of a transition system that has recently appeared in the literature.
We then develop a dynamic programming parsing algorithm for our model,
and derive an inside-outside algorithm that can be used
for unsupervised learning of non-projective dependency trees.
@inproceedings{cohen-11c,
author = "S. B. Cohen and C. G{\'o}mez-Rodr{\'\i}guez and G. Satta",
title = "Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing",
booktitle = "Proceedings of EMNLP",
year = "2011"
}
- Products of Weighted Logic Programs, Shay B. Cohen, Robert J. Simmons and Noah A. Smith, In Theory and Practice of Logic Programming, 2011 [pdf] [abstract] [bibtex]
Weighted logic programming, a generalization of bottom-up logic
programming, is a well-suited framework for specifying dynamic
programming algorithms. In this setting, proofs correspond to the
algorithm's output space, such as a path through a graph or a
grammatical derivation, and are given a real-valued score (often
interpreted as a probability) that depends on the real weights of the base
axioms used in the proof. The desired output is a function over all
possible proofs, such as a sum of scores or an optimal score. We
describe the PRODUCT transformation, which can merge two weighted logic
programs into a new one. The resulting program optimizes a product
of proof scores from the original programs, constituting a scoring
function known in machine learning as a ``product of experts.''
Through the addition of intuitive constraining side conditions, we
show that several important dynamic programming
algorithms can be derived by applying PRODUCT to weighted logic programs
corresponding to simpler weighted logic programs.
@article{cohen-11a,
author = "S. B. Cohen and R. J. Simmons and N. A. Smith",
title = "Products of Weighted Logic Programs",
journal = "Theory and Practice of Logic Programming",
year = "2011"
}
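The following sketch (a toy example of my own, not the paper's formalism or the PRODUCT transformation itself) illustrates the weighted-deduction view described above: proofs correspond to paths in a graph, the score of a proof is the product of the weights of the axioms (edges) it uses, and the desired output aggregates over all proofs, e.g. the best score or the sum of scores.

# Toy sketch: proofs = paths, proof score = product of edge (axiom) weights.
edges = {
    ("s", "a"): 0.5, ("s", "b"): 0.4,
    ("a", "t"): 0.9, ("b", "t"): 0.7,
}

def aggregate_proof_scores(source, target, combine):
    """Aggregate the scores of all proofs (paths) from source to target.

    Real weighted logic programs share subproofs via dynamic programming;
    this naive enumeration only shows what quantity is being computed."""
    def proofs(node, score, visited):
        if node == target:
            yield score
            return
        for (u, v), w in edges.items():
            if u == node and v not in visited:
                yield from proofs(v, score * w, visited | {v})
    return combine(proofs(source, 1.0, {source}))

print(aggregate_proof_scores("s", "t", max))  # best proof score: 0.45
print(aggregate_proof_scores("s", "t", sum))  # total score over proofs: 0.73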
- Empirical Risk Minimization with Approximations of Probabilistic Grammars, Shay B. Cohen and Noah A. Smith, In NIPS, 2010 [pdf] [appendix - pdf] [abstract] [bibtex]
Probabilistic grammars are generative statistical models that are useful for
compositional and sequential structures. We present a framework, reminiscent of structural risk minimization,
for empirical risk minimization of the parameters of a fixed probabilistic
grammar using the log-loss. We derive sample complexity bounds in this framework that apply
both to the supervised setting and the unsupervised setting.
@inproceedings{cohen-10e,
author = "S. B. Cohen and N. A. Smith",
title = "Empirical Risk Minimization with Approximations of Probabilistic Grammars",
booktitle = "Proceedings of {NIPS}",
year = "2010"
}
- Covariance in Unsupervised Learning of Probabilistic Grammars, Shay B. Cohen and Noah A. Smith, In JMLR, 2010 [pdf] [abstract] [bibtex]
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text.
Their symbolic component is amenable to inspection by humans, while their probabilistic
component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning
algorithms. There has been an increased interest in using probabilistic grammars in the
Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior.
The Dirichlet prior has several limitations, including that it cannot directly model covariance
between the probabilistic grammar's parameters. Yet, various grammar parameters are expected to be correlated because
the elements in language they represent share linguistic properties.
In this paper, we suggest an alternative to the
Dirichlet prior, a family of logistic normal distributions.
We derive an inference algorithm for this family of distributions and
experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors
on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data.
@article{cohen-10d,
author = "S. B. Cohen and N. A. Smith",
title = "Covariance in Unsupervised Learning of Probabilistic Grammars",
journal = "Journal of Machine Learning Research",
volume = "11",
pages = "3017--3051",
year = "2010"
}
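A hedged sketch of the basic object behind the logistic normal priors discussed above (arbitrary mean and covariance, chosen only for illustration): draw a Gaussian vector and push it through the softmax, so the covariance matrix can encode correlations between a nonterminal's rule probabilities, something a Dirichlet prior cannot do directly.

import numpy as np

# Sketch: drawing one nonterminal's rule probabilities from a logistic normal
# prior.  The mean and covariance below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
mu = np.zeros(3)                           # one dimension per rule of the nonterminal
Sigma = np.array([[1.0, 0.8, 0.0],         # positive covariance ties rules 1 and 2
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

eta = rng.multivariate_normal(mu, Sigma)   # Gaussian draw
theta = np.exp(eta) / np.exp(eta).sum()    # softmax maps it onto the simplex
print(theta, theta.sum())                  # rule probabilities summing to 1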
- Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization, Shay B. Cohen and Noah A. Smith, In ACL 2010 [pdf] [abstract] [bibtex]
We consider the search for a maximum likelihood assignment of hidden derivations and
grammar weights for a probabilistic context-free grammar, the problem approximately solved by ``Viterbi training.''
We show that solving and even approximating Viterbi training for PCFGs
is NP-hard. We
motivate the use of uniform-at-random initialization
for Viterbi EM as an optimal initializer in the
absence of further information about the correct model
parameters, providing an approximate bound on the
log-likelihood.
@inproceedings{cohen-10c,
author = "S. B. Cohen and N. A. Smith",
title = "Viterbi Training for {PCFGs}: Hardness Results and Competitiveness of Uniform Initialization",
booktitle = "Proceedings of {ACL}",
year = "2010"
}
- Variational Inference for Adaptor Grammars, Shay B. Cohen, David M. Blei and Noah A. Smith, In NAACL 2010 [pdf] [abstract] [bibtex]
Adaptor grammars extend probabilistic context-free grammars to
define prior distributions over trees with ``rich get richer''
dynamics. Inference for adaptor grammars seeks to find parse trees
for raw text. This paper describes a variational inference
algorithm for adaptor grammars, providing an alternative to Markov
chain Monte Carlo methods. To derive this method, we develop a
stick-breaking representation of adaptor grammars, a representation
that enables us to define adaptor grammars with recursion.
We report experimental results on a word segmentation
task, showing that variational inference performs comparably to
MCMC. Further, we show a significant speed-up when parallelizing the
algorithm. Finally, we report promising results for a new
application for adaptor grammars, dependency grammar induction.
@inproceedings{cohen-10b,
author = "S. B. Cohen and D. M. Blei and N. A. Smith",
title = "Variational Inference for Adaptor Grammars",
booktitle = "Proceedings of {NAACL}",
year = "2010"
}
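Since the abstract above mentions a stick-breaking representation, here is a generic stick-breaking sketch (a standard GEM-style construction with an arbitrary concentration parameter, not the paper's adaptor-grammar-specific representation): beta-distributed fractions of a unit-length stick yield a sequence of weights that sum to one in the limit.

import numpy as np

# Generic stick-breaking sketch (truncated for illustration).
rng = np.random.default_rng(0)
concentration = 1.0
K = 20                                      # truncation level, illustration only

v = rng.beta(1.0, concentration, size=K)    # fraction of the remaining stick broken off
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
weights = v * remaining                     # weight_k = v_k * prod_{j<k} (1 - v_j)

print(weights[:5], weights.sum())           # partial weights; total < 1, rest is in the tail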
- Variational Inference for Grammar Induction with Prior Knowledge, Shay B. Cohen and Noah A. Smith, In ACL 2009 (short paper track) [pdf] [abstract] [bibtex]
Variational EM has become a popular technique in probabilistic NLP
with hidden variables.
Commonly, for computational tractability,
we make strong independence assumptions, such as the mean-field
assumption, in approximating posterior distributions over hidden
variables.
We show how a looser restriction on the approximate posterior, requiring
it to be a mixture, can help inject prior knowledge to exploit soft constraints
during the variational E-step.
@inproceedings{cohen-09b,
author = "S. B. Cohen and N. A. Smith",
title = "Variational Inference for Grammar Induction with Prior Knowledge",
booktitle = "Proceedings of {ACL}",
year = "2009"
}
- Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction, Shay B. Cohen and Noah A. Smith, In NAACL 2009 [pdf] [abstract] [bibtex]
We present a family of priors over probabilistic grammar weights, called the
shared logistic normal distribution. This
family extends the partitioned logistic normal distribution,
enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new
way to encode prior knowledge about an unknown grammar.
We describe a variational EM
algorithm for learning a probabilistic grammar based on this family of priors. We
then experiment with unsupervised dependency grammar induction and show
significant improvements using our model for both monolingual
learning and bilingual learning with a non-parallel, multilingual
corpus.
@inproceedings{cohen-09a,
author = "S. B. Cohen and N. A. Smith",
title = "Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction",
booktitle = "Proceedings of {NAACL}",
year = "2009"
}
- Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction, Shay B. Cohen, Kevin Gimpel and Noah A. Smith, In NIPS 2008 [pdf] [code] [abstract] [bibtex]
We explore a new Bayesian model for probabilistic grammars,
a family of distributions over discrete structures that includes
hidden Markov models and probabilistic context-free grammars.
Our model extends the correlated topic
model framework to probabilistic grammars, exploiting the logistic
normal distribution as a prior over the grammar parameters.
We derive a variational EM algorithm for that model,
and then experiment with the task of unsupervised grammar induction
for natural language dependency parsing. We show that our model
achieves superior results over
previous models that use different priors.
@inproceedings{cohen-08b,
author = "S. B. Cohen and K. Gimpel and N. A. Smith",
title = "Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction",
booktitle = "Proceedings of {NIPS}",
year = "2009"
}
- Dynamic Programming Algorithms as Products of Weighted Logic Programs, Shay B. Cohen, Robert J. Simmons and Noah A. Smith, In ICLP 2008 (best student paper award) [springer] [journal-version] [abstract] [bibtex]
Weighted logic programming, a generalization of bottom-up logic
programming, is a successful framework for specifying dynamic
programming algorithms. In this setting, proofs correspond to the
algorithm's output space, such as a path through a graph or a
grammatical derivation, and are given a weighted score, often
interpreted as a probability, that depends on the score of the base
axioms used in the proof. The desired output is a function over all
possible proofs, such as a sum of scores or an optimal score. We
describe the PRODUCT transformation, which can merge two weighted logic
programs into a new one. The resulting program optimizes a product
of proof scores from the original programs, constituting a scoring
function known in machine learning as a ``product of experts.''
Through the addition of intuitive constraining side conditions, we
show that several important dynamic programming
algorithms can be derived by applying PRODUCT to weighted logic programs corresponding to simpler weighted logic programs.
@inproceedings{cohen-08a,
author = "S. B. Cohen and R. J. Simmons and N. A. Smith",
title = "Dynamic Programming Algorithms as Products of Weighted Logic Programs",
booktitle = "Proceedings of {ICLP}",
year = "2008"
}
- Joint Morphological and Syntactic Disambiguation, Shay B. Cohen and Noah A. Smith, In EMNLP 2007 [pdf] [abstract] [bibtex]
In morphologically rich languages, should morphological and
syntactic disambiguation be treated sequentially or as a single problem? We describe several efficient,
probabilistically-interpretable ways to apply joint inference to morphological and syntactic disambiguation using
lattice parsing. Joint inference is shown to compare
favorably to pipeline parsing methods across a variety of
component models. State-of-the-art performance on Hebrew Treebank
parsing is demonstrated using the new method. The benefits of joint inference
are modest with the current component models, but appear to increase as components themselves
improve.
@inproceedings{cohen-07b,
author = "S. B. Cohen and N. A. Smith",
title = "Joint Morphological and Syntactic Disambiguation",
booktitle = "Proceedings of {EMNLP}",
year = "2007"
}
- Feature Selection Via Coalitional Game Theory, Shay B. Cohen, Gideon Dror and Eytan Ruppin, In Neural Computation 19:7, 2007 [pdf] [bibtex]
@article{cohen-07a,
author = "S. B. Cohen and G. Dror and E. Ruppin",
title = "Feature Selection via Coalitional Game Theory",
journal = "Neural Computation",
year = "2007"
}
- Feature Selection Based on the Shapley Value, Shay B. Cohen, Gideon Dror and Eytan Ruppin, In IJCAI 2005 [pdf] [bibtex]
@inproceedings{cohen-05,
author = "S. B. Cohen and G. Dror and E. Ruppin",
title = "Feature Selection Based on the {Shapley} Value",
booktitle = "Proceedings of {IJCAI}",
year = "2005"
}
Technical Reports and Other Papers
- Unsupervised Bilingual POS Tagging with Markov Random Fields, Desai Chen, Chris Dyer, Shay B. Cohen and Noah A. Smith, In EMNLP Workshop on Unsupervised Learning in NLP, 2011 [pdf]
- Social Links from Latent Topics in Microblogs, Kriti Puniyani, Jacob Eisenstein, Shay B. Cohen and Eric P. Xing, In NAACL Workshop on Social Media, 2010
- The Shared Logistic Normal Distribution for Grammar Induction, Shay B. Cohen and Noah A. Smith, In NIPS Workshop on Speech and Language: Unsupervised Latent-Variable Models, 2008 [pdf of NAACL paper]
- Products of Weighted Logic Programs, Shay B. Cohen, Robert J. Simmons and Noah A. Smith, Technical Report, CMU-LTI-08-009 [pdf of TPLP paper]
Teaching
Software
- dageem - code for unsupervised grammar induction using logistic normal priors
News: the code has been completely rewritten in Java and includes some extensions. You can check it out from the Google Code repository here.
- variational inference for adaptor grammars (coming soon -- email me if you would like to be notified when it is put online)
Last modified: June 2012.