COMS W3261
Computer Science Theory
Lecture 7: September 26, 2012
Context-Free Grammars
Outline
- Review
- Definition of a context-free grammar
- Derivations
- Leftmost and rightmost derivations
- Parse trees
- Ambiguity
1. Review
- Closure properties of regular languages
- Decision problems for regular languages
- Testing equivalence of states
- Testing equivalence of DFAs
- Minimizing the number of states in a DFA
2. Definition of a Context-Free Grammar (CFG)
- A CFG is a formalism for defining a language.
- A CFG has four components (V, T, P, S):
- V is a finite set of variables, also called nonterminals
or syntactic categories.
- Each variable represents a language.
- T is a finite set of symbols called terminals.
- The set of terminals is the alphabet of the language
defined by the grammar.
- P is a finite set of productions, rewrite rules of the form
A → α
where A is a nonterminal, called the head of the production, and
α is a string (possibly empty) of nonterminals and terminals,
called the body of the production.
- S is a nonterminal, called the start symbol.
- Example grammar G1:
- V = { S }
- T = { ( , ) }
- P is the set with the two productions
S → S ( S )
S → ε
- S is the start symbol.
- G1 generates the language consisting of all strings of balanced parentheses.
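The four components can be written down directly as data. The following is a minimal sketch, not part of the lecture: the names CFG, is_well_formed, and EPSILON are hypothetical, and the check that V and T are disjoint is the usual convention rather than something stated above.

```python
from typing import NamedTuple

EPSILON = ()  # the empty body, as in the production S -> ε


class CFG(NamedTuple):
    variables: frozenset    # V, the nonterminals
    terminals: frozenset    # T, the terminals
    productions: dict       # P: maps each head A to a list of bodies (tuples)
    start: str              # S, the start symbol


def is_well_formed(g: CFG) -> bool:
    """Check the conditions in the definition of (V, T, P, S)."""
    if g.variables & g.terminals:          # assumed convention: V, T disjoint
        return False
    if g.start not in g.variables:         # S must be a nonterminal
        return False
    symbols = g.variables | g.terminals
    for head, bodies in g.productions.items():
        if head not in g.variables:        # every head is a nonterminal
            return False
        for body in bodies:                # every body is a string over V and T
            if any(x not in symbols for x in body):
                return False
    return True


# The example grammar G1 from this section.
G1 = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"(", ")"}),
    productions={"S": [("S", "(", "S", ")"), EPSILON]},
    start="S",
)
```

Bodies are stored as tuples of symbols so that the empty body ε is simply the empty tuple.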
3. Derivations
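- A grammar defines a language through derivations: starting from the
start symbol, we repeatedly replace a nonterminal by the body of one of
its productions. Each replacement is one derivation step, written ⇒.
- Example of a derivation of ( )( ) from S in G1: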
S ⇒ S ( S )
⇒ S ( S ) ( S )
⇒ ( S ) ( S )
⇒ ( ) ( S )
⇒ ( ) ( )
- This derivation shows that ( )( ) is a string in the
language defined by G1.
- L(G), the set of all strings of terminals that can be derived
from the start symbol of a grammar G, is the language defined by G.
- We often call a string in L(G) a sentence of L(G).
- A string of terminals and nonterminals that can be derived from
the start symbol of a grammar is called a sentential form.
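To make the definition of L(G) concrete, the sketch below searches sentential forms breadth-first and collects the sentences of bounded length. It is an illustration under stated assumptions, not a general algorithm: the names G1_PRODUCTIONS and sentences_up_to are hypothetical, and the search is only guaranteed to terminate for grammars such as G1 in which nonterminals cannot multiply without also producing terminals.

```python
from collections import deque

# Productions of G1 as head -> list of bodies; () is the empty body ε.
G1_PRODUCTIONS = {"S": [("S", "(", "S", ")"), ()]}


def sentences_up_to(productions, start, max_len):
    """Return all sentences of length <= max_len, found by searching
    sentential forms breadth-first from the start symbol."""
    seen = {(start,)}
    queue = deque([(start,)])
    sentences = set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if there is one.
        pos = next((i for i, x in enumerate(form) if x in productions), None)
        if pos is None:                    # only terminals remain: a sentence
            sentences.add("".join(form))
            continue
        for body in productions[form[pos]]:
            new_form = form[:pos] + body + form[pos + 1:]
            n_terminals = sum(1 for x in new_form if x not in productions)
            # Terminals are never erased, so longer forms can be pruned.
            if n_terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sentences
```

With max_len = 4, the call sentences_up_to(G1_PRODUCTIONS, "S", 4) returns the four balanced strings of length at most 4: the empty string, (), ()(), and (()).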
4. Leftmost and Rightmost Derivations
- A derivation in which at each step we replace the leftmost nonterminal
by one of its production bodies is called a leftmost derivation.
- The derivation above is a leftmost derivation of ( )( ) from S in G1.
- A rightmost derivation is one in which at each step we replace the
rightmost nonterminal by one of its production bodies.
- Here is a rightmost derivation of ( )( ) from S in G1:
S ⇒ S ( S )
⇒ S ( )
⇒ S ( S ) ( )
⇒ S ( ) ( )
⇒ ( ) ( )
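The two derivations of ( )( ) can be replayed mechanically. In this sketch the function derive and the constants EXPAND and ERASE are hypothetical names; because G1 has a single nonterminal S, a step is identified by the production body alone.

```python
NONTERMINALS = {"S"}
EXPAND = ("S", "(", "S", ")")   # body of S -> S ( S )
ERASE = ()                      # body of S -> ε


def derive(bodies, leftmost=True, start="S"):
    """Apply the given production bodies in order, each time rewriting the
    leftmost (or rightmost) nonterminal; return every sentential form."""
    form = (start,)
    forms = [form]
    for body in bodies:
        positions = [i for i, x in enumerate(form) if x in NONTERMINALS]
        pos = positions[0] if leftmost else positions[-1]
        form = form[:pos] + body + form[pos + 1:]
        forms.append(form)
    return [" ".join(f) for f in forms]


# The leftmost derivation from section 3 and the rightmost derivation above.
left = derive([EXPAND, EXPAND, ERASE, ERASE, ERASE], leftmost=True)
right = derive([EXPAND, ERASE, EXPAND, ERASE, ERASE], leftmost=False)
```

Both lists end in the same sentence, ( ) ( ), but pass through different sentential forms: the second entry after S ( S ) is S ( S ) ( S ) in the leftmost case and S ( ) in the rightmost case.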
5. Parse Trees
- A derivation can be represented by a parse tree.
- Let G = (V, T, P, S) be a CFG. A parse tree for G is a tree in which:
- Each interior node is labeled by a nonterminal in V.
- Each leaf is labeled by a nonterminal, a terminal, or ε.
A leaf labeled ε must be the only child of its parent.
- If an interior node is labeled by a nonterminal A and its children are
labeled X1, X2, ... , Xk, then
A → X1X2 ... Xk is a production in P.
- The yield of a parse tree is the string obtained by
concatenating the labels of the leaves from left to right.
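- Derivations, leftmost derivations, rightmost derivations, parse trees,
and recursive inference are equivalent ways of showing that a string of
terminals w belongs to the language of a nonterminal A: if any one of
them exists for w, so do all the others.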
- A parser for a grammar G is a program that takes as input a string
and produces as output a parse tree for the string or a message
saying that the string cannot be generated by G.
- A parser generator is a program that takes as input a grammar G
and produces as output a parser for G. YACC is a widely used
parser generator.
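As a small illustration of parse trees, yields, and parsers, here is a hand-written parser for G1 only. It is a sketch, not the output of YACC or any parser generator: the names Node, tree_yield, and parse_g1 are hypothetical, and the input is assumed to contain no whitespace. Because S → S ( S ) is left recursive, the parser uses a loop and makes the tree built so far the left child of each new S node.

```python
class Node:
    """A parse-tree node: a label plus an ordered list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def tree_yield(self):
        """Concatenate the leaf labels from left to right (ε adds nothing)."""
        if not self.children:
            return "" if self.label == "ε" else self.label
        return "".join(c.tree_yield() for c in self.children)


def parse_g1(text):
    """Return a parse tree for text under G1, or None if text is not in L(G1)."""
    pos = 0

    def parse_s():
        nonlocal pos
        tree = Node("S", [Node("ε")])                  # S -> ε
        while pos < len(text) and text[pos] == "(":
            pos += 1                                   # consume "("
            inner = parse_s()
            if inner is None or pos >= len(text) or text[pos] != ")":
                return None
            pos += 1                                   # consume ")"
            # S -> S ( S ): the tree so far becomes the leftmost child.
            tree = Node("S", [tree, Node("("), inner, Node(")")])
        return tree

    tree = parse_s()
    return tree if tree is not None and pos == len(text) else None
```

For an accepted string, the yield of the returned tree equals the input, for example parse_g1("()()").tree_yield() == "()()". For a string such as ")(" the function returns None, which plays the role of the "cannot be generated by G" message.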
6. Ambiguity
- A grammar G is ambiguous if there is a sentence in L(G)
with two or more distinct parse trees.
- The following grammar G2 for arithmetic expressions is ambiguous
because a + a * a has two parse trees: one groups the string as
a + (a * a), the other as (a + a) * a.
E → E + E | E * E | ( E ) | a
- We can remove the ambiguity by specifying the associativity
and precedence of + and *.
- The grammar G3 below is unambiguous. It gives * higher precedence
than + and makes both * and + left associative.
E → E + T | T
T → T * F | F
F → ( E ) | a
- A context-free language L is inherently ambiguous if it
cannot be generated by an unambiguous grammar.
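The difference between G2 and G3 can be checked by counting parse trees. The sketch below is an illustration, with hypothetical names count_parse_trees, trees, and split. It assumes the grammar has no ε-productions and no cycles of unit productions, which holds for G2 and G3 but not for G1, and it assumes the input is written without spaces.

```python
from functools import lru_cache

G2 = {"E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("a",)]}
G3 = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("a",)],
}


def count_parse_trees(grammar, start, text):
    """Count the distinct parse trees with root `start` and yield `text`.

    Assumes no ε-productions and no cycles of unit productions, so every
    symbol in a body covers at least one character of the input.
    """

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of ways `symbol` derives text[i:j].
        if symbol not in grammar:                      # terminal symbol
            return 1 if text[i:j] == symbol else 0
        return sum(split(body, i, j) for body in grammar[symbol])

    @lru_cache(maxsize=None)
    def split(body, i, j):
        # Number of ways the symbols of `body` jointly derive text[i:j].
        if len(body) == 1:
            return trees(body[0], i, j)
        # First symbol takes text[i:k]; keep >= 1 character per later symbol.
        return sum(
            trees(body[0], i, k) * split(body[1:], k, j)
            for k in range(i + 1, j - len(body) + 2)
        )

    return trees(start, 0, len(text))


print(count_parse_trees(G2, "E", "a+a*a"))  # 2
print(count_parse_trees(G3, "E", "a+a*a"))  # 1
```

A count above 1 for some sentence shows that a grammar is ambiguous. A count of 1 for one sentence is only evidence about that sentence; it does not prove that the grammar is unambiguous.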
7. Practice Problems
- Construct a CFG that generates the language { a^n b^n | n ≥ 0 }.
- Prove that the language generated by the grammar G1 in section 2 consists of
exactly the strings of balanced parentheses.
- Construct a CFG that generates ELP = { w w^R | w is any string of
a's and b's }, where w^R denotes the reverse of w. This is the
language of even-length palindromes over the alphabet {a, b}.
A palindrome is a string that reads the same in both directions.
- Prove that ELP is not a regular language.
- Construct a CFG for all regular expressions over the alphabet {a, b}.
8. Reading Assignment
aho@cs.columbia.edu