# COMS W3261 CS Theory Lecture 7: Context-Free Grammars

## 1. Definition of Context-Free Grammar

• We now begin the study of context-free languages, the next class of languages in the Chomsky hierarchy that properly includes the class of regular languages.
• A context-free language is defined by a context-free grammar, a formalism that generates the strings in the language defined by the grammar.
• Context-free grammars are a key formalism for describing the syntactic structure of programming languages. They are also useful in the study of natural languages.
• A context-free grammar (CFG) G has four components (V, T, P, S) where
• V is a finite set of variables called nonterminals, or sometimes syntactic categories.
• T is a finite set of symbols called terminals.
• The set of terminals is the alphabet of the language defined by the grammar.
• P is a finite set of productions, rewrite rules of the form A → α where A is a nonterminal called the head of the production and α is a string (possibly empty) of nonterminals and terminals called the body of the production.
• S is a nonterminal, called the start symbol.
• Example context-free grammar G1:
1. V = { S }
2. T = { ( , ) }
3. P is the set with the two productions
S → S ( S )
S → ε
4. S is the start symbol.
We shall see that G1 generates the language consisting of all strings of balanced parentheses.

## 2. Derivations Using a Grammar

• A grammar defines a language. One common way is by means of derivations.
• Let αAβ be a string of terminals and nonterminals of a grammar G and let A → γ be a production of G. We can use this production to rewrite A by γ in αAβ to create the string αγβ. We define a relation ⇒ (read derives) on strings of terminals and nonterminals and write αAβ ⇒ αγβ to model this rewriting step. This rewrite is one step of a derivation.
• A derivation is a sequence of zero or more consecutive rewrite steps. We will use ⇒* to represent the reflexive and transitive closure of ⇒. We will also use ⇒n to represent an n-step derivation, n ≥ 0.
• A string of consisting only of terminals (possibly empty) that can be derived from the start symbol of G is called a sentence of G.
• A string of terminals and nonterminals that can be derived from the start symbol of a grammar is called a sentential form.
• L(G), the language defined by G, is the set of all sentences that can be derived from the start symbol of G.
• Here is an example of a derivation of the string ( )( ) from S in G1:
• S ⇒ S ( S )
⇒ S ( S ) ( S )
⇒ ( S ) ( S )
⇒ ( ) ( S )
⇒ ( ) ( )
• This derivation shows that ( )( ) is sentence in the language defined by G1. In each step of the derivation a nonterminal symbol S in a sentential form has been replaced by the body of a production that has S on the left hand side.
• We can write S ⇒* ( ) ( ) and S ⇒5 ( ) ( ).

## 3. Leftmost and Rightmost Derivations

• A derivation in which at each step we replace the leftmost nonterminal by one of its production bodies is called a leftmost derivation.
• The derivation above is a leftmost derivation of ( )( ) from S in G1.
• A rightmost derivation is one in which at each step we replace the rightmost nonterminal by one of its production bodies.
• Here is a rightmost derivation of ( )( ) from S in G1:
• S ⇒ S ( S )
⇒ S (  )
⇒ S ( S ) ( )
⇒ S ( ) ( )
⇒ ( ) ( )

## 4. Parse Trees

• A derivation can be represented by a parse tree.
• Let G = (V, T, P, S) be a CFG. A parse tree for G is a tree in which:
• Each interior node is labeled by a nonterminal in V.
• Each leaf is labeled by a nonterminal, or a terminal, or ε.
• If an interior node is labeled by a nonterminal A and its children are labeled X1, X2, …, Xk, then AX1X2Xk is a production in P.
• The yield of a parse tree is the string obtained by concatenating the labels of its leaves from left to right.
• A parser for a grammar G is a program that takes as input a string and produces as output a parse tree for the string or a message saying that the string cannot be generated by G.
• A parser generator is a program that takes as input a grammar G and produces as output a parser for G. YACC is a widely used bottom-up parser generator.

## 5. Ambiguity

• A grammar G is ambiguous if there is a sentence in L(G) with two or more distinct parse trees.
• The following grammar G2 for arithmetic expressions is ambiguous because a + a * a has two parse trees.
• E → E + E | E * E | ( E ) | a
• We can remove the ambiguity by specifying the associativity and precedence of the + and *.
• The grammar G3 below is unambiguous and makes * have higher precedence than + and makes both * and + left associative.
• E → E + T | T
T → T * F | F
F → ( E ) | a
• A context-free language is inherently ambiguous if it cannot be generated by an unambiguous grammar.

## 6. Practice Problems

1. Construct a CFG that generates the language { anbn | n ≥ 0 }.
2. Prove that the language generated by the grammar G1 in section 2 consists of all and only all strings of balanced parentheses.
3. Construct a CFG that generates ELP = { wwR | w is any string of a's and b's }. This is the language of even-length palindromes over the alphabet {a, b}. A palindrome is a string that reads the same in both directions.
4. Prove that ELP is not a regular language.
5. Construct an unambiguous CFG for all regular expressions over the alphabet {a, b}.

• HMU: Ch. 5

aho@cs.columbia.edu
verma@cs.columbia.edu