Lecture 7: Context-Free Grammars

- We now begin the study of context-free languages, the next class of languages in the Chomsky hierarchy that properly includes the class of regular languages.
- A context-free language is defined by a context-free grammar, a formalism that generates the strings in the language defined by the grammar.
- Context-free grammars are a key formalism for describing the syntactic structure of programming languages. They are also useful in the study of natural languages.
- A
*context-free grammar*(CFG)*G*has four components (*V*,*T*,*P*,*S*) where *V*is a finite set of variables called nonterminals, or sometimes syntactic categories.*T*is a finite set of symbols called terminals.- The set of terminals is the alphabet of the language
defined by the grammar.
*P*is a finite set of productions, rewrite rules of the form*A*→ α where*A*is a nonterminal called the head of the production and α is a string (possibly empty) of nonterminals and terminals called the body of the production.*S*is a nonterminal, called the start symbol. - Example context-free grammar G1:
*V*= {`S`

}*T*= { ( , ) }*P*is the set with the two productions`S → S ( S ) S → ε`

`S`

is the start symbol.- We shall see that G1 generates the language consisting of all strings of balanced parentheses.

- A grammar defines a language. One common way is by means of derivations.
- Let α
*A*β be a string of terminals and nonterminals of a grammar*G*and let*A*→ γ be a production of*G*. We can use this production to rewrite*A*by γ in α*A*β to create the string αγβ. We define a relation ⇒ (read derives) on strings of terminals and nonterminals and write α*A*β ⇒ αγβ to model this rewriting step. This rewrite is one step of a derivation. - A
*derivation*is a sequence of zero or more consecutive rewrite steps. We will use ⇒* to represent the reflexive and transitive closure of ⇒. We will also use ⇒^{n}to represent an*n*-step derivation,*n*≥ 0. - A string of consisting only of terminals (possibly empty) that can be derived
from the start symbol of
*G*is called a sentence of*G*. - A string of terminals and nonterminals that can be derived from the start symbol of a grammar is called a sentential form.
- L(
*G*), the*language defined by**G*, is the set of all sentences that can be derived from the start symbol of*G*. - Here is an example of a derivation of the string
`( )( )`

from`S`

in G1:

```
S ⇒ S ( S )
⇒ S ( S ) ( S )
⇒ ( S ) ( S )
⇒ ( ) ( S )
⇒ ( ) ( )
```

`( )( )`

is sentence in the
language defined by G1. In each step of the derivation a nonterminal symbol `S`

in a sentential form has been replaced by the body of a production
that has `S`

on the left hand side.`S ⇒* ( ) ( )`

and
`S ⇒`^{5} ( ) ( )

.- A derivation in which at each step we replace the leftmost nonterminal by one of its production bodies is called a leftmost derivation.
- The derivation above is a leftmost derivation of
`( )( )`

from`S`

in G1. - A rightmost derivation is one in which at each step we replace the rightmost nonterminal by one of its production bodies.
- Here is a rightmost derivation of
`( )( )`

from`S`

in G1:

```
S ⇒ S ( S )
⇒ S ( )
⇒ S ( S ) ( )
⇒ S ( ) ( )
⇒ ( ) ( )
```

- A derivation can be represented by a parse tree.
- Let
*G*= (*V*,*T*,*P*,*S*) be a CFG. A parse tree for*G*is a tree in which: - Each interior node is labeled by a nonterminal in
*V*. - Each leaf is labeled by a nonterminal, or a terminal, or ε.
- If an interior node is labeled by a nonterminal
*A*and its children are labeled*X*_{1},*X*_{2}, …,*X*_{k}, then*A*→*X*_{1}*X*_{2}…*X*_{k}is a production in*P*. - The
*yield*of a parse tree is the string obtained by concatenating the labels of its leaves from left to right. - A parser for a grammar
*G*is a program that takes as input a string and produces as output a parse tree for the string or a message saying that the string cannot be generated by*G*. - A parser generator is a program that takes as input a grammar
*G*and produces as output a parser for*G*. YACC is a widely used bottom-up parser generator.

- A grammar
*G*is ambiguous if there is a sentence in L(*G*) with two or more distinct parse trees. - The following grammar G2 for arithmetic expressions is ambiguous
because
`a + a * a`

has two parse trees.

`E → E + E | E * E | ( E ) | a`

`+`

and `*`

.`*`

have higher precedence than `+`

and makes both
`*`

and `+`

left associative.```
E → E + T | T
T → T * F | F
F → ( E ) | a
```

- Construct a CFG that generates the language
{
`a`

|^{n}b^{n}`n`

≥ 0 }. - Prove that the language generated by the grammar G1 in section 2 consists of all and only all strings of balanced parentheses.
- Construct a CFG that generates ELP = {
`ww`

|^{R}`w`

is any string of`a`

's and`b`

's }. This is the language of even-length palindromes over the alphabet {`a`

,`b`

}. A palindrome is a string that reads the same in both directions. - Prove that ELP is not a regular language.
- Construct an unambiguous CFG for all regular expressions over the alphabet {a, b}.

- HMU: Ch. 5

aho@cs.columbia.edu verma@cs.columbia.edu