COMS W4115
Programming Languages and Translators
Lecture 8: Context-Free Grammars
February 17, 2014
Lecture Outline
- Context-free grammars
- Derivations and parse trees
- Ambiguity
- Examples of context-free grammars
- Yacc: a language for specifying syntax-directed translators
1. Context-Free Grammars (CFG's)
- CFG's are very useful for representing the syntactic structure
of programming languages.
- The notation used to write a CFG is sometimes called Backus-Naur Form (BNF).
- A context-free grammar consists of
- A finite set of terminal symbols,
- A finite nonempty set of nonterminal symbols,
- One distinguished nonterminal called the start symbol, and
- A finite set of rewrite rules, called productions, each of the form
A → α
where A is a nonterminal and α is a string (possibly empty)
of terminals and nonterminals.
- Consider the context-free grammar G with the productions
E → E + T | T
T → T * F | F
F → ( E ) | id
- The terminal symbols are the alphabet from which strings are formed.
In this grammar the set of terminal symbols is
{ id, +, *, (, ) }. The terminal symbols are the token names.
- The nonterminal symbols are syntactic variables that denote sets
of strings of terminal symbols. In this grammar the set of nonterminal
symbols is { E, T, F }.
- The start symbol is E.
2. Derivations and Parse Trees
- L(G), the language generated by a grammar G, consists of all strings of
terminal symbols that can be derived from the start symbol of G.
- A leftmost derivation expands the leftmost nonterminal in
each sentential form:
E ⇒ E + T
⇒ T + T
⇒ F + T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
- A rightmost derivation expands the rightmost nonterminal in
each sentential form:
E ⇒ E + T
⇒ E + T * F
⇒ E + T * id
⇒ E + F * id
⇒ E + id * id
⇒ T + id * id
⇒ F + id * id
⇒ id + id * id
Note that these two derivations have the same parse tree.
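The two derivations above can be replayed mechanically. The following is a minimal sketch, not part of the original notes: it numbers the six productions of G from 0 to 5 (an assumed numbering), writes the token id as the single character i, and the function name derive is hypothetical. Given a list of production numbers, it rewrites the leftmost (or rightmost) nonterminal at each step and reports the final sentential form.

```c
#include <stddef.h>
#include <string.h>

/* Productions of the expression grammar G; 'i' stands for the token id.
   The numbering 0..5 is an assumption made for this sketch. */
struct production { char lhs; const char *rhs; };

static const struct production PRODS[] = {
    {'E', "E+T"}, {'E', "T"},      /* 0, 1 */
    {'T', "T*F"}, {'T', "F"},      /* 2, 3 */
    {'F', "(E)"}, {'F', "i"},      /* 4, 5 */
};

static int is_nonterminal(char c) { return c == 'E' || c == 'T' || c == 'F'; }

/* Apply the productions steps[0..n-1] starting from the start symbol E,
   always rewriting the leftmost nonterminal (or the rightmost one if
   rightmost is nonzero). Copies the final sentential form to out.
   Returns 0 on success and -1 if a step does not apply or space runs out. */
int derive(const int *steps, int n, int rightmost, char *out, size_t cap)
{
    char form[128] = "E";
    for (int k = 0; k < n; k++) {
        if (steps[k] < 0 || steps[k] >= 6) return -1;
        const struct production *p = &PRODS[steps[k]];
        size_t len = strlen(form), pos = len;
        for (size_t i = 0; i < len; i++) {
            if (is_nonterminal(form[i])) {
                pos = i;
                if (!rightmost) break;   /* first hit is the leftmost one */
            }
        }
        if (pos == len || form[pos] != p->lhs) return -1;
        size_t rlen = strlen(p->rhs);
        if (len - 1 + rlen >= sizeof form) return -1;
        /* Shift the tail (including the terminating NUL), then splice in rhs. */
        memmove(form + pos + rlen, form + pos + 1, len - pos);
        memcpy(form + pos, p->rhs, rlen);
    }
    if (strlen(form) >= cap) return -1;
    strcpy(out, form);
    return 0;
}
```

Under this numbering, the leftmost derivation shown above corresponds to the step list 0, 1, 3, 5, 2, 3, 5, 5 and the rightmost derivation to 0, 2, 5, 3, 5, 1, 3, 5. Both end in the same sentence, as expected for two derivations of one parse tree.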
3. Ambiguity
- Consider the context-free grammar G with the productions
E → E + E | E * E | ( E ) | id
This grammar has the following leftmost derivation for id + id * id:
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
This grammar also has the following leftmost derivation for id + id * id:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
These derivations have different parse trees.
A grammar is ambiguous if there is a sentence with two
or more parse trees.
The problem is that the grammar above does not specify
- the precedence of the + and * operators, or
- the associativity of the + and * operators.
However, the grammar in section (1) generates the same language
and is unambiguous because
it makes * of higher precedence than +, and makes both operators
left associative.
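To see why the choice of parse tree matters, the two trees for id + id * id can be evaluated with numbers substituted for the three id tokens. This is an illustrative sketch only; the node layout and the function names are assumptions, not part of the notes. The first leftmost derivation puts + at the root, the second puts * at the root.

```c
#include <stddef.h>

/* A parse-tree node for E -> E + E | E * E | id, where each id leaf
   carries a numeric value (an assumption made for illustration). */
struct node {
    char op;                          /* '+', '*', or 0 for an id leaf */
    double value;                     /* used only when op == 0 */
    const struct node *left, *right;  /* used only when op != 0 */
};

/* Evaluate a tree bottom-up. */
double eval(const struct node *t)
{
    if (t->op == 0) return t->value;
    double l = eval(t->left), r = eval(t->right);
    return t->op == '+' ? l + r : l * r;
}

/* Tree of the first derivation: + at the root, a + (b * c). */
double value_plus_at_root(double a, double b, double c)
{
    struct node x = {0, a, NULL, NULL}, y = {0, b, NULL, NULL}, z = {0, c, NULL, NULL};
    struct node times = {'*', 0, &y, &z};
    struct node plus  = {'+', 0, &x, &times};
    return eval(&plus);
}

/* Tree of the second derivation: * at the root, (a + b) * c. */
double value_times_at_root(double a, double b, double c)
{
    struct node x = {0, a, NULL, NULL}, y = {0, b, NULL, NULL}, z = {0, c, NULL, NULL};
    struct node plus  = {'+', 0, &x, &y};
    struct node times = {'*', 0, &plus, &z};
    return eval(&times);
}
```

For the values 2, 3, 4 the first tree evaluates to 14 and the second to 20, so a translator built on the ambiguous grammar could assign either meaning to the same sentence.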
A context-free language is inherently ambiguous if it
cannot be generated by any unambiguous context-free grammar.
The context-free language
{ a^m b^m a^n b^n | m > 0 and n > 0 } ∪ { a^m b^n a^n b^m | m > 0 and n > 0 }
is inherently ambiguous.
Most (all?) natural languages are inherently ambiguous but no
programming languages are inherently ambiguous.
Unfortunately, there is no algorithm to determine whether a CFG is ambiguous;
that is, the problem of determining whether a CFG is ambiguous is undecidable.
We can, however, give some practically useful sufficient conditions to guarantee that a CFG
is unambiguous.
4. Examples of Context-Free Grammars
- Nonempty palindromes of a's and b's.
(A palindrome is a string that reads the same forwards as backwards;
e.g., abba.)
- CFG:
S → a S a | b S b | a a | b b | a | b
- Note that the language generated by this grammar is not regular.
Can you prove this using the pumping lemma for regular languages?
- Strings with an equal number of a's and b's:
- CFG:
S → a S b | b S a | S S | ε
- Note that this grammar is ambiguous.
Can you find an equivalent unambiguous grammar?
- If- and if-else statements:
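stmt → if ( expr ) stmt else stmt
| if ( expr ) stmt
| other
Note that this grammar is ambiguous: in
if ( expr ) if ( expr ) other else other
the else can be attached to either if, giving two parse trees.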
Some typical programming language constructs:
stmt → expr ;
| if ( expr ) stmt
| for ( optexpr ; optexpr ; optexpr ) stmt
| other
optexpr → ε
| expr
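Returning to the first example in this section, the palindrome grammar S → a S a | b S b | a a | b b | a | b translates almost line for line into a recursive recognizer. The sketch below is one possible rendering under that reading; the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

static int is_ab(char c) { return c == 'a' || c == 'b'; }

/* Returns 1 if the substring s[lo..hi) can be derived from S, else 0. */
static int derives_S(const char *s, size_t lo, size_t hi)
{
    size_t n = hi - lo;
    if (n == 0) return 0;                    /* S does not derive the empty string */
    if (n == 1) return is_ab(s[lo]);         /* S -> a | b */
    if (!is_ab(s[lo]) || s[lo] != s[hi - 1]) /* both ends must be the same letter */
        return 0;
    if (n == 2) return 1;                    /* S -> a a | b b */
    return derives_S(s, lo + 1, hi - 1);     /* S -> a S a | b S b */
}

/* Membership test for the language of nonempty palindromes over {a, b}. */
int in_palindrome_language(const char *s)
{
    return derives_S(s, 0, strlen(s));
}
```

Each recursive call strips one matching pair of end symbols, mirroring one application of S → a S a or S → b S b. The recursion depth grows with the input, which is the kind of unbounded matching a finite automaton cannot do.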
5. Yacc: a Language for Specifying Syntax-Directed Translators
- Yacc is a popular language, created by
Steve Johnson of Bell Labs, for specifying and implementing syntax-directed
translators.
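- Bison is the GNU version of Yacc, upward compatible with the original Yacc.
It was originally written by Robert Corbett and made Yacc-compatible by
Richard Stallman; the Bison manual is by Charles Donnelly and Richard Stallman.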
Many other versions of Yacc are also available.
- The original Yacc used C for semantic actions. Yacc has been rewritten for
many other languages including Java, ML, OCaml, and Python.
- Yacc specifications
- A Yacc program has three parts:
declarations
%%
translation rules
%%
supporting C-routines
The declarations part may be empty and the last part (%%
followed by the supporting C-routines) may be omitted.
Here is a Yacc program for a desk calculator
that adds and multiplies numbers.
(See ALSU, p. 292, Fig. 4.59 for a more advanced desk calculator.)
%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double
%}
%token NUMBER
%left '+'
%left '*'
%%
lines : lines expr '\n'   { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      ;
expr  : expr '+' expr     { $$ = $1 + $3; }
      | expr '*' expr     { $$ = $1 * $3; }
      | '(' expr ')'      { $$ = $2; }
      | NUMBER
      ;
%%
/* the lexical analyzer; returns <token-name, yylval> */
int yylex() {
    int c;
    while ((c = getchar()) == ' ')
        ;   /* skip blanks */
    if ((c == '.') || (isdigit(c))) {
        ungetc(c, stdin);
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}
The declarations
%left '+'
%left '*'
make the operator + left associative and of lower
precedence than the left-associative operator *.
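The effect of these two declarations can be imitated by hand. The following sketch is not how Yacc works internally (Yacc uses the declarations to resolve conflicts in its generated parser); it is a small precedence-climbing parser, with hypothetical function names, that assumes single-character alphanumeric operands and no white space. It returns the input fully parenthesized, so that both precedence and associativity are visible.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Binding strength of an operator; 0 means "not an operator".
   Mirrors %left '+' followed by %left '*': the later line binds tighter. */
static int prec(char op) { return op == '+' ? 1 : op == '*' ? 2 : 0; }

static const char *cursor;   /* current input position (not reentrant) */

static int parse_expr(int min_prec, char *out, size_t cap);

/* primary -> operand | '(' expr ')' */
static int parse_primary(char *out, size_t cap)
{
    if (*cursor == '(') {
        cursor++;
        if (parse_expr(1, out, cap) != 0 || *cursor != ')') return -1;
        cursor++;
        return 0;
    }
    if (!isalnum((unsigned char)*cursor) || cap < 2) return -1;
    out[0] = *cursor++;
    out[1] = '\0';
    return 0;
}

/* Precedence climbing: consume operators of strength >= min_prec.
   Parsing the right operand with prec(op) + 1 makes op left associative. */
static int parse_expr(int min_prec, char *out, size_t cap)
{
    char lhs[256], rhs[256], tmp[256];
    if (parse_primary(lhs, sizeof lhs) != 0) return -1;
    while (prec(*cursor) >= min_prec) {
        char op = *cursor++;
        if (parse_expr(prec(op) + 1, rhs, sizeof rhs) != 0) return -1;
        int n = snprintf(tmp, sizeof tmp, "(%s%c%s)", lhs, op, rhs);
        if (n < 0 || (size_t)n >= sizeof tmp) return -1;
        strcpy(lhs, tmp);
    }
    if (strlen(lhs) >= cap) return -1;
    strcpy(out, lhs);
    return 0;
}

/* Writes the fully parenthesized form of src to out; returns 0 on
   success and -1 on a syntax error or if a buffer is too small. */
int parenthesize(const char *src, char *out, size_t cap)
{
    cursor = src;
    if (parse_expr(1, out, cap) != 0 || *cursor != '\0') return -1;
    return 0;
}
```

With this sketch, 1+2*3 comes back as (1+(2*3)) because * binds tighter, and 1+2+3 comes back as ((1+2)+3) because + groups to the left.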
On Linux, we can make a desk calculator from this Yacc program
as follows:
- Put the yacc program in a file, say desk.y.
- Invoke
yacc desk.y
to create the yacc output file y.tab.c.
- Compile this output file with a C compiler by typing
gcc y.tab.c -ly
to get a.out.
(The library -ly supplies default versions of main and yyerror; the parser
itself, yyparse, is in y.tab.c.)
- a.out is the desk calculator. Try it!
Note: With some versions of bison (bison 2.5 or later), this yacc program may
trigger a bison bug that produces the error message "In yyerror undefined
reference to 'rpl_fprintf'." One way to work around this bug is to define your own
yyerror function in the supporting C routines part of the yacc specification. Here is
one version of a yyerror function that suppresses this bug:
int yyerror(char const *message) {
    fputs(message, stderr);
    fputc('\n', stderr);
    return 0;
}
6. Practice Problems
- Let G be the grammar
S → a S b S | b S a S | ε.
- What language is generated by this grammar?
- Draw all parse trees for the sentence abab.
- Is this grammar ambiguous?
- Let G be the grammar
S → a S b | ε.
Prove that L(G) = { a^n b^n | n ≥ 0 }.
- Consider a sentence of the form
id + id + ... + id
where there are
n plus signs. Let G be the grammar in section (3) above.
How many parse trees are there in G for this sentence when n equals
- 1
- 2
- 3
- 4
- m?
- Write down a CFG for regular expressions over the alphabet { a, b }.
Show a parse tree for the regular expression a | b*a.
7. Reading
aho@cs.columbia.edu