COMS W4115
Programming Languages and Translators
Lecture 6: The Lexical-Analyzer Generator Lex
February 11, 2008
Lecture Outline
- Review
- Issues for the lexical analyzer
- Tokens/patterns/lexemes/attributes
- Specifying a lexical analyzer with Lex
- Creating a lexical processor with Lex
- Lex history
- Finite automata
1. Review
- The lexical analyzer
- Basic definitions from language theory
- Regular expressions
- Construct regular expressions for:
- All strings of a's and b's that contain the substring abb.
- All strings of a's and b's that do not contain the substring abb.
- All strings of a's and b's that contain the subsequence abb.
- All strings of a's and b's that do not contain the subsequence abb.
- All strings of a's, b's, and c's with no repeated
adjacent letters.
- All strings of a's, b's, and c's in which the letters are in
nondecreasing lexicographic order.
- IP addresses.
- English words ending in "dous".
- English words containing the five vowels in order.
- C comments. See Exercise 3.3.5(c).
2. Issues for a Lexical Analyzer
- Separation of lexical analysis from parsing
- simplicity
- efficiency
- portability
- Called by the parser: get next token
- Buffered reads for efficiency
- Coping with lexical errors
- types of lexical errors
- insertion/deletion/replacement/transposition
- edit distance
- panic mode of error recovery
3. Tokens/Patterns/Lexemes/Attributes
- a token is a pair consisting of a token name and
an optional attribute value.
- e.g., <id, ptr to symbol table>, <=>
- a pattern is a description of the form that the
lexemes making up a token in a source program may have.
- We will use regular expressions to denote patterns.
- e.g., identifiers in C:
[_A-Za-z][_A-Za-z0-9]*
- a lexeme is a sequence of characters that matches the pattern for a
token, e.g.,
- identifiers:
count, x1, i, position
- keywords:
if
- operators:
=, ==, !=, +=
- an attribute of a token is usually a pointer to the symbol
table entry that gives additional information about the token,
such as its type, value, line number, etc.
4. Specifying a Lexical Analyzer with Lex
- Lex is a special-purpose programming language for creating programs
to lexically process streams of input characters.
- Lex is ideally suited for constructing lexical analyzers, especially
for parsers generated by yacc.
- A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
The declarations section can contain declarations of variables,
manifest constants, and regular definitions. The declarations
section can be empty.
The translation rules are each of the form
pattern {action}
- Each pattern is a regular expression which may use regular definitions
defined in the declarations section.
- Each action is a fragment of C-code.
The auxiliary functions section, which begins after the second %%, is optional.
Everything in this section is copied directly to the file lex.yy.c
and can be used in the actions of the translation rules.
Example 1: Lex program to print all words in an input stream
- The following Lex program will print all alphabetic words in an input stream:
%%
[A-Za-z]+ { printf("%s\n", yytext); }
.|\n { }
The pattern part of the first translation rule says that if the
current prefix of the unprocessed input stream consists of a sequence of one or more letters,
then the longest such prefix is matched and assigned to the Lex string variable
yytext.
The action part of the first translation rule prints the prefix that was matched.
If this rule fires, then the matching prefix is removed
from the beginning of the unprocessed input stream.
The dot in the pattern part of the second translation rule matches any character except
a newline at the beginning of the unprocessed input stream. The \n
matches a newline at the beginning of the unprocessed input stream.
If this rule fires, then the character at the beginning of the unprocessed
input stream is removed.
Since the action is empty, no output is generated.
Lex repeatedly applies these two rules until the input stream is exhausted.
Example 2: Lex program to print number of words, numbers, and lines in a file
%{
#include <stdio.h>
int num_words = 0, num_numbers = 0, num_lines = 0;
%}
word [A-Za-z]+
number [0-9]+
%%
{word} {++num_words;}
{number} {++num_numbers;}
\n {++num_lines; }
. { }
%%
int main()
{
yylex();
printf("# of words = %d, # of numbers = %d, # of lines = %d\n",
num_words, num_numbers, num_lines);
return 0;
}
Example 3: Lex program for Pascal-like programming language tokens
- See ALSU, Fig. 3.23, p. 143.
%{ /* definitions of manifest constants
LT, LE,
IF, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws} { }
if {return(IF);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<" {yylval = LT; return(RELOP); }
"<=" {yylval = LE; return(RELOP); }
%%
int installID()
{
/* function to install the lexeme, whose first character
is pointed to by yytext, and whose length is yyleng,
into the symbol table; returns pointer to symbol table
entry */
}
int installNum()
{
/* analogous to installID */
}
The global variable yylval is shared between
the lexical analyzer and the yacc-generated parser.
5. Creating a Lexical Processor with Lex
- Put the Lex program into a file, say
lex.l.
- Compile the lex program with the command:
lex lex.l
This command produces an output file lex.yy.c.
Compile this output file with the C compiler and the lex library
-ll:
gcc lex.yy.c -ll
The resulting a.out is the lexical processor.
6. Lex History
- The initial UNIX version of lex was written by Michael Lesk at
Bell Labs.
- The second version of lex with more efficient regular expression
pattern matching was written by Eric Schmidt at Bell Labs.
- Vern Paxson wrote the POSIX-compliant variant of lex, called flex, at Berkeley.
- All versions of lex use variants of the regular-expression pattern-matching
technology described in Chapter 3 of the dragon book.
- Today, many versions of lex can be found for C, C++, C#, Java, and other languages.
7. Finite Automata
- Deterministic finite automata
- Nondeterministic finite automata
8. Reading Assignment
- Read Chapter 3, all sections except 3.9.
- See
The Lex & Yacc Page
for lex and flex tutorials and manuals.
aho@cs.columbia.edu