COMS W4115
Programming Languages and Translators
Lecture 5: Implementing a Lexical Analyzer
February 4, 2015

Outline

Variants of finite automata are commonly used to match regular expression patterns. This lecture summarizes the relevant results linking finite automata and regular expressions usually covered in computer science theory courses. Of particular relevance to the implementation of lexers are the algorithms that use deterministic and nondeterministic finite automata to match regular expression patterns.
A nondeterministic finite automaton (NFA) consists of

A finite set of states S.
An input alphabet consisting of a finite set of symbols Σ.
A transition function δ that maps S × (Σ ∪ {ε}) to subsets of S. This transition function can be represented by a transition graph in which the nodes are labeled by states and there is a directed edge labeled a from node w to node v if δ(w, a) contains v.
An initial state s₀ in S.
F, a subset of S, called the final (or accepting) states.

An NFA accepts an input string x iff there is a path in the transition graph from the initial state to a final state that spells out x.
The language defined by an NFA is the set of strings accepted by the NFA.
A deterministic finite automaton (DFA) is an NFA in which

There are no ε moves, and
For each state s and input symbol a there is exactly one transition out of s labeled a.

Every NFA can be converted to an equivalent DFA using the subset construction (Algorithm 3.20, ALSU, pp. 153-154).
Every DFA can be converted into an equivalent minimum-state DFA Using Algorithm 3.39, ALSU, pp. 181-183. All equivalent minimum-state DFAs are isomorphic up to state renaming.

Regular expressions and finite automata define the same class of languages, namely the regular sets.
Every regular expression can be converted into an equivalent NFA using the McNaughton-Yamada-Thompson algorithm (Algorithm 3.23, ALSU, pp. 159-161).
Every finite automaton can be converted into a regular expression using Kleene's algorithm.

The pumping lemma allows us to prove certain languages, like { aⁿbⁿ | n ≥ 0 }, are not regular.
The pumping lemma. If L is a regular language, then there exists a constant n associated with L such that for every string w in L where |w| ≥ n, we can partition w into three strings xyz (i.e., w = xyz) such that
- y is not the empty string,
- the length of xy is less than or equal to n, and
- for all k ≥ 0, the string xy^kz is in L.

Given a regular expression r and a string w, it is decidable whether r matches w.
Give a finite automaton A, it is decidable whether L(A) is empty.
Given two finite automata A and B, it is decidable whether L(A) = L(B).

Construct a deterministic finite automaton that will recognize all strings of 0's and 1's representing integers that are divisible by 3. Assume the empty string represents 0.
Use the McNaughton-Yamada-Thompson algorithm to convert the regular expression a(a|b)*a into a nondeterministic finite automaton.
Convert the NFA of (3) into a DFA.
Minimize the number of states in the DFA of (4).

ALSU Chapter 3, all sections except 3.9.
Russ Cox's article Regular Expression Matching Can Be Simple and Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) has a good historical account on the evolution of regular expression matching programs.