COMS W3261 CS Theory
Lecture 3: Regular Expressions
1. Nondeterministic Finite Automata with EpsilonTransitions
 An εNFA is an
NFA (Q, Σ, δ, q_{0}, F)
whose transition function δ is a mapping from
Q × (Σ ∪ {ε}) to P(Q), the set of subsets of
Q.
 The language of an εNFA is the set of all strings that
spell out a path from the start state to a final state.
There can be εtransitions along this path.
 Epsilonclosures
 We define ECLOSE(q), the εclosure of a state
q of an εNFA,
recursively as follows:
 State q is in ECLOSE(q).
 If state p is in ECLOSE(q), then all states
in δ(p, ε)
are also in ECLOSE(q).
 ECLOSE(q) is the set all states that can be reached from
state q using zero or more εtransitions.
 We can compute the εclosure of a set of
nondeterministic states S by taking the
union of the εclosures of all the individual states in S.
2. Converting an εNFA to an Equivalent DFA Using the Subset Construction
 We can eliminate all εtransitions from an εNFA
by converting it into an equivalent DFA using the subset construction.
 Given an εNFA E =
(Q_{E}, Σ, δ_{E},
q_{E},
F_{E}),
we construct the DFA D =
(Q_{D}, Σ, δ_{D},
q_{D},
F_{D})
as follows:
 Q_{D} = P(Q_{E}).
 δ_{D} is computed for all a in Σ
and S in P(Q_{E}) as follows:
 Let S = {p_{1}, p_{2}, …,
p_{k}} and let
{r_{1}, r_{2}, …, r_{m}}
be the union of
δ_{E}(p_{i}, a) for
i = 1, 2, …, k.
 Then, δ_{D}(S, a) =
ECLOSE({r_{1}, r_{2}, …,
r_{m}}).
 q_{D} = ECLOSE(q_{E}).
 F_{D} = { S 
S is in Q_{D} and
S contains a state in
F_{E} }.
 As with the subset construction for NFA's, we can prove by induction that
L(D) = L(E).
3. Regular Expressions
 Regular expressions are another very different formalism
for defining the regular languages.
 A regular expression E is an algebraic expression that denotes a language
L(E).
 Programming languages such as awk, java,
javascript, perl, python use regular expressions to match patterns in strings.
 There are differences in the regular expression notations used by various programming languages,
the most common variants being POSIX regular expressions and perlcompatible regular expressions.
 However, virtually all regularexpression notations have the operations
of union, concatenation, and Kleene closure.
We shall call regular expressions with just these three operators
Kleene regular expressions.
Kleene regular expressions
 We can specify Kleene regular expressions over an
alphabet Σ and the languages they denote using the following
inductive definition:
 Basis
 The constants ε and ∅ are regular expressions that denote
the languages { ε } and { }, respectively.
 A symbol c in Σ by itself is a regular expression that denotes the
language { c }.
 Induction: Let E and F be regular expressions
denoting the languages L(E) and L(F),
respectively.
 E + F is
a regular expression that denotes L(E) ∪ L(F).
 EF is
a regular expression that denotes L(E)L(F),
the concatenation of L(E) and L(F).
 E* is
a regular expression that denotes (L(E))*.
 (E) is
a regular expression that denotes L(E).
 Precedence and associativity of the regularexpression operators
 The regularexpression operator star has the highest precedence and is
left associative.
 The regularexpression operator concatenation has the next highest precedence and is
left associative.
 The regularexpression operator + has the lowest precedence and is
left associative.
 Thus the regular expression a + b*c would be grouped
a + ((b*)c).
 If a regular expression E denotes a language L
and a string w is in L, we will often say that
E matches w.
Examples of Kleene regular expressions and the languages they denote
 0*10* denotes the set of all strings of 0's and 1's containing a single 1.
 (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at
least one 1.
 1*(01*01*)* denotes the set of all strings of 0's and 1's containing an
even number of 0's.
 (a+b)*(abba+baab)(a+b)*
denotes the set of all strings of a's and b's
containing the substring abba
or the substring baab (or both).
4. POSIX Regular Expressions
 The IEEE standards group POSIX added a number of additional operators to Kleene regular expressions
to make it easier to specify languages. It also tried to standardize the different regularexpression
conventions used by various Unix utilities.
 Here is a list of some useful POSIX regularexpression operators
and the strings they match.
Some POSIX regular expression operators
 POSIX uses
?
to mean "zero or one instance of".
 The regular expression
a?b?c?
denotes the language
{ε, a, b, c, ab, ac, bc, abc
}. Thus a?b?c?
matches any of the eight strings in this language.
.
matches any character except a newline.
^
matches the empty string at the beginning of a line.
$
matches the empty string at the end of a line.
[abc]
matches an a
, b
, or c
.
[az]
matches any lowercase letter from a
to z
.
[AZaz09]
matches any alphanumeric character.
[^abc]
matches any character except an
a
, b
, or c
.
[^09]
matches any nonnumeric character.
a*
matches any string of zero or more a
's
(including the empty string).
a?
matches any string of zero or one a
's
(including the empty string).
a{2,5}
matches any string consisting of two to five a
's.
(a)
matches an a
.
 Note that in POSIX regular expressions the operator

(rather than +
) is used to denote union.
In POSIX regular expressions +
is a unary postfix operator
that matches one or more instances of its operand.
For example, [abc]+
matches any nonempty string
of a
's, b
's, and c
's.
\
is a metacharacter that turns off any special meaning of
the following character. For example, d\*g
matches the string d*g
.
Another example, \\
matches the string consisting of the
single character \
.
Examples of POSIX regular expressions and the strings they match
 The Unix command
egrep 'regexp' file
prints all lines in
file
that contain a substring matched by the
POSIX regular expression regexp
. Examples:
 The command
egrep 'dog' file
would print all lines in file
containing the substring
dog
.
 The command
egrep '^a?b?c?d?e?$' file
would print all lines in file
consisting of
an optional a
followed by
an optional b
followed by
an optional c
followed by
an optional d
followed by
an optional e
.
The metacharacters ^
and $
match the
empty strings at the beginning and end of a line, respectively.
 We can extend this example to find all words in a
wordlist in which the
letters are in strictly increasing alphabetic order.
(
aegilops
is the longest English word whose letters
are in strictly increasing alphabetic order.)
 When we say "regular expression" we mean a Kleene regular expression.
5. A detailed proof that a DFA defines a given language
 To prove that a DFA D defines a given language
L we need to show that every string in
L(D) is in L and that every string in L is in L(D).
Here is a detailed example of how
this can be done using induction for both parts.
 Consider the DFA D =
({
A, B
}, {0, 1}, δ, A
, {A
})
in Example 1 from Lecture 2 with start state A
,
set of final states {A
}, and the following transition
function δ:
State 
Input Symbol 
0 
1 
A 
B 
A 
B 
A 
B 
 Let L be the language consisting of all strings of 0’s and 1’s with an
even number
of zeros. We shall prove that L(D) = L.
 To prove L(D) is contained in L,
we shall prove the following inductive
hypothesis on n, the number of transitions made by D
processing a string w,
for n ≥ 0:
 Inductive Hypothesis IH1:
 (a) If δ^{n}(
A
, w) = A
,
then w has an even number of 0’s.
Here, δ^{n} means n transitions by D.
 (b) If δ^{n}(
A
, w) = B
,
then w has an odd number of 0’s.
 Basis. n = 0.
 Part (a). When n = 0, w must be the empty string
so by definition δ^{0}(
A
, ε) = A
.
Since the empty string has an even number of 0’s, part (a) is true.
 Part (b) is trivially true since
δ^{0}(
A
, ε) = B
is false.
 Induction. Assume the inductive hypothesis IH1 is true for n transitions
and suppose D makes n+1 transitions on a string w.
There are four cases to consider depending on whether w ends with a
0 or a 1 and whether D was in state
A
or state B
before processing the last symbol of w.
 Suppose w = x0 for some x and
δ^{n}(
A
, x) = A
.
From the inductive hypothesis we know that x has
an even number of 0’s so the string x0 has an odd number of 0's.
In this case D makes the sequence of transitions
δ^{n+1}(A
, x0)
= δ(A
, 0) = B
, satisfying
part (b) of the inductive hypothesis.
 Suppose w = x0 and
δ^{n}(
A
, x) = B
.
From the inductive hypothesis we know that x has
an odd number of 0’s so x0 has an even number of 0's.
In this case D makes the transitions
δ^{n+1}(A
, x0)
= δ(B
, 0) = A
, satisfying
part (a) of the inductive hypothesis.
 Suppose w = x1 and
δ^{n}(
A
, x) = A
.
From the inductive hypothesis we know that x has
an even number of 0’s so x1 still has an even number of 0's.
In this case D makes the transitions
δ^{n+1}(A
, x1)
= δ(A
, 1) = A
, satisfying
part (a) of the inductive hypothesis.
 Suppose w = x1 and
δ^{n}(
A
, x) = B
.
From the inductive hypothesis we know that x has
an odd number of 0’s so x1 still has an odd number of 0's.
In this case D makes the transitions
δ^{n+1}(A
, x1)
= δ(B
, 1) = B
, satisfying
part (b) of the inductive hypothesis.
 We have now shown that IH1 is true for all sequences of n moves, where
n ≥ 0.
Part (a) of IH1 implies that if D accepts a string w,
then w must have an even number of 0's.
This proves that L(D) is contained in L
 To prove L is contained in L(D), we need to show that
every string in L is accepted by D.
To do this, we shall prove the following inductive
hypothesis by induction on n, the length of an input string w,
for n ≥ 0:
 Inductive Hypothesis IH2:
 (a) If w has an even number of 0’s,
then δ*(
A
,w) = A
.
Here, δ* means zero or more moves by D.
(δ* is the same as deltahat in the text.)
 (b) If w has an odd number of 0’s,
then δ*(
A
,w) = B
.
 Basis. n = 0; that is, w = ε.
 The empty string has an even number of 0's.
D accepts the empty string since
the start state is a final state.
 Induction. Assume the inductive hypothesis
IH2 is true for all strings w of
length 0, 1, 2, . . . , n and consider a string
w of length n+1.
Here there are four cases to consider depending on whether w
ends with a 0 or a 1 and whether the string consisting of the first
n symbols of w has an even or odd number of 0's.
 Suppose w = x0 and x has an even number of 0’s.
From the inductive hypothesis IH2,
δ*(
A
,x) = A
.
Since since x0 has an odd number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(A
, 0) = B
,
this case satisfies part (b) of IH2.
 Suppose w = x0 and x has an odd number of 0’s.
From IH2,
δ*(
A
,x) = B
.
Since since x0 has an even number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(B
, 0) = A
,
this case satisfies part (a) of IH2.
 Suppose w = x1 and x has an even number of 0’s.
From IH2, δ*(
A
,x) = A
.
Since since x1 has an even number of 0's and D makes the
sequence of moves
δ*(A
,x1) =
δ(A
, 1) = A
,
this case satisfies part (a) of IH2.
 Suppose w = x1 and x has an odd number of 0’s.
From IH2, δ*(
A
,x) = B
.
Since since x1 has an odd number of 0's and D makes the
sequence of moves
δ*(A
,x0) =
δ(B
, 1) = B
,
this case satisfies part (b) of IH2.
 We have now shown that IH2 is true for all strings of length n, n ≥ 0.
 From the two inductive hypotheses, we can conclude that w is accepted by D
iff w is in L.
In other words, L(D) = L.
6. Practice Problems
 Construct an epsilonNFA that accepts all strings of
a
's
and b
's ending in abb
or in baa
.
 Show all sequences of moves that your NFA can make on the input string ababb.
 Use the subset construction to convert your epsilonNFA into an equivalent DFA.
 Do the two regular expressions
(a+b)* and (a*b*)* denote the same language?
 Write a Kleene regular expression for all strings of 0's and 1's with an
even number of 0's and an even number of 1's.
 Let L be the language
{
abxba
 x
is any string of a
's,
b
's, and c
's not containing ba
}.
This language models comments in the programming language C.
 Construct a regular expression for L.
 Explain how your regular expression defines the string
abcbaba
.
 Write a Kleene regular expression for all strings of a's
and b's that begin
and end with an a.
 Write a POSIX regular expression that matches all English words ending with the substring
dous
.
 Write a POSIX regular expression that matches all English words with the five vowels
a, e, i, o, u in order.
All five vowels have to appear in order but they do not have to be next
to one another. (Example:
facetious
.)
Note that vowels can be repeated. (Example: sacrilegious
.)
7. References
aho@cs.columbia.edu
verma@cs.columbia.edu