COMS W3261
Computer Science Theory
Lecture 2: September 10, 2012
Regular Expressions
Outline
- Review
- Regular expressions
- Examples of regular expressions
- Practice problems
1. Review: Operations on Languages
- Union
- L ∪ M = { w | w is in either L or M }
- Concatenation
- LM = { xy | x is in L and y is in M }
- Exponentiation
- L^0 = { ε }, the set containing only the empty string.
- L^i = L^(i-1) L, for i > 0
- Kleene closure
- L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
- Note that ∅* = ∅^0 = { ε }.
- Example: {a, b}* is the set of all strings of a's and b's (including ε).
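The operations above can be tried out on small finite languages. The following is a minimal Python sketch, assuming languages are represented as sets of strings and ε as the empty string ''. The function names (union, concat, power, closure_up_to) are illustrative choices. Because L* is usually infinite, closure_up_to only computes the finite part L^0 ∪ ... ∪ L^k.

```python
def union(L, M):
    """L ∪ M: strings that are in either L or M."""
    return L | M

def concat(L, M):
    """LM: every x in L followed by every y in M."""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^i, built from L^0 = {ε} by repeated concatenation with L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure_up_to(L, k):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

print(sorted(concat({"a", "ab"}, {"b", ""})))  # ['a', 'ab', 'abb']
print(sorted(closure_up_to({"a", "b"}, 2)))    # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print(closure_up_to(set(), 3))                 # {''}
```

The last line illustrates the note above: the closure of the empty language still contains ε, because ∅^0 = { ε } while every higher power of ∅ is empty.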
2. Regular Expressions
- A regular expression E is an algebraic expression that denotes a language
L(E).
- Programming languages such as awk, Java, JavaScript, Perl, and Python
use regular expressions to match patterns in strings.
- There are differences in the regular expression notations used by various programming languages,
the most common variants being POSIX regular expressions and Perl-compatible regular expressions.
- Virtually all regular-expression notations have the operations
of union, concatenation, and Kleene closure.
We shall call regular expressions with just these three operators Kleene regular expressions.
Kleene regular expressions
- Inductive definition of Kleene regular expressions over an
alphabet Σ:
- Basis
- The constants ε and ∅ are regular expressions that denote
the languages { ε } and { }, respectively.
- A symbol c in Σ by itself is a regular expression that denotes the
language { c }.
- Induction: Let E and F be regular expressions.
- E + F is
a regular expression that denotes L(E) ∪ L(F).
- EF is
a regular expression that denotes L(E)L(F),
the concatenation of L(E) and L(F).
- E* is
a regular expression that denotes (L(E))*.
- (E) is
a regular expression that denotes L(E).
- If a regular expression E denotes a language L
and a string w is in L, we will often say that
E matches w.
- Precedence and associativity of the regular-expression operators
- The regular-expression operator star has the highest precedence and is
left associative.
- The regular-expression operator concatenation has the next highest precedence and is
left associative.
- The regular-expression operator + has the lowest precedence and is
left associative.
- Thus the regular expression a + b*c would be grouped a + ((b*)c).
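The inductive definition and the precedence rules can be made concrete with a small enumerator. The sketch below is only an illustration: the tuple encoding of expressions and the function name lang are assumptions, not part of the lecture. Since L(E) may be infinite, lang(e, n) returns only the strings of L(e) of length at most n.

```python
def lang(e, n):
    """Strings of length <= n in L(e), following the inductive definition.

    Hypothetical encoding of expressions as tuples:
      ("eps",), ("empty",), ("sym", c),
      ("union", E, F), ("cat", E, F), ("star", E)
    """
    kind = e[0]
    if kind == "eps":                      # ε denotes { ε }
        return {""}
    if kind == "empty":                    # ∅ denotes { }
        return set()
    if kind == "sym":                      # c denotes { c }
        return {e[1]} if n >= 1 else set()
    if kind == "union":                    # E + F denotes L(E) ∪ L(F)
        return lang(e[1], n) | lang(e[2], n)
    if kind == "cat":                      # EF denotes L(E)L(F)
        return {x + y for x in lang(e[1], n) for y in lang(e[2], n)
                if len(x + y) <= n}
    if kind == "star":                     # E* denotes (L(E))*
        base = lang(e[1], n)
        result = {""}
        frontier = {""}
        while frontier:                    # stop once no new strings appear
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")

# a + b*c grouped by precedence as a + ((b*)c)
grouped = ("union", a, ("cat", ("star", b), c))
print(sorted(lang(grouped, 3)))   # ['a', 'bbc', 'bc', 'c']

# The different grouping (a + b)*c denotes a different language.
other = ("cat", ("star", ("union", a, b)), c)
print("ac" in lang(other, 3), "ac" in lang(grouped, 3))   # True False
```

The string ac separates the two groupings, which is why the precedence convention matters when parentheses are omitted.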
Examples of Kleene regular expressions and the languages they denote
- 0*10* denotes the set of all strings of 0's and 1's containing a single 1.
- (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at
least one 1.
- (a+b)*abba(a+b)* denotes the set of all strings of a's and b's
containing the substring abba.
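These examples can be checked mechanically. The sketch below uses Python's re module and assumes that the Kleene expression contains only symbols, parentheses, * and +, so that replacing + by | gives an equivalent Python pattern. The helper name matches is illustrative. re.fullmatch is used because a Kleene regular expression matches a string only if it describes the whole string.

```python
import re

def matches(kleene, w):
    """True if the Kleene regular expression matches the whole string w.

    Assumption: the expression uses only symbols, parentheses, * and +,
    so translating + to | yields an equivalent Python pattern.
    """
    return re.fullmatch(kleene.replace("+", "|"), w) is not None

print(matches("0*10*", "00100"))               # True: exactly one 1
print(matches("0*10*", "0110"))                # False: two 1's
print(matches("(0+1)*1(0+1)*", "0000"))        # False: no 1 at all
print(matches("(a+b)*abba(a+b)*", "babbab"))   # True: contains abba
```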
3. POSIX Regular Expressions
- The IEEE standards group POSIX added several operators to Kleene regular expressions
to make it easier to specify languages. It also tried to standardize the different regular-expression
conventions used by various Unix utilities.
- Here we list some of the more useful POSIX regular-expression
operators and describe the strings they match.
Some POSIX regular expression operators
- POSIX uses ? to mean "zero or one instance of".
- The regular expression a?b?c? denotes the language
{ε, a, b, c, ab, ac, bc, abc}. Thus a?b?c?
matches any of the eight strings in this language.
- . matches any character except a newline.
- ^ matches the empty string at the beginning of a line.
- $ matches the empty string at the end of a line.
- [abc] matches an a, b, or c.
- [a-z] matches any lowercase letter from a to z.
- [A-Za-z0-9] matches any alphanumeric character.
- [^abc] matches any character except an a, b, or c.
- [^0-9] matches any nonnumeric character.
- a* matches any string of zero or more a's (including the empty string).
- a? matches any string of zero or one a's (including the empty string).
- a{2,5} matches any string consisting of two to five a's.
- (a) matches an a.
- Note that in POSIX regular expressions the operator | (rather than +) is used to denote union.
In POSIX regular expressions + means "one or more instances of".
- \ is a metacharacter that turns off any special meaning of
the following character. For example, d\*g matches the string d*g.
As another example, \\ matches the string consisting of the
single character \.
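The operators listed above can be exercised with a short script. This is a sketch using Python's re module, which is not a POSIX implementation; the assumption is only that the operators shown here behave the same way on these small inputs. The helper name full is illustrative, and it tests whether a pattern matches an entire string.

```python
import re
from itertools import product

def full(pattern, s):
    """True if pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

# a?b?c? should match exactly eight strings over the alphabet {a, b, c}.
candidates = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]
language = [w for w in candidates if full(r"a?b?c?", w)]
print(len(language))            # 8

print(full(r"[a-z]", "q"))      # True
print(full(r"[^0-9]", "7"))     # False: 7 is a digit
print(full(r"a{2,5}", "aaa"))   # True
print(full(r"a{2,5}", "a"))     # False: fewer than two a's
print(full(r"a|b", "b"))        # True: | denotes union
print(full(r"a+", ""))          # False: + needs at least one a
print(full(r"d\*g", "d*g"))     # True: \* is a literal star
print(full(r"\\", "\\"))        # True: \\ is a single backslash
print(full(r".", "\n"))         # False: . does not match a newline
```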
Examples of POSIX regular expressions and the strings they match
- The Unix command
egrep 'regexp' file prints all lines in file
that contain a substring matched by the regular expression regexp. Examples:
- The command
egrep 'dog' file
would print all lines in file containing the substring
dog.
- The command
egrep '^a?b?c?d?e?$' file
would print all lines in file consisting only of the letters
a, b, c, d, e, each used at most once, in
increasing alphabetic order. Empty lines are printed as well,
since every operand is optional.
- The metacharacters ^ and $ match the
empty string at the beginning and end of a line, respectively.
- aegilops is the longest English word whose letters
are in increasing alphabetic order.
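The behavior of egrep on these examples can be imitated in a few lines of Python. This is a rough sketch, not a reimplementation of egrep: the function name egrep, the word lists, and the use of Python's re.search in place of a POSIX matcher are all assumptions made for illustration. Lines are given as strings without trailing newlines.

```python
import re
import string

def egrep(regexp, lines):
    """Return the lines that contain a substring matched by regexp."""
    pattern = re.compile(regexp)
    return [line for line in lines if pattern.search(line)]

words = ["dogma", "cat", "hotdog", "abe", "bad", "ace", ""]

print(egrep("dog", words))             # ['dogma', 'hotdog']
print(egrep("^a?b?c?d?e?$", words))    # ['abe', 'ace', '']

# Extending the same idea to the whole lowercase alphabet: ^a?b?...z?$
ordered = "^" + "".join(c + "?" for c in string.ascii_lowercase) + "$"
print(egrep(ordered, ["aegilops", "almost", "banana"]))   # ['aegilops', 'almost']
```

Without the anchors ^ and $, the second pattern would match every line, because a?b?c?d?e? matches the empty substring.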
4. Practice Problems
- Do the two regular expressions (a+b)* and (a*b*)* denote the same language?
- Write a Kleene regular expression for all strings of a's and b's with an
even number of a's.
- Write a Kleene regular expression for all strings of a's and b's that begin
and end with an a.
- Write a POSIX regular expression that matches all English words ending in dous.
- Write a POSIX regular expression that matches all English words with the five vowels
a, e, i, o, u in order.
(The vowels do not have to be next to one another.)
5. References
- HMU: Sects. 3.1, 3.3.1
- http://en.wikipedia.org/wiki/Regular_expression
aho@cs.columbia.edu