About regular expressions in egrep.

Written by Zeph Grunschlag

 

egrep is an acronym for "Extended Global Regular Expressions Print". It is a program which scans a specified file line by line and prints each line that contains a match for a given regular expression.

The standard egrep command looks like:

egrep <flags> '<regular expression>' <filename>

Some common flags are: -c to print only a count of the matching lines instead of the lines themselves, -i to make the search case insensitive, -n to print the line number before each matching line, -v to invert the match (i.e. return the lines which don't match), and -l to print only the names of the files that contain a matching line.
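To make the flags concrete, here is a small sketch run against a three-line file. The file name sample.txt and its contents are assumptions made for illustration only. On recent GNU systems egrep is a deprecated spelling of grep -E and may print a warning, but it accepts the same flags.

```shell
# Work in a scratch directory; sample.txt and its contents are hypothetical.
cd "$(mktemp -d)"
printf 'The cat\nthe dog\nA bird\n' > sample.txt

egrep -c 'the' sample.txt     # count of matching lines: only "the dog" matches
egrep -i 'the' sample.txt     # ignore case: "The cat" and "the dog"
egrep -n 'bird' sample.txt    # prefix the line number: "3:A bird"
egrep -v 'the' sample.txt     # invert the match: "The cat" and "A bird"
egrep -l 'dog' sample.txt     # print only the file name: "sample.txt"
```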

egrep uses its own variant of regular expression syntax, which is very similar to the regular expressions found in emacs, perl, etc. The basic operations are:

operation               egrep notation   egrep usage    theoretical equivalent

union/or                |                011|1101       {011,1101}

Kleene star             *                (011|1101)*    {011,1101}*

Kleene plus             +                (011|1101)+    {011,1101}+

may or may not appear   ?                (011)?         {ε,011}

(The Greek letter ε (epsilon) represents the empty string.)
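The four operators can be tried on a small file of bitstrings. The file name bits.txt and its contents are assumptions made for illustration. Because egrep reports a line as soon as any part of it matches, the patterns in this sketch are wrapped in ^ and $ (start and end of line) so that the whole line has to match; otherwise the starred and optional forms would match every line via the empty string.

```shell
# Work in a scratch directory; bits.txt and its contents are hypothetical.
cd "$(mktemp -d)"
# One candidate string per line; the fourth line is empty.
printf '011\n1101\n0111101\n\n0110\n011011\n' > bits.txt

egrep -c '^(011|1101)$'  bits.txt   # union: 011 and 1101, so 2
egrep -c '^(011|1101)*$' bits.txt   # star: every line except 0110, so 5
egrep -c '^(011|1101)+$' bits.txt   # plus: as star, minus the empty line, so 4
egrep -c '^(011)?$'      bits.txt   # optional: 011 and the empty line, so 2
```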

One of the main differences between egrep regular expressions and theoretical regular expressions is that in egrep, matches are allowed to occur anywhere within the string, while in the theoretical usage, matches always start from the first character of the string and end at the last character. For example, consider the string 000001000. In egrep, the regular expression 010 gives a match; on the other hand, the theoretical regular expression 010 does not match 000001000 because 010 and 000001000 are not equal. The theoretical equivalent of egrep's 010 is (0+1)*010(0+1)*, where + is the theoretical notation for union rather than egrep's Kleene plus. What if you actually want to consider the beginning and end of strings? egrep provides you with the caret symbol ^ for specifying the beginning, and the dollar symbol $ for the end. So the egrep equivalent of the theoretical 010 is given by ^010$.
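The difference between the unanchored and anchored forms can be checked directly from the command line. This is only a sketch that feeds the string from the example above to egrep on standard input; the "no match" message is added here because egrep prints nothing when no line matches.

```shell
echo 000001000 | egrep '010'                        # prints 000001000
echo 000001000 | egrep '^010$' || echo "no match"   # prints no match
echo 010       | egrep '^010$'                      # prints 010
```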

Another useful egrep symbol pair is the word boundaries \< and \>, which respectively denote the beginning and end of a word.
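A short sketch of the word boundaries in action, as implemented in GNU grep (other implementations may differ). The file name words.txt and its contents are assumptions made for illustration.

```shell
# Work in a scratch directory; words.txt and its contents are hypothetical.
cd "$(mktemp -d)"
printf 'the cat\nthem all\nbathe\nother\n' > words.txt

egrep '\<the\>' words.txt   # "the" as a whole word: the cat
egrep '\<the'   words.txt   # words starting with "the": the cat, them all
egrep 'the\>'   words.txt   # words ending with "the": the cat, bathe
```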

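To specify a set or range of characters, use square brackets. To negate the set, use the hat symbol ^ as the first character inside the brackets. For example, [01] matches a single bit, [a-z] matches any lowercase letter, and [^0-9] matches any single character that is not a digit.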
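Character sets are written in square brackets and can be tried out with a short sketch like the following. The file name chars.txt and its contents are assumptions made for illustration.

```shell
# Work in a scratch directory; chars.txt and its contents are hypothetical.
cd "$(mktemp -d)"
printf '0101\nabc\na1\n42\n' > chars.txt

egrep '^[01]+$'   chars.txt   # lines made only of bits: 0101
egrep '^[a-z]+$'  chars.txt   # lines made only of lowercase letters: abc
egrep '^[^0-9]+$' chars.txt   # lines with no digit at all: abc
egrep '[^0-9]'    chars.txt   # lines with at least one non-digit: abc, a1
```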

A few examples:

 

match all lines in searchfile.txt which start with a non-empty bitstring, followed by a space, followed by a non-empty alphabetic word which ends the line

 

count the number of lines in lots_o_bits which either start with 1 or end with 01

 

count the number of lines with at least eleven 1's

 

list all the lines in myletter.txt containing the word "the", regardless of case
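One possible way to write the four examples above is sketched below. The file contents are invented so that the commands have something to run on, and ones.txt is a hypothetical file name, since the third example does not name a file. Other patterns would work equally well; for instance, the interval {11} could be replaced by writing 1.* eleven times.

```shell
# Work in a scratch directory; all file contents below are hypothetical.
cd "$(mktemp -d)"

# 1. A non-empty bitstring, one space, then an alphabetic word ending the line.
printf '0110 hello\n101 World\n12 abc\n0101 abc1\nhello 0110\n' > searchfile.txt
egrep '^[01]+ [a-zA-Z]+$' searchfile.txt   # prints the first two lines

# 2. Count the lines that start with 1 or end with 01.
printf '1000\n0001\n0010\n101\n0\n01\n' > lots_o_bits
egrep -c '^1|01$' lots_o_bits              # 4

# 3. Count the lines with at least eleven 1's.
five=11111
ten="$five$five"
pairs=1010101010                           # contains five 1's
printf '%s\n' "${ten}1" "$ten" "${pairs}${pairs}10" 0000 > ones.txt
egrep -c '(1.*){11}' ones.txt              # 2

# 4. Lines containing the word "the" in any mix of upper and lower case.
printf 'Dear friend,\nThe weather is fine.\nI saw them at THE park.\nOther news later.\n' > myletter.txt
egrep -i '\<the\>' myletter.txt            # prints the second and third lines
```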