Regular Expressions

A regular expression is a string template against which you can match a piece of text. Regular expressions are something like shell wildcard expressions, but much more powerful.

Examples of Regular Expressions

This bit of code loops through each line of a file, finds all lines containing an EcoRI site, and bumps up a counter:

Code:

#!/usr/bin/perl -w
#file: EcoRI1.pl

use strict;

my $filename = "example.fasta";
open (FASTA, $filename) or die "Cannot open $filename: $!\n";
my $sites;

while (my $line = <FASTA>) {
  chomp $line;

  if ($line =~ /GAATTC/){ 
    print "Found an EcoRI site!\n";
    $sites++;
  }
}

if ($sites){
  print "$sites EcoRI sites total\n";
}else{
  print "No EcoRI sites were found\n";
}

#note: if $sites is declared inside while loop you would not be able to
#print it outside the loop

Output:
~]$ ./EcoRI1.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total


This Works Too!

Code:
#file:EcoRI2.pl

while ( <FASTA> ) {
  chomp;
  if ($_ =~ /GAATTC/){
    print "Found an EcoRI site!\n";
    $sites++;
  }
}
Output:
~]$ ./EcoRI2.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total


This Also Works

Code:
#file:EcoRI.pl

while ( <FASTA> ) {
  chomp;
  if (/GAATTC/){
    print "Found an EcoRI site!\n";
    $sites++;
  }
}
By default, a regular expression examines $_ and returns true if it matches, false otherwise.
Output:
~]$ ./EcoRI.pl
Found an EcoRI site!
Found an EcoRI site!
.
.
.
Found an EcoRI site!
Found an EcoRI site!
34 EcoRI sites total

This does the same thing, but counts one type of methylation site (Pu-C-X-G) instead. Note that the pattern uses .?, so the X position is optional and Pu-C-G matches as well:

Code:
#file:methy.pl

while (<FASTA>) {
  chomp;

  if (/[GA]C.?G/){    #What Happens If Your File Is Not All In CAPS
    #print "Found a Methylation Site!\n";
    $sites++;
  }
}
if ($sites){
  print "$sites Methylation Sites total\n";
}else{
  print "No Methylation Sites were found\n";
}

  
Output:
~]$ ./methy.pl
723 Methylation Sites total
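
The comment in methy.pl asks what happens when the file is not all in capitals. A minimal sketch, assuming a few made-up test lines rather than a real FASTA file: adding the i flag after the closing slash makes the match ignore case, so lower-case sequence is counted too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test lines; the second one is in lower case.
my @lines = ('TTGCAGAA', 'ttgcagaa', 'TTTTTTTT');

my $strict  = 0;   # lines matched by the original pattern
my $relaxed = 0;   # lines matched once the i flag is added

foreach my $line (@lines) {
  $strict++  if $line =~ /[GA]C.?G/;    # upper case only
  $relaxed++ if $line =~ /[GA]C.?G/i;   # i flag: ignore case
}

print "case-sensitive: $strict, case-insensitive: $relaxed\n";
```

With these three lines the case-sensitive pattern finds one line and the case-insensitive one finds two.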

Regular Expression Atoms

A regular expression is normally delimited by two slashes ("/"). Everything between the slashes is a pattern to match. Patterns can be made up of the following atoms:

  1. Ordinary characters: a-z, A-Z, 0-9 and some punctuation. These match themselves.

  2. The "." character, which matches any single character except a newline.

  3. A bracket list of characters, such as [AaGgCcTtNn], [A-F0-9], or [^A-Z] (the last means anything BUT A-Z).

  4. Certain predefined character sets:
    \d    The digits [0-9]
    \w    A word character [A-Za-z_0-9]
    \s    White space [ \t\n\r]
    \D    A non-digit
    \W    A non-word character
    \S    Non-whitespace

  5. Anchors:
    ^     Matches the beginning of the string
    $     Matches the end of the string
    \b    Matches a word boundary (between a \w and a \W)

Examples:
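
A minimal sketch of the atoms listed above. The sequence and the other test strings are made up for illustration; every pattern here is expected to match its string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq  = 'AAGAATTCGG';   # made-up sequence
my $hits = 0;

$hits++ if $seq =~ /GAATTC/;             # ordinary characters match themselves
$hits++ if $seq =~ /GA.TTC/;             # "." stands in for any one character
$hits++ if $seq =~ /^[AG]/;              # string begins with A or G
$hits++ if $seq =~ /[CG]$/;              # string ends with C or G
$hits++ if 'ACGTNN'     =~ /[^ACGT]/;    # contains something other than A, C, G, T
$hits++ if 'clone 42'   =~ /\s\d\d$/;    # white space, then two digits at the end
$hits++ if 'EcoRI site' =~ /\bsite\b/;   # "site" as a whole word

print "$hits of 7 patterns matched\n";
```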

Quantifiers

By default, an atom matches once. This can be modified by following the atom with a quantifier:

?        atom matches zero times or exactly once
*        atom matches zero or more times
+        atom matches one or more times
{3}      atom matches exactly three times
{2,4}    atom matches between two and four times, inclusive
{4,}     atom matches at least four times

Examples:
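
A minimal sketch of each quantifier in the list above, using made-up strings. The two lines that count misses show cases where the quantifier is not satisfied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hits   = 0;   # patterns that matched
my $misses = 0;   # patterns that did not match

$hits++ if 'GATC'   =~ /GAT?C/;       # ?     : the T is present once
$hits++ if 'GAC'    =~ /GAT?C/;       # ?     : the T is absent
$hits++ if 'GC'     =~ /GA*C/;        # *     : zero A's is fine
$hits++ if 'GAAAC'  =~ /GA+C/;        # +     : one or more A's
$misses++ unless 'GC' =~ /GA+C/;      # +     : needs at least one A
$hits++ if 'TAAAT'  =~ /TA{3}T/;      # {3}   : exactly three A's
$hits++ if 'GAAAC'  =~ /GA{2,4}C/;    # {2,4} : three A's is within range
$misses++ unless 'GAAAAAC' =~ /GA{2,4}C/;   # {2,4} : five A's is too many
$hits++ if 'TTTTTT' =~ /T{4,}/;       # {4,}  : at least four T's

print "$hits matched, $misses did not\n";
```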

Alternatives and Grouping

A set of alternative patterns can be specified with the | symbol:

/wolf|sheep/;   # matches "wolf" or "sheep"

/big bad (wolf|sheep)/;   # matches "big bad wolf" or "big bad sheep"

You can combine parentheses and quantifiers to quantify entire subpatterns:

/Who's afraid of the big (bad )?wolf\?/;
# matches "Who's afraid of the big bad wolf?" and
#         "Who's afraid of the big wolf?"

This also shows how to literally match the special characters -- put a backslash (\) in front of them.
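
A small sketch of why the backslash matters, assuming made-up file names: an unescaped "." matches any character, while "\." matches only a literal dot. The same applies to parentheses and the question mark.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hits   = 0;   # patterns that matched
my $misses = 0;   # patterns that did not match

$hits++ if 'pUC19.fasta' =~ /\.fasta$/;       # \. matches the literal dot
$hits++ if 'pUC19_fasta' =~ /.fasta$/;        # bare . also accepts "_"
$misses++ unless 'pUC19_fasta' =~ /\.fasta$/; # \. rejects "_"
$hits++ if 'What (now)?' =~ /\(now\)\?/;      # literal parentheses and ?

print "$hits matched, $misses did not\n";
```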

Specifying the String to Match

Regular expressions will attempt to match $_ by default. To specify another string variable, use the =~ (binding) operator:

$h = "Who's afraid of Virginia Woolf?";
print "I'm afraid!\n" if $h =~ /Woo?lf/;

There's also an equivalent "not match" operator !~, which reverses the sense of the match:

$h = "Who's afraid of Virginia Woolf?";
print "I'm not afraid!\n" if $h !~ /Woo?lf/;

Using a Different Delimiter

If you want to match slashes in the pattern, you can backslash them:

$file = '/usr/local/blast/cosmids.fasta';
print "local file" if $file =~ /^\/usr\/local/;

This is ugly, so you can specify any match delimiter with the m (match) operator:

$file = '/usr/local/blast/cosmids.fasta';
print "local file" if $file =~ m!^/usr/local!;

The punctuation character that follows the m becomes the delimiter. In fact // is just an abbreviation for m//. Almost any punctuation character will work:
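
As a sketch, here is the same match written with several different delimiters. The choice of delimiters is illustrative; $file is the path from the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = '/usr/local/blast/cosmids.fasta';
my $hits = 0;

$hits++ if $file =~ m!^/usr/local!;
$hits++ if $file =~ m#^/usr/local#;
$hits++ if $file =~ m@^/usr/local@;
$hits++ if $file =~ m,^/usr/local,;
$hits++ if $file =~ m{^/usr/local};
$hits++ if $file =~ m[^/usr/local];

print "$hits of 6 delimiters matched\n";
```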

The last two examples show that you can use left-right bracket pairs as well.

Matching with a Variable Pattern

You can use a scalar variable for all or part of a regular expression. For example:

$pattern = '/usr/local';
print "matches" if $file =~ /^$pattern/;
See the o flag for important information about using variables inside patterns.
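
A minimal sketch of variable patterns in a loop. The sequence is made up, and the hash name %site_for is hypothetical; the recognition sites are stored as ordinary strings and interpolated into the pattern one at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Recognition sites stored as plain strings (hash name is hypothetical).
my %site_for = (
  EcoRI   => 'GAATTC',
  BamHI   => 'GGATCC',
  HindIII => 'AAGCTT',
);

my $seq = 'TTGAATTCAAGGATCCTT';   # made-up sequence
my @found;

foreach my $enzyme (sort keys %site_for) {
  my $pattern = $site_for{$enzyme};
  # The variable is interpolated into the pattern before matching.
  push @found, $enzyme if $seq =~ /$pattern/;
}

print "Sites found: @found\n";
```

With this sequence the EcoRI and BamHI sites are found and the HindIII site is not.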

Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Sun Oct 15 20:59:43 EDT 2000