Regular Expression Options

Regular expression matches and substitutions have a whole set of options which you can toggle on by appending one or more of the i, g, e, o, m, s, or x modifiers to the end of the operation. See Programming Perl, page 153, for more information. For example:
$string = 'Big Bad WOLF!';
print "There's a wolf in the closet!" if $string =~ /wolf/i;
# i is used for a case insensitive match

i
Case insensitive match.

g
Global match (see below).

e
Evaluate the right side of s/// as an expression.

o
Only compile variable patterns once (see below).

m
Treat string as multiple lines. ^ and $ will match at start and end of internal lines, as well as at beginning and end of whole string. Use \A and \Z to match beginning and end of whole string when this is turned on.

s
Treat string as a single line. "." will match any character at all, including newline.

x
Allow extra whitespace and comments in pattern.
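
The short script below is a minimal sketch of the m, s, x, and e modifiers described above. The sample strings and variable names are made up for illustration and are not part of the original notes.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /m: ^ matches at the start of every line, so both words are captured.
my @starts = $text =~ /^(\w+)/mg;
print "@starts\n";                      # first second

# /s: "." also matches newline, so the capture can span both lines.
my ($span) = $text =~ /first(.+)second/s;
print "matched across a newline\n" if defined $span;

# /x: whitespace and comments inside the pattern are ignored.
my ($codon) = 'ATGGCC' =~ /
    ^ ( [ACGT]{3} )    # capture the first three bases
/x;
print "$codon\n";                       # ATG

# /e: the right side of s/// is evaluated as Perl code.
(my $doubled = 'a1 b2 c3') =~ s/(\d)/$1 * 2/ge;
print "$doubled\n";                     # a2 b4 c6
```

Without the s modifier the second match would fail, because "." would stop at the newline.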

Global Matches

Adding the g modifier to the pattern causes the match to be global. Called in a scalar context (such as the condition of an if or while statement), each evaluation finds the next match, resuming where the previous one ended, and returns false once no matches remain.

This will match all codons in a DNA sequence, printing them out on separate lines:

Code:

  $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';
  while ( $sequence =~ /(.{3})/g ) {
    print $1,"\n";
  }

Output:

GTT
GCC
TGA
AAT
GGC
GGA
ACC
TTG

If you perform a global match in a list context (e.g. assign its result to an array), then you get a list of all the subpatterns that matched from left to right. This code fragment gets arrays of codons in three reading frames:
@frame1 = $sequence =~ /(.{3})/g;
@frame2 = substr($sequence,1) =~ /(.{3})/g;
@frame3 = substr($sequence,2) =~ /(.{3})/g;
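
To see what the three arrays contain, the same idea can be written as a loop over the frame offsets. This is only a sketch: it reuses the sequence from the earlier example, and the @frames array of array references is an illustrative choice, not something from the original notes.

```perl
use strict;
use warnings;

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';

# Collect the codons for each of the three forward reading frames.
my @frames;
for my $offset (0 .. 2) {
    my @codons = substr($sequence, $offset) =~ /(.{3})/g;
    push @frames, \@codons;
}

# Print one frame per line.
for my $i (0 .. $#frames) {
    print "frame ", $i + 1, ": @{ $frames[$i] }\n";
}
```

Output:

frame 1: GTT GCC TGA AAT GGC GGA ACC TTG
frame 2: TTG CCT GAA ATG GCG GAA CCT TGA
frame 3: TGC CTG AAA TGG CGG AAC CTT GAA

Any bases left over at the end of a frame (fewer than three) are simply not matched.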

The position of the most recent match can be determined by using the pos function. It returns the offset (counting from 0) of the first character after the last global match.
Code:
#file:pos.pl
my $seq = "XXGGATCCXX";

if ( $seq =~ /(GGATCC)/gi ){
        my $pos = pos($seq);
        print "Our Sequence: $seq\n";
        print '$pos = ', "1st position after the match: $pos\n";
        print '$pos - length($1) = 1st position of the match: ',($pos-length($1)),"\n";
        print '($pos - length($1))-1 = 1st position before the match: ',($pos-length($1)-1),"\n";
}

Output:
~]$ ./pos.pl
Our Sequence: XXGGATCCXX
$pos = 1st position after the match: 8
$pos - length($1) = 1st position of the match: 2
($pos - length($1))-1 = 1st position before the match: 1
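
The pos function becomes more useful inside a while loop, where it can report every occurrence of a site rather than just the first. The following is a minimal sketch. The test sequence and the @starts array are invented for illustration.

```perl
use strict;
use warnings;

my $seq = 'GGATCCAAGGATCC';
my @starts;

# Each pass of the loop resumes the search where the previous match ended.
while ( $seq =~ /(GGATCC)/g ) {
    my $end   = pos($seq);            # offset just past the match
    my $start = $end - length($1);    # offset of the first matched base
    push @starts, $start;
    print "site at offset $start, ends before offset $end\n";
}
```

Output:

site at offset 0, ends before offset 6
site at offset 8, ends before offset 14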

Variable Interpolation and the "o" Modifier

If you use a variable inside a pattern template, as in /$pattern/, be aware that Perl may have to recompile the pattern each time it is evaluated, which carries a small performance penalty. If $pattern doesn't change over the life of the program, you can use the o ("once") modifier to tell Perl to compile the pattern only once. Be careful: with o in place, later changes to the variable are ignored. The program may run faster:
$codon = '.{3}';
@frame1 = $sequence =~ /($codon)/og;
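
An alternative worth knowing about is the qr// operator, which compiles a pattern once and stores it in a variable. Perl releases newer than these notes generally favor it over the o modifier, and recent versions also avoid recompiling a pattern whose interpolated variable has not changed, so treat any speed-up from o as something to measure rather than assume. A minimal sketch, reusing the sequence from the earlier examples:

```perl
use strict;
use warnings;

my $sequence = 'GTTGCCTGAAATGGCGGAACCTTGAA';

# qr// compiles the pattern once and returns a reusable regex object.
my $codon = qr/.{3}/;

# The compiled pattern can be interpolated into a larger pattern.
my @frame1 = $sequence =~ /($codon)/g;

print scalar(@frame1), " codons: @frame1\n";
```

Unlike o, a qr// pattern can still be replaced later by assigning a new value to $codon.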

Testing Your Regular Expressions

To be sure that you are getting what you think you want, you can use Perl's "magic" automatic match variables: $& holds the text that matched, $` holds the text before the match, and $' holds the text after it.
Code:
#file:matchTest.pl

if ("Hello there, neighbor" =~ /\s(\w+),/){
        print "That actually matched '$&'.\n";
        print "That was ($`) ($&) ($').\n";
}

Output:
That actually matched ' there,'.
That was (Hello) ( there,) ( neighbor).


Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Sat Oct 14 14:40:30 EDT 2000