Subpatterns

You can extract and manipulate subpatterns in regular expressions.

To designate a subpattern, surround that part of the pattern with parentheses (the same parentheses used by the grouping operator). This example has just one subpattern, (.+):
 /Who's afraid of the big bad w(.+)f/

Matching Subpatterns

Once a subpattern matches, you can refer to it later within the same regular expression. The first subpattern becomes \1, the second \2, the third \3, and so on.

  while (<>) {
    chomp;
    print "I'm scared!\n" if /Who's afraid of the big bad w(.)\1f/
  }

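This loop will print "I'm scared!" for lines in which the two characters between "w" and "f" are identical, such as:

  Who's afraid of the big bad woof
  Who's afraid of the big bad weef

but not for:

  Who's afraid of the big bad wolf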

In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs love dog food", but not "dogs love monkey food".
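As a rough sketch of the backreference behavior described above, the following snippet runs the pattern against a few test sentences. The sentences and the variable names (@sentences, @matched) are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical test sentences; the second one does not repeat the captured word.
my @sentences = (
    'dogs love dog food',
    'dogs love monkey food',
    'cats love cat food',
);

my @matched;
for my $sentence (@sentences) {
    # \1 must match exactly the text that (\w+) captured.
    if ($sentence =~ /\b(\w+)s love \1 food\b/) {
        push @matched, $sentence;
        print "match: $sentence (captured '$1')\n";
    }
    else {
        print "no match: $sentence\n";
    }
}
```

Note that \1 matches the captured text itself, not the subpattern, which is why "monkey" is rejected even though it would satisfy \w+.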

Using Subpatterns Outside the Regular Expression Match

Outside the regular expression match statement, the matched subpatterns (if any) can be found in the variables $1, $2, $3, and so forth.

Example. Extract 50 base pairs upstream and 25 base pairs downstream of the TATTAT consensus transcription start site:

  
  while (<>) {
    chomp;
    next unless /(.{50})TATTAT(.{25})/;
    my $upstream = $1;
    my $downstream = $2;
  }

Extracting Subpatterns Using Arrays

If you assign a regular expression match to an array or a list of variables, the match is evaluated in list context and returns a list of all the subpatterns that matched. An alternative implementation of the previous example:
  
  while (<>) {
    chomp;
    my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
  }

If the regular expression doesn't match at all, it returns an empty list. Since assigning an empty list evaluates as FALSE, you can use the assignment in a logical test:

  
  while (<>) {
    chomp;
    next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
    print "upstream = $upstream\n";
    print "downstream = $downstream\n";  
  }
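The list-context behavior can be tried without an input file. The sketch below assumes two synthetic sequences ($long and $short, both made up here) and records the flanking regions only when the match succeeds.

```perl
use strict;
use warnings;

# Synthetic sequences, invented for this example.
my $long  = ('A' x 50) . 'TATTAT' . ('G' x 25);
my $short = 'ACGTTATTATACGT';   # too little flanking sequence to match

my %flanks;
for my $seq ($long, $short) {
    # In list context the match returns ($1, $2), or an empty list on failure.
    if (my ($upstream, $downstream) = $seq =~ /(.{50})TATTAT(.{25})/) {
        $flanks{$seq} = [$upstream, $downstream];
        print "upstream = $upstream\n";
        print "downstream = $downstream\n";
    }
    else {
        print "no match for sequence of length ", length($seq), "\n";
    }
}
```

The short sequence contains TATTAT but lacks 50 upstream characters, so the match fails and the else branch runs.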

Grouping without Making Subpatterns

Because parentheses are used both for grouping, as in (a|ab|c), and for capturing subpatterns, you may capture subpatterns that you don't want. To avoid this, group with (?:pattern):
  /big bad (?:wolf|sheep)/;

  # matches "big bad wolf" or "big bad sheep",
  # but doesn't extract a subpattern.
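To see how (?:pattern) affects what is captured, here is a small sketch comparing the two forms. The test string and the extra (\d+) subpattern are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical input string.
my $text = 'the big bad sheep ate 3 apples';

# Capturing group: the animal becomes $1 and the number becomes $2.
my @with_capture = $text =~ /big bad (wolf|sheep) ate (\d+)/;

# Non-capturing group: only the number is captured, so it becomes $1.
my @without_capture = $text =~ /big bad (?:wolf|sheep) ate (\d+)/;

print "capturing:     @with_capture\n";
print "non-capturing: @without_capture\n";
```

With the non-capturing form, the number of apples moves from $2 to $1, since only capturing parentheses are numbered.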

Subpatterns and Greediness

By default, regular expressions are "greedy". They try to match as much as they can. For example:

$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+x)/;
$subpattern = $1;

Because of the greediness of the match, $subpattern will contain "fox ate my box" rather than just "fox".

To match the minimum number of times, put a ? after the quantifier, like this:

$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+?x)/;
$subpattern = $1;

Now $subpattern will contain "fox". This is called lazy matching.

Lazy matching works with any quantifier, such as +?, *? and {2,50}?.
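The following sketch compares greedy and lazy forms of several quantifiers on the same string. The variable names and the o...o patterns are made up for this illustration.

```perl
use strict;
use warnings;

my $h = 'The fox ate my box of doughnuts';

# Each match is assigned in list context, so the variable receives $1.
my ($greedy_plus)  = $h =~ /(f.+x)/;          # as much as possible
my ($lazy_plus)    = $h =~ /(f.+?x)/;         # as little as possible
my ($lazy_star)    = $h =~ /(f.*?x)/;
my ($greedy_range) = $h =~ /(o.{2,50}o)/;
my ($lazy_range)   = $h =~ /(o.{2,50}?o)/;

print "greedy +      : $greedy_plus\n";
print "lazy +?       : $lazy_plus\n";
print "lazy *?       : $lazy_star\n";
print "greedy {2,50} : $greedy_range\n";
print "lazy {2,50}?  : $lazy_range\n";
```

In every case the match still starts at the leftmost possible position; laziness only changes how far the quantified part extends from there.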



Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Tue Oct 12 18:03:46 EDT 1999