You can extract and manipulate subpatterns in regular expressions.
To designate a subpattern, surround its part of the pattern with parentheses (the same ones used for grouping). This example has just one subpattern, (.+):
/Who's afraid of the big bad w(.+)f/
Once a subpattern matches, you can refer to it later within the same regular expression. The first subpattern becomes \1, the second \2, the third \3, and so on.
while (<>) {
  chomp;
  print "I'm scared!\n" if /Who's afraid of the big bad w(.)\1f/;
}
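This loop prints "I'm scared!" for any line in which the same character appears twice in a row between the "w" and the "f", for example "Who's afraid of the big bad woof", but not for "Who's afraid of the big bad wolf".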
In a similar vein, /\b(\w+)s love \1 food\b/ will match "dogs love dog food", but not "dogs love monkey food".
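The backreference behaviour can be checked with a small script. This is only a sketch: the @phrases list is made up for illustration, and the third phrase is an extra test case that does not appear in the text above.

```perl
use strict;
use warnings;

# Hypothetical test phrases, made up for this illustration.
my @phrases = (
    "dogs love dog food",
    "dogs love monkey food",
    "cats love cat food",
);

# Keep only the phrases in which \1 repeats the text captured by (\w+).
my @matched = grep { /\b(\w+)s love \1 food\b/ } @phrases;

print "$_\n" for @matched;
# prints:
#   dogs love dog food
#   cats love cat food
```

Note that (\w+) first grabs "dogs" and then backtracks to "dog" so that the literal "s" in the pattern can match.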
Outside the regular expression match statement, the matched subpatterns (if any) can be found in the variables $1, $2, $3, and so forth.
Example. Extract 50 base pairs upstream and 25 base pairs downstream of the TATTAT consensus transcription start site:
while (<>) {
  chomp;
  next unless /(.{50})TATTAT(.{25})/;
  my $upstream   = $1;
  my $downstream = $2;
}
If you assign a regular expression match to a list or an array, the match is evaluated in list context and returns a list of all the subpatterns that matched. An alternative implementation of the previous example:
while (<>) {
  chomp;
  my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
}
If the regular expression doesn't match at all, then it returns an empty list. Since an empty list is FALSE, you can use it in a logical test:
while (<>) {
  chomp;
  next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
  print "upstream = $upstream\n";
  print "downstream = $downstream\n";
}
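To try the list-assignment test without an input file, here is a self-contained sketch. The two sequences are invented for illustration, and the flanking lengths are shortened from 50 and 25 to 5 and 3 so the result is easy to verify by eye.

```perl
use strict;
use warnings;

# Made-up sequences for illustration only. The first contains TATTAT
# with enough flanking bases; the second does not contain it at all.
my @seqs = ('GGCCATATTATGCAGG', 'GGCCAGGGCCAGG');

my @report;
for my $seq (@seqs) {
    # In the if() test, the list assignment is true only when the
    # match returned at least one subpattern.
    if (my ($upstream, $downstream) = $seq =~ /(.{5})TATTAT(.{3})/) {
        push @report, "$upstream|$downstream";
    }
    else {
        push @report, 'no match';
    }
}

print "$_\n" for @report;
# prints:
#   GGCCA|GCA
#   no match
```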
Because parentheses are used both for grouping, as in (a|ab|c), and for capturing subpatterns, you may capture subpatterns that you don't want to. To avoid this, group with (?:pattern):
/big bad (?:wolf|sheep)/;  # matches "big bad wolf" or "big bad sheep",
                           # but doesn't extract a subpattern.
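One practical consequence is that a (?:...) group does not use up a capture number. The sketch below illustrates this; the extra (big|small) alternation is an assumption added for the example and is not part of the pattern above.

```perl
use strict;
use warnings;

my $text = 'big bad wolf';   # sample string, chosen for illustration

# Both groups capture: the first becomes $1, the second $2.
my ($size, $animal) = $text =~ /(big|small) bad (wolf|sheep)/;

# The first group only groups, so the animal is now the first capture.
my ($only) = $text =~ /(?:big|small) bad (wolf|sheep)/;

print "size = $size, animal = $animal\n";   # size = big, animal = wolf
print "only = $only\n";                     # only = wolf
```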
By default, regular expressions are "greedy". They try to match as much as they can. For example:
$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+x)/;
$subpattern = $1;
Because of the greediness of the match, $subpattern will contain "fox ate my box" rather than just "fox".
To match the minimum number of times, put a ? after the quantifier, like this:
$h = 'The fox ate my box of doughnuts';
$h =~ /(f.+?x)/;
$subpattern = $1;
Lazy matching works with any quantifier, such as +?, *? and {2,50}?.
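The following sketch compares the greedy and lazy forms of the three quantifiers side by side on the same string. The patterns for *? and {2,50}? are chosen here purely for illustration.

```perl
use strict;
use warnings;

my $h = 'The fox ate my box of doughnuts';

# Each pair runs the same pattern, first greedy, then lazy.
my ($greedy_plus)  = $h =~ /(f.+x)/;         # "fox ate my box"
my ($lazy_plus)    = $h =~ /(f.+?x)/;        # "fox"

my ($greedy_star)  = $h =~ /(o.*o)/;         # "ox ate my box of do"
my ($lazy_star)    = $h =~ /(o.*?o)/;        # "ox ate my bo"

my ($greedy_range) = $h =~ /(\w{2,50})/;     # "The"
my ($lazy_range)   = $h =~ /(\w{2,50}?)/;    # "Th"

print "$_\n" for $greedy_plus, $lazy_plus,
                 $greedy_star, $lazy_star,
                 $greedy_range, $lazy_range;
```

In every case the match still starts at the leftmost possible position; the ? only changes how far the quantifier extends from there.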