Chapter 5 of Beginning Perl for Bioinformatics.
Finding a pattern in a text is called "string matching" in computer science. In biology, the problem is finding a motif in sequence. Although Perl's success in bioinformatics is partly due to its excellent facilities for performing such searches of patterns in text, it is interesting and informative to attempt to write a program that accomplishes this on your own.
The subject of string matching is a fairly extensive part of the field of algorithm design and analysis. Many interesting solutions have been invented for many variations of the string matching problem. This exercise will get you thinking about the problem.
The variation of the problem you are challenged with here is known as "approximate string matching". In this simplified version, you are supposed to find the pattern even if the letters of the pattern may be separated by other letters in the text, up to a given value $gapmaximum (you may want to allow up to two extraneous characters for starters.) For instance, the pattern 'CSHL' would be said to exist at position 3 in the text 'ABECSHXXL'. (Remember two extraneous characters are allowed; and that the first letter 'A' is said to be at position 0.)
This problem may be done over the course of several workshops, depending on time. The lecturer will tell you which problems to attempt during the workshop!
You may want to place the pattern and text into arrays and then iterate over the arrays (use for example @text = split('',$text)) or you may want to examine each position in the strings (use for example $letter = substr($text, $position, 1)). Any solution is a good solution.
You may want to first think about the overall structure of your solution by using "pseudocode", an informal sketch of a program, such as the following:
Put the pattern and the text into scalars
Foreach position in the text
{
See if the pattern is there, print the location if it is
}
You may want to expand this pseudocode to flesh out your approach before you start coding.
Here is an example of a script that does exact string matching. I'll post a solution that does approximate string matching in due time. ;>)
Your results will be substantially different from this. This is just an example.
################################################################################
# Find a pattern in a text
################################################################################
use warnings;
use strict;
my $text = 'CCSHLCXYSHLABCSDHLEFCGSHILMCSHLZZZ';
my $pattern = 'CSHL';
my $scale = '0123456789012345678901234567890123456789';
print "$text\n$scale\n";
# For each position in the text
for(my $t=0; $t <= length($text) ; ++$t) {
# Look at the pattern one character at a time
for(my $p=0; $p < length($pattern) ; $p++) {
# If the characters in the text and the pattern don't match,
# exit this loop, restart with the next position in the text
last if substr($text, $t+$p, 1) ne substr($pattern, $p, 1);
# Otherwise the characters in the text and the pattern match
# Eureka if this was the last position in the pattern
if($p == length($pattern) - 1) {
print "Found $pattern at position $t\n";
last;
}
}
}
Here is a solution for this variation of the approximate string matching problem. Remember, there are many ways to program a solution, so your approach may look substantially different from that shown here, yet still be correct.
################################################################################
# strmatch
#
# Approximate string matching with insertions permitted:
# Finds pattern in text with insertions up to $gapmaximum in length
################################################################################
use warnings;
use strict;
my $text = 'CCSHLCXYSHLABCSDHLEFCGSHILMCSHLZZZ';
my $scale = '0123456789012345678901234567890123456789';
my $pattern = 'CSHL';
my $gapmaximum = 1;
print "$text\n$scale\n";
# For each position in the text
for(my $t=0; $t <= length($text)-length($pattern) ; ++$t) {
# Look at the pattern one character at a time
for(my $p=0, my $gapcount=0 ; $p < length($pattern) ; ) {
# If the characters in the text and the pattern are the same
if(substr($text,$t+$p+$gapcount,1) eq substr($pattern,$p,1) ) {
# Eureka if this was the last position in the pattern
if($p == length($pattern) - 1) {
print "Found $pattern at position $t: ";
print substr($text,$t,$p+$gapcount+1),"\n";
last;
# If pattern not done, go on to next position in pattern
}else{
$p++;
$gapcount = 0;
next;
}
# If the characters in the text and the pattern don't match
}else{
# Restart at next position in text if pattern not begun
last if $p == 0;
# Restart at next position in text if gap now too big
last if ++$gapcount > $gapmaximum;
# Otherwise $gapcount has been increased
}
}
}
Genome Informatics
Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory Last modified: Sun Oct 20 15:36:27 EDT 2002