Workshop

Sequence Similarity Analysis

James Tisdall

Genome Informatics

Suggested Reading

Chapter 5 of Beginning Perl for Bioinformatics.

Lecture Notes

Finding a pattern in a text is called "string matching" in computer science. In biology, the problem is finding a motif in sequence. Although Perl's success in bioinformatics is partly due to its excellent facilities for performing such searches of patterns in text, it is interesting and informative to attempt to write a program that accomplishes this on your own.

The subject of string matching is a fairly extensive part of the field of algorithm design and analysis. Many interesting solutions have been invented for many variations of the string matching problem. This exercise will get you thinking about the problem.

The variation of the problem you are challenged with here is known as "approximate string matching". In this simplified version, you are supposed to find the pattern even if the letters of the pattern may be separated by other letters in the text, up to a given value $gapmaximum (you may want to allow up to two extraneous characters for starters.) For instance, the pattern 'CSHL' would be said to exist at position 3 in the text 'ABECSHXXL'. (Remember two extraneous characters are allowed; and that the first letter 'A' is said to be at position 0.)

Problems

This problem may be done over the course of several workshops, depending on time. The lecturer will tell you which problems to attempt during the workshop!

  1. Write a script that finds the pattern 'CSHL' in the text 'CXYSHLABCSDHLEFCGSHILW'. There may be up to 2 (or $gapnumber) gaps between any two letters. Don't use builtin Perl features like pattern matching (e.g. $text =~ /CSHL/) or grep. Instead, write a program that examines each letter in the pattern and text. The program should print out the locations in the text where it finds the pattern.

    You may want to place the pattern and text into arrays and then iterate over the arrays (use for example @text = split('',$text)) or you may want to examine each position in the strings (use for example $letter = substr($text, $position, 1)). Any solution is a good solution.

    You may want to first think about the overall structure of your solution by using "pseudocode", an informal sketch of a program, such as the following:

    	Put the pattern and the text into scalars
    	Foreach position in the text
    	{
    		See if the pattern is there, print the location if it is
    	}
          

    You may want to expand this pseudocode to flesh out your approach before you start coding.

Answers

Here is an example of a script that does exact string matching. I'll post a solution that does approximate string matching in due time. ;>)

Your results will be substantially different from this. This is just an example.

################################################################################
# Find a pattern in a text
################################################################################
use warnings;
use strict;

my $text = 'CCSHLCXYSHLABCSDHLEFCGSHILMCSHLZZZ';
my $pattern = 'CSHL';
my $scale = '0123456789012345678901234567890123456789';

print "$text\n$scale\n";

# For each position in the text
for(my $t=0; $t <= length($text) ; ++$t) {
        # Look at the pattern one character at a time
	for(my $p=0; $p < length($pattern) ; $p++) {
                # If the characters in the text and the pattern don't match,
		#  exit this loop, restart with the next position in the text
		last if substr($text, $t+$p, 1) ne substr($pattern, $p, 1);
                # Otherwise the characters in the text and the pattern match
                # Eureka if this was the last position in the pattern
		if($p == length($pattern) - 1) {
			print "Found $pattern at position $t\n";
			last;
		}
	}
}

Here is a solution for this variation of the approximate string matching problem. Remember, there are many ways to program a solution, so your approach may look substantially different from that shown here, yet still be correct.

################################################################################
# strmatch
#
# Approximate string matching with insertions permitted:
# Finds pattern in text with insertions up to $gapmaximum in length
################################################################################
use warnings;
use strict;

my $text = 'CCSHLCXYSHLABCSDHLEFCGSHILMCSHLZZZ';
my $scale = '0123456789012345678901234567890123456789';
my $pattern = 'CSHL';
my $gapmaximum = 1;
print "$text\n$scale\n";

# For each position in the text
for(my $t=0; $t <= length($text)-length($pattern) ; ++$t) {
	# Look at the pattern one character at a time
	for(my $p=0, my $gapcount=0 ; $p < length($pattern) ; ) {
		# If the characters in the text and the pattern are the same
		if(substr($text,$t+$p+$gapcount,1) eq substr($pattern,$p,1) ) {
			# Eureka if this was the last position in the pattern
			if($p == length($pattern) - 1) {
				print "Found $pattern at position $t: ";
				print substr($text,$t,$p+$gapcount+1),"\n";
				last;
			# If pattern not done, go on to next position in pattern
			}else{
				$p++;
				$gapcount = 0;
				next;
			}
		# If the characters in the text and the pattern don't match
		}else{
			# Restart at next position in text if pattern not begun
			last if $p == 0;
			# Restart at next position in text if gap now too big 
			last if ++$gapcount > $gapmaximum;
			# Otherwise $gapcount has been increased
		}
	}
}

Genome Informatics


Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Sun Oct 20 15:36:27 EDT 2002