Then, I'll introduce the Bioperl project. We'll see some of the Perl modules it has to offer for bioinformatics data processing and how to use them.
Christiansen and Torkington, Perl Cookbook, Chapter 12, "Packages, Libraries, and Modules". Also, Chapter 13, "Classes, Objects, and Ties". [ http://www.oreilly.com/catalog/cookbook/ ]
| perlmod | Perl modules (packages and symbol tables) |
| perlref | Perl references and nested data structures |
| perlobj | Perl objects |
| perltoot | Tom's object-oriented tutorial for perl |
| perlbot | Bag'o Object Tricks (advanced stuff) |
Websites:
| genome-www.stanford.edu/perlOOP | A little collection I put together as I was learning Object-oriented Perl programming. See especially the examples page. |
| www.perl.com/CPAN/CPAN.html | CPAN homepage |
| search.cpan.org | Search CPAN for a module by author, category, or module name |
| bioperl.org | The Bioperl project - modules for bioinformatics |
Using Perl Modules:
Introduction to Bioperl:
Perl didn't always have modules and objects (they were introduced with Perl5). There are many "old-timers" who get by without modules and objects, designing scripts in so-called "procedural" style. So why should you consider using modules and objects?
For example, here is probably the world's shortest Fasta-to-EMBL sequence format converter:
use Bio::SeqIO;
$in = Bio::SeqIO->newFh(-file => shift, '-format' => 'Fasta');
$out = Bio::SeqIO->newFh(-fh => \*STDOUT, '-format' => 'EMBL');
print $out $_ while <$in>;
A Module can be a simple collection of subroutines, not defining a class. Such modules are referred to as libraries.
A Class is a user-defined data type. Classes are defined in modules using special constructs that allow you to create variables of that type. The subroutines in these modules are referred to as methods.
An Object is a variable that belongs to a particular class. It contains its own, private data that can be operated on by the methods defined in the class. The object is called an instance of its class.
Example:
What's up with all the double colons (::)? The :: corresponds to a filesystem path separator. A Module called Shape::Rect corresponds to a file named Shape/Rect.pm relative to a module library directory.
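The name-to-file mapping can be sketched in a few lines. The helper name module_to_path is hypothetical (it is not part of Perl); %INC is Perl's own record of the files it has loaded, keyed by the same relative path.

```perl
use strict;
use warnings;
use File::Basename ();    # a core module, loaded only so it appears in %INC

# Hypothetical helper: turn a module name into the relative path
# that Perl looks for in each directory listed in @INC.
sub module_to_path {
    my ($module) = @_;
    ( my $path = $module ) =~ s{::}{/}g;
    return "$path.pm";
}

print module_to_path('Shape::Rect'), "\n";    # Shape/Rect.pm

# %INC is keyed by the same relative path for every loaded module.
my $key = module_to_path('File::Basename');
print "$key was loaded from $INC{$key}\n";
```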
Here's the familiar hash data structure:
%hash = ( color => 'green',
size => '17' );
print "Color is $hash{color}\n";
Here's a hash reference data structure:
$hash_ref = { color => 'green',
size => '17' };
# This is essentially equivalent to $hash_ref = \%hash;
print "Color is $hash_ref->{color}\n";
Here's an object data structure:
$object = Foo->new( -color=> 'green',
-size => '17' );
print "Color is ", $object->get_color(), "\n";
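So an object is little more than a hash reference that knows which module (Foo.pm) defines the methods that can be called on it, such as get_color(). The constructor new() is the method that creates this reference and associates it with the class.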
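To make this concrete, here is a minimal sketch of what a Foo class could look like. The document does not show Foo.pm, so the body of the class is an assumption; only the calls new() and get_color() and the -color and -size arguments come from the example above. The class is defined inline here rather than in a separate Foo.pm file so the sketch runs on its own.

```perl
use strict;
use warnings;

package Foo;

# Constructor: store the named arguments in a hash and bless a
# reference to it, which associates the data with the Foo package.
sub new {
    my ( $class, %args ) = @_;
    my $self = {
        color => $args{-color},
        size  => $args{-size},
    };
    return bless $self, $class;
}

# Accessor: $self is the blessed hash reference.
sub get_color {
    my ($self) = @_;
    return $self->{color};
}

package main;

my $object = Foo->new( -color => 'green', -size => '17' );
print "Color is ", $object->get_color(), "\n";
print "Object belongs to class ", ref($object), "\n";
```

The call to bless is what distinguishes $object from the plain $hash_ref shown earlier: ref() reports Foo rather than HASH, and method calls are looked up in the Foo package.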
An object (or module) has a public interface that allows users to call its subroutines without worrying about exactly how the operation is carried out. The interface consists of well-documented methods and is often referred to as a contract between the developer and the user.
Methods beginning with an underscore (e.g., _initialize()) are considered private and should not be called by external users, although Perl cannot prevent this from happening.
An object (or module) contains all the subroutines and data necessary to perform as a well-defined unit.
use Getopt::Long;
die "Usage: $0 [-b] [-user name]\n" unless @ARGV;
my( $binary, $username );
GetOptions( "b" => \$binary,
"user=s" => \$username );
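The same option specification can be tried without a real command line by using GetOptionsFromArray, which Getopt::Long exports on request. The argument list below is a made-up stand-in for @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical command line, standing in for @ARGV.
my @args = ( '-b', '-user', 'alice', 'input.txt' );

my ( $binary, $username );
GetOptionsFromArray(
    \@args,
    "b"      => \$binary,      # boolean flag
    "user=s" => \$username,    # option that requires a string value
) or die "Bad options\n";

print "binary flag: ", ( $binary ? "on" : "off" ), "\n";
print "user: $username\n";
print "remaining arguments: @args\n";
```

Recognized options are removed from the array, so whatever is left afterwards (here the file name) can be processed as ordinary arguments.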
#!/usr/bin/perl -w
# Tell Perl we want to use the following modules:
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use strict; # always a good idea
# Get data from @ARGV or die with a usage string.
my $url = shift or die "Usage: $0 URL\n";
# Create an instance of a LWP::UserAgent object
my $ua = LWP::UserAgent->new();
# Set some data on the UserAgent (the agent() method is a "setter")
$ua->agent("Schmozilla/v9.14 Platinum");
# Create an instance of a HTTP::Request object
my $req = HTTP::Request->new( GET => $url );
# Submit the HTTP::Request object to the UserAgent object
# and get the response as another object (HTTP::Response)
my $response = $ua->request( $req );
# Then the $response object can be interrogated for its contents.
# (See recipe 20.1 in the Perl Cookbook for how)
Always check CPAN before embarking on a major Perl software development effort to see if there isn't already a module that does what you want to do. (Cross-fertilization between fields).
Recipes 12.7 and 12.17 in the Perl Cookbook have some useful tips for installing and using CPAN modules.
There are several sources of sample Bioperl code.
perldoc Bio::Seq
These are also available via the web at http://bio.perl.org/Core/Latest/modules.html. Perldoc can also be run on a module file directly (e.g., perldoc ~/perl/lib/Bio/Seq.pm), useful when the module isn't installed on your system.
For your convenience, a copy of the distribution directory is available locally: bioperl-0.6.2
In this case, Child is a class that inherits functionality from Parent1, and Parent1 is a class that inherits functionality from GrandParent. Child is often called the subclass of Parent1, which is the superclass of Child. GrandParent is a superclass of both Parent1 and Child.
Child also inherits functionality from Parent2, so it has two direct superclasses. This is a case of multiple inheritance, which is permitted in Perl.
The diagram also indicates that Child can contain a reference to an object of type ContainedByChild.
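The relationships described above can be sketched in Perl with the @ISA array, which lists a class's superclasses. The class names come from the diagram; the methods greet() and describe() and the part key are hypothetical, added only to show where each call is resolved.

```perl
use strict;
use warnings;

package GrandParent;
sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub greet { return "greet() from GrandParent" }

package Parent1;
our @ISA = ('GrandParent');    # Parent1 inherits from GrandParent

package Parent2;
sub describe { return "describe() from Parent2" }

package ContainedByChild;
sub new { my ($class) = @_; return bless {}, $class }

package Child;
our @ISA = ( 'Parent1', 'Parent2' );    # multiple inheritance

package main;

# Child holds a reference to a ContainedByChild object (containment).
my $child = Child->new( part => ContainedByChild->new() );
print $child->greet(),    "\n";    # found via Parent1 in GrandParent
print $child->describe(), "\n";    # found in Parent2
print "Child is a GrandParent\n" if $child->isa('GrandParent');
print "Contains a ", ref( $child->{part} ), "\n";
```

Note the difference between the two kinds of arrow in the diagram: inheritance is declared once per class through @ISA, while containment is just a reference stored in the object's data.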
For more information about UML, an established standard, see http://directory.google.com/Top/Computers/Software/Object_Oriented/Methodologies/UML/.
use Bio::PrimarySeq;
# Create a new PrimarySeq object
$seqobj = Bio::PrimarySeq->new(-seq => 'ACTGTGGCGTCAACTG',
-moltype => 'dna',
-id => 'Primer-22301');
# Get a substring from it
$substring = $seqobj->subseq( 2, 4 );
Bio::Seq
Bio::SeqIO
use Bio::SeqIO;
# Note that we don't have to use Bio::Seq in order to work with
# Bio::Seq objects.
# To create a Bio::Seq object you must first create a SeqIO object
$seqio = Bio::SeqIO->new ( '-format' => 'Fasta' ,
'-file' => 'myfile.fasta');
# Get the first sequence in the file.
$seqobj = $seqio->next_seq();
# Get the actual sequence as a string
$seq = $seqobj->seq();
# Bio::Seq also inherits the same subseq() method from PrimarySeq
$seqstr = $seqobj->subseq(10,50);
# Iterate through the remaining sequences in the file
while ( my $seq = $seqio->next_seq() ) {
# Get the sequence as a string for some analysis
my ($id, $str) = ($seq->display_id, $seq->seq);
}
Bio::SeqFeatureI
Bio::Seq objects also may have features attached to them. Bioperl features implement the methods defined by Bio::SeqFeatureI.
# Get top level features
@features = $seqobj->top_SeqFeatures();
# Descend into sub features (features may themselves have features)
@features = $seqobj->all_SeqFeatures();
# Get an Annotation object
$ann = $seqobj->annotation();
Annotation objects are a recent addition to Bioperl and provide:
The Bio::Tools::RestrictionEnzyme module represents a restriction endonuclease. It can conceptually "cut" a Bio::Seq into fragments based on the specificity of an actual enzyme. It allows for the construction of objects representing about 150 different REs. There's also a mechanism for defining the recognition sequence when creating the object, to cover new enzymes.
use Bio::Seq;
use Bio::Tools::RestrictionEnzyme;
# Get a sequence string from somewhere.
$sequence = get_sequence();
# Create a Bio::Seq object using the string
$seq = Bio::Seq->new( -ID => 'test_seq',
-SEQ => $sequence);
$re = Bio::Tools::RestrictionEnzyme->new( -name => 'EcoRI' );
## Cut the sequence with the restriction enzyme object.
@fragments = $re->cut_seq( $seq );
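To illustrate what "cutting" means conceptually, here is a plain-Perl sketch that does not use Bioperl. The helper cut_at_site and the test sequence are hypothetical, and this is not how Bio::Tools::RestrictionEnzyme is implemented; it only handles one strand of a linear sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $seq at every occurrence of $site,
# $offset bases into the recognition site.
sub cut_at_site {
    my ( $seq, $site, $offset ) = @_;
    my @fragments;
    my $start = 0;    # where the current fragment begins
    my $pos   = 0;    # where to resume searching
    while ( ( my $hit = index( $seq, $site, $pos ) ) >= 0 ) {
        my $cut = $hit + $offset;
        push @fragments, substr( $seq, $start, $cut - $start );
        $start = $cut;
        $pos   = $hit + 1;
    }
    push @fragments, substr( $seq, $start );    # trailing fragment
    return @fragments;
}

# EcoRI recognizes GAATTC and cuts after the G (G^AATTC).
my @fragments = cut_at_site( 'AAGAATTCTTGAATTCGG', 'GAATTC', 1 );
print join( "\n", @fragments ), "\n";
```

For this sequence the sketch prints three fragments: AAG, AATTCTTG, and AATTCGG.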
Example usage:
sortnums.pl 32 8 5 12 302 - output: 5 8 12 32 302
sortnums.pl -dec 32 8 5 12 302 - output: 302 32 12 8 5
sortnums.pl -fact 10 2 4 6 - output: 20 40 60
sortnums.pl -h - output: sortnums.pl [-rev] [-fact N] [-h] list-of-numbers
use Tools::Foo;
my $f = Tools::Foo->new( -bar => 234 );
my $b = $f->get_bar();
| 1 - method call | A - $b |
| 2 - object instance | B - Tools::Foo |
| 3 - method argument | C - $f |
| 4 - name of directory containing module | D - Tools |
| 5 - name of class | E - Tools::Foo->new() |
| 6 - object creation | F - get_bar() |
| 7 - method inside of module | G - -bar => 234 |
| 8 - holds object data | H - $f->get_bar() |
data/worm/proteins.fasta
data/yeast/orf_trans.fasta (protein)
data/yeast/orf_coding.fasta (DNA)
data/hd/dna/HD_PATIENT1.fasta
data/hd/dna/HD_PATIENT2.fasta
data/hd/dna/HD_HUMAN.fasta (normal)
(For this problem, feel free to examine the examples/restriction.pl script in the Bioperl distribution directory).
Extra credit: