Steve Chervitz - CSHL Genome Informatics

Using Perl Modules

Steve A. Chervitz
Neomorphic, Inc. / The Bioperl Project
sac@neomorphic.com

Introduction

Perl modules provide a powerful way to extend the core features of the Perl language. They are easy to use and are not that hard to create. In this lecture, I'll first cover some basic information about modules and objects that will empower you to become fearless modules users.

Then, I'll introduce the Bioperl project. We'll see some of the Perl modules it has to offer for bioinformatics data processing and how to use them.

Suggested Reading

Christiansen and Torkington, Perl Cookbook, Chapter 12, "Packages, Libraries, and Modules". Also, Chapter 13, "Classes, Objects, and Ties". [ http://www.oreilly.com/catalog/cookbook/ ]

Man Pages:
perlmod Perl modules (packages and symbol tables)
perlref Perl references and nested data structures
perlobj Perl objects
perltoot Tom's object-oriented tutorial for perl
perlbot Bag'o Object Tricks (advanced stuff)

Websites:
genome-www.stanford.edu/perlOOP A little collection I put together as I was learning Object-oriented Perl programming. See especially the examples page.
www.perl.com/CPAN/CPAN.html CPAN homepage
search.cpan.org Search CPAN for a module by author, category, or module name
bioperl.org The Bioperl project - modules for bioinformatics

Lecture Notes

Using Perl Modules:

Introduction to Bioperl:

Problem Sets

Why use Perl modules and objects?

Perl didn't always have modules and objects (they were introduced with Perl5). There are many "old-timers" that can get by without using modules and objects, designing scripts in so-called "procedural" style. So why should you consider using modules and objects?

What is a module? What is a class? Object?

A Module is a collection of related subroutines and data designed to be used by other programs or modules. Modules are defined in separate files and use the package keyword to define a namespace for the data and subroutines in that module file.

A Module can be a simple collection of subroutines, not defining a class. Such modules are referred to as libraries.

A Class is a user-defined data type. Classes are defined in modules using special constructs that allow you to create variables of that type. The subroutines in these modules are referred to as methods.

An Object is a variable that belongs to a particular class. It contains its own, private data that can be operated on by the methods defined in the class. The object is called an instance of its class.

Example:

What's up with all the double colons (::)? The :: corresponds to a filesystem path separator. A Module called Shape::Rect corresponds to a file named Shape/Rect.pm relative to a module library directory.

From references to objects

A Perl object is just another data structure, nearly identical to a hash reference. You can use an object as a convenient way to store data.

Here's the familiar hash data structure:

         %hash = ( color => 'green',
                   size  => '17' );

         print "Color is $hash{color}\n";
      

Here's a hash reference data structure:

         $hash_ref = { color => 'green',
                       size  => '17' };

         # This is essentially equivalent to $hash = \%hash_ref;

         print "Color is $hash_ref->{color}\n";
      

Here's a object data structure:

         $object = Foo->new( -color=> 'green',
                             -size => '17' );

         print "Color is ", $object->get_color(), "\n";
      

So an object is nothing but a hash reference that knows how to call methods (new()) defined in a particular module (Foo.pm).

Object-oriented concepts

Using library modules: Getopt

Parsing of command-line arguments is such a common task that Perl comes with some standard modules for doing this: Getopt::Std and Getopt::Long. Recipe 15.1 in the Perl Cookbook shows how to use these. The basic idea is as follows:
        use Getopt::Long;

        die "Usage: $0 [-b] [-user name]\n" unless @ARGV;

        my( $binary, $username );

        GetOptions( "b"       => \$binary,
                    "user=s"  => \$username );

      

Using object-oriented modules: LWP

The LWP suite is an indispensible collection of modules for working with data from the internet. The modules are of the object-oriented type, so you create objects calling the new() method and then call various methods on the object you get back.
        #!/usr/bin/perl -w

        # Tell Perl we want to use the following modules:
        use LWP::UserAgent;
        use HTTP::Request;
        use HTTP::Response;
        use strict;          # always a good idea

        # Get data from @ARGV or die with a usage string.

        my $url = shift or die "Usage: $0 URL\n";

        # Create an instance of a LWP::UserAgent object

        my $ua = LWP::UserAgent->new();

        # Set some data on the UserAgent (the agent() method is a "setter")

        $ua->agent("Schmozilla/v9.14 Platinum");

        # Create an instance of a HTTP::Request object

        my $req = HTTP::Request->new( GET => $url );

        # Submit the HTTP::Request object to the UserAgent object
        # and get the response as another object (HTTP::Response)

        my $response = $ua->request( $req );

        # Then the $response object can be interrogated for its contents.
        # (See recipe 20.1 in the Perl Cookbook for how)
      

Other useful modules

Where can I find useful modules?

Always check CPAN before embarking on a major Perl software development effort to see if there isn't already a module that does what you want to do. (Cross-fertilization between fields).

Recipes 12.7 and 12.17 in the Perl Cookbook have some useful tips for installing and using CPAN modules.

Introducing The Bioperl Project

Goals:

Documentation for Bioperl modules

There is a lot of code in the Bioperl arsenal. This can be intimidating at first. An excellent way to learn about a module is to see working examples.

There are several sources of sample bioperl code.

  1. perldoc, e.g., execute at a shell prompt:

    perldoc Bio::Seq

    These are also available via the web at http://bio.perl.org/Core/Latest/modules.html. Perldoc can also be run on a module file directly (e.g., perldoc ~/perl/lib/Bio/Seq.pm), useful when the module isn't installed on your system.

  2. The examples directory of the Bioperl distribution.

  3. The t (test) directory of the Bioperl distribution.

For your convenience, a copy of the distribution directory is available locally: bioperl-0.6.2

Describing Object Relationships

Inheritance vs. composition. A class can inherit functionality (i.e., methods) from other classes. It can also contain objects beloging to other classes. Object relationships are most conveniently described using UML diagrams as shown below.

In this case, Child is a class that inherits functionality from Parent1, and Parent1 is a class that inherits functionality from GrandParent. Child is often called the subclass of Parent1, which is the superclass of Child. GrandParent is a superclass of both Parent1 and Child.

Child also inherits functionality from Parent2, so it has two superclasses (this is a case of multiple inheritance) which is permitted in Perl.

The diagram also indicates that Child can contain a reference to an object of type ContainedByChild.

For more information about UML, an established standard, see http://directory.google.com/Top/Computers/Software/Object_Oriented/Methodologies/UML/.

Key Bioperl objects

Bio::PrimarySeq
     use Bio::PrimarySeq;

     # Create a new PrimarySeq object

     $seqobj = Bio::PrimarySeq->new(-seq     => 'ACTGTGGCGTCAACTG', 
                                    -moltype => 'dna', 
                                    -id      => 'Primer-22301');

     # Get a substring from it

     $substring = $seqobj->subseq( 2, 4 );
      

Bio::Seq

Bio::SeqIO

    use Bio::SeqIO;

    # Note that we don't have to use Bio::Seq in order to work with
    # Bio::Seq objects.

    # To create a Bio::Seq object you must first create a SeqIO object

    $seqio  = Bio::SeqIO->new ( '-format' => 'Fasta' , 
                                '-file'   => 'myfile.fasta');

    # Get the first sequence in the file.
    $seqobj = $seqio->next_seq();

    # Get the actual sequence as a string
    $seq      = $seqobj->seq(); 

    # Bio::Seq also inherits the same subseq() method from PrimarySeq
    $seqstr   = $seqobj->subseq(10,50);    

    # Iterate through the remaining sequences in the file

    while ( my $seq = $seqio->next_seq() ) {
        # Get the sequence as a string for some analysis
        my ($id, $str) = $seq->display_id, $seq->seq;
    }
      

Bio::SeqFeatureI

Bio::Seq objects also may have features attached to them. Bioperl features implement the methods defined by Bio::SeqFeatureI.

    # Get top level features

    @features = $seqobj->top_SeqFeatures(); 

    # Descend into sub features (features may themselves have features)

    @features = $seqobj->all_SeqFeatures(); 

    # Get an Annotation object
    $ann = $seqobj->annotation(); 
    

Annotation objects are a recent addition to Bioperl and provide:

Survey of Bioperl Tools

Bioperl Restriction Enzyme object

The Bio::Tools::RestrictionEnzyme module represents a restriction endonuclease. It can conceptually "cut" a Bio::Seq into fragments based on specificity of an actual enzyme. It allows for the construction of objects representing about 150 different REs. There's also a mechanism for defining the recognition sequence when creating the object, to cover new enzymes.

     use Bio::Seq;
     use Bio::Tools::RestrictionEnzyme;

     # Get a sequence string from somewhere.
     $sequence = get_sequence();  

     # Create a Bio::Seq object using the string
     $seq = new Bio::Seq( -ID  => 'test_seq', 
	  		  -SEQ => $sequence); 
    
     $re  = new Bio::Tools::RestrictionEnzyme( -name => 'EcoRI' );
  
     ## Cut the sequence with the restriction enzyme object.
     @fragments = $re->cut_seq( $seq );

    

How can I get involved in Bioperl?

Using Perl Modules - Problem Set

Perl Module Exercises

  1. Write a script called sortnums.pl that takes as input a list of unordered numbers on the command line and prints them out sorted in either ascending or descending order. The script should be able to accept two optional arguments which it obtains using the Getopt::Long module:

    Example usage:

      sortnums.pl 32 8 5 12 302        - output: 5 8 12 32 302
      sortnums.pl -dec 32 8 5 12 302   - output: 302 32 12 8 5
      sortnums.pl -fact 10 2 4 6       - output: 20 40 60
      sortnums.pl -h                   - output: sortnums.pl [-rev] [-fact N] [-h] list-of-numbers
    	

  2. Connect the columns (i.e, produce the correct pairing from the two columns in the table below such as 1-C, 2-G) based on this hypothetical code snipped):
             use Tools::Foo;
    
             my $f = Tools::Foo->new( -bar => 234 );
    
             my $b = $f->get_bar();
    	
    1 - method call A - $b
    2 - object instance B - Tools::Foo
    3 - method argument C - $f
    4 - name of directory containing module D - Tools
    5 - name of class E - Tools::Foo->new()
    6 - object creation F - get_bar()
    7 - method inside of module G - -bar => 234
    8 - holds object data H - $f->get_bar()

Bioperl Exercises

  1. Collect some statistics about the set of sequences in a genome. Pick one of the following multi-sequence fasta files and use the Bio::SeqIO modules to collect the data described below.
    	data/worm/proteins.fasta    
    	data/yeast/orf_trans.fasta   (protein)
    	data/yeast/orf_coding.fasta  (DNA) 
    1. How many sequences total per file?
    2. Determine the lengths of the longest/shortest sequences in the file.
    3. How many sequences over length 1000?

  2. Use the Bio::Tools::SeqStats module to collect molecular weight data for the longest and shortest protein sequences as determined in problem 1b.

  3. A local human genetics laboratory down the road heard that we had some cool software and wants to hire us to do a DNA diagnostic test for them. We say, "No problem! Send over the data." They have some patients presenting symptoms suggestive of Huntington's Disease but they way to know if their HD gene is characteristically altered (mutated), which would strengthen the diagnosis. They have given us give us HD gene sequences from two individuals located at:
          data/hd/dna/HD_PATIENT1.fasta
          data/hd/dna/HD_PATIENT2.fasta 
          data/hd/dna/HD_HUMAN.fasta (normal)
    	

    1. Determine lengths of each sequence.
    2. Conceptually digest each sequence with PstI using the Bio::Tools::RestrictionEnzyme.pm module. How many fragments are obtained?
    3. The HD mutation of interest is a repeat expansion within the 5' end of the gene, leading to a up-shift in the size of the first PstI fragment (which is normally around 1KB) of roughly 100-200 base pairs. Do the data support the possibility of either patient having HD?
    4. Determine the number of glutamine codon repeats in the first PstI fragment from both sequences. Approx 40-60 is normal, over 80 is highly diagnostic.

    (For this problem, feel free to examine the examples/restriction.pl script in the Bioperl distribution directory).

Extra credit:

  1. Use pmdesc to find out what modules are installed on shrub.
  2. Use the LWP and related modules to download the contents of a web page of interest and save it to a local .html file.
  3. The sequence data files used in problem 1 are from last year (I didn't get a chance to update them - give me a break!). Using your new-found skills with Web-based genome resources, download the updated versions of these files and compare your numbers. Are there any differences for these "completed" genome sequences?
  4. Extra credit: Create a histogram of sequence lengths at 100-residue intervals. Is the distribution normal?

  5. Using the Bio::DB::GenBank.pm module, write a script that retrieves the GenBank record for sequence with GenBank accession HUMBHSD from the NCBI.

Steve A. Chervitz, sac@neomorphic.com
Last modified: Mon Nov 6 13:13:33 PST 2000