Perl Basics 3

Subroutines, Loops, Input/Output, Regular Expressions

Lincoln Stein

Genome Informatics

Suggested Reading

Chapters 6, 7 & 8 of Learning Perl.

Lecture Notes

  1. Defining Your Own Functions
  2. Loops
  3. Input/Output

Problems

These problems are to be done over the course of several workshops, depending on time. The lecturer will tell you which problems to attempt during the workshop!

  1. Write a script to read a text file one line at a time. Determine the length of each line, not counting the newline. When the text file is completely written, print out the total number of lines and the total number of characters in each line.

    You'll find a test file named example1.fasta in /mnt/share/data/perl3

          % count_lines example1.fasta
          TOTAL_LINES = 1392
          TOTAL_CHARACTERS = 97441
          

    Note, your results will differ from this -- this is just an example for output format.

  2. Modify the previous script to compute the distribution of line lengths. At the end of the script, print out a sorted two-column list of line lengths and the number times each length was seen (hint: use a hash and the sort function).
          % line_distribution example1.fasta
          TOTAL_LINES = 1392
          Length      Count
          12	       1
          20	       2
          28	       1
          36	       1
          40	       3
          60	       89
          

    Note, your results will differ from this -- this is just an example for output format.

  3. Modify the previous script to present the distribution data sorted by frequency rather than length:
       % line_distribution2 example1.fasta
       TOTAL_LINES = 1392
       Length      Count
       60	       89
       40	       3
       20	       2
       12	       1
       28	       1
       36	       1
    
  4. "Unwrap" the contents of a FASTA file, so that each sequence is printed out as one long line. Print the identifier, followed by a tab, followed by the sequence (hint: read up on the input record separator, $/, and on the pattern matching operator):
       % unwrap example1.fasta
       M43911 GATTCCGATCCCCCCCCCAGTTTGACCAAAGTTCAGAGGAAATCCCAGACCAAC....
       L54931 GGGTGGTGGTGAGAGAGAGCGATTGAAAGCTATATATATGACCGATTCACAGGT....
       L54932 TAGTTGATTCAGTCCGATTTCAATTGATTTCCCGTATATCCTTAAGGGTTTAAA....
    

  5. In Zea maize, the oligonucleotides Pu-C-G and Pu-C-X-G are two potentially methylated mcrBC recogition sites. Write a program to search for such sites in a FASTA file. The output should be the name of the sequence and the count of such sites found. (Hint: pipe the output of the unwrap program to this script in order to avoid reparsing the FASTA file.)

  6. From the same sequence file, search for sequences of (TA)n repeats, where n >= 5 and "mask" them by replacing them with N's.

  7. From the same sequence file, perform a reverse complimentation on each sequence and print them out in tab-delimited format:
          % unwrap example1.fasta | reverse_complement
          M43911 GTTGGTCTGGGATTTCCTCTGAACTTTGGTCAAACTGGGGGGGGGATCGGAATC...
          L54931 ACCTGTGAATCGGTCATATATATAGCTTTCAATCGCTCTCTCTCACCACCACCC...
          L54932 TTTAAACCCTTAAGGATATACGGGAAATCAATTGAAATCGGACTGAATCAACTA...
          

  8. From the same sequence file, split each sequence into three-letter "codons" separated by spaces:
       % unwrap example1.fasta | codons
       M43911 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA...
       L54931 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC...
       L54932 TAG TTG ATT CAG TCC GAT TTC AAT TGA TTT CCC GTA TAT CCT...
          

  9. Modify the program to print out codons in three different reading frames. Call it "codons_threeframe":
       % unwrap example1.fasta | codons_threeframe
       M43911.1 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA...
       M43911.2 ATT CCG ATC CCC CCC CCA GTT TGA CCA AAG TTC AGA GGA AAC...
       M43911.3 TTC CGA TCC CCC CCC CAG TTT GAC CAA AGT TCA GAG GAA ACC...
       L54931.1 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC...
       L54931.2 GGT GGT GGT GAG AGA GAG CGA TTG AAA GCT ATA TAT ATG ACT...
       ...
    
  10. Pipe the output of this program to a codon translation program named "ribosome":
         % unwrap example1.fasta | codons_threeframe | ribosome
         M43911.1 P   G   G   *   U   L   L   M   X   X   X   X   X   X
         M43911.2 ....
         M43911.3 
         L54931.1 
         L54931.2 
    
    To help you, cut and paste this translation table:
    %CODON_TABLE = (
       TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S',
       TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L',
       TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*',
       TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W',
       CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L',
       CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P',
       CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q',
       CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R',
       ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M',
       ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T',
       AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K',
       AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R',
       GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V',
       GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A',
       GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E',
       GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G');
    	
  11. Pipe the output of these programs to a program named "longest_orf", to identify the longest ORF in the three possible reading frames:
        % unwrap example1.fasta | codons_threeframe | ribosome | longest_orf
        ID	     Frame	 Length (aa)
        M43911   2-480	 1650
        L54931   31-638	 202
        L54932   1-1032	 344
    
  12. Write a program called "gc_content" to determine the GC content of sequences in a FASTA file. The program should use a "sliding window" to compute the %GC, and allow the user to specify the size of the window on the command line (hint: use the standard Perl Getopt::Long module). The output should look like this:
        % unwrap example1.fasta | gc_content --window 50
        1     48.1
        2     48.2
        3	  48.1
        ...
        1000  55.8
    

Answers

Here are the answers to the problem set.


Genome Informatics


Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Sun Oct 20 15:35:14 EDT 2002