Several built-in globals affect input and output:
Say you have a text file containing records in the following interesting format:
>gi|5340860|gb|AI793144.1|AI793144 on36f02.y5 NCI_CGAP_Lu5 Homo sapiens cDNA clone
CAAACAGCCCCCGATAACGCTACGTGAGCTGGGCCCTGGGCCTGAGGCAGAAAACGGACGGAAGAAAAGG
TCTGGCCGGAGATGGGTCTCACTCTGTCACCCAGACTGGAGTGCAGTGAGTGGTGCGATCATAGCTTACT
GCAGCCTGAAACTCCTGGGCTCAAGTGATCTTCTCGCCTCAGCCTCCTGAGTAGCTGGAGCTACAGGAAT
GAGCATAGATGAACAATGTTGCATCACGCTTGACATCACCGGNGCTTCTTTCCAGTGTGGATTTGCTCAT
GTAAAATGAGGTGTGAGCTCTGCCTGAAAGCTTTTCCATATGCATCACATTTGCAGGGCTTTTCTCCAGT
GTGGGTTCTTTGGTGTCTCAAAAGATGTGAGCTGTTACTGAAAGCTTTCCCACACACATCACACTCATAG
GGCTTCTCTCTACCGTGGATTCGCTGGTGTCCAACAAGAGCTGAACTGTATCTGAAGGCCTTTCCACGCT
TGTCACATTCATATAGTTTCTTTCCACTGTGGATTNTCTGGTGACAGAAGAGGCCCAAGCACTAGCTAAA
GCTNTTCCCTCACTCACTACACTGCTATGGCTTCTCTTCAGTATGAACTCTGATGTTGTCTCAGATATGA
ACTCAGAGAGGATNTCCCACAATCATTACACTGGTATGGTTCCTTTTCGTGTGAGTTCTCTGGTGTCNAA
ATACATCTGAGCTGTGATGAAAGAACTTNCCACACTCACTACATTGGGAAGG
>gi|4306680|gb|AI451833.1|AI451833 mx13e08.y1 Soares mouse NML Mus musculus cDNA clone
TGAATGTATGCAGTGCGGAAAGACATTCACTTCTGGCCACTGTGCCAGAAGACATTTAGGGACTCACAGT
GGAGCCTGGCCTTACAAATGTGAAGTGTGTGGGAAAGCTTATCCCTACGTCTATTCCCTTCGAAACCACA
AAAAAAGTCACAACGAAGAAAAACTTTATGAATGTAAACAATGTGGGAAAGCCTTTAAATACATTTCTTC
CTTACGCAACCACGAGACTACTCACACTGGAGAGAAGCCCTATGAATGTAAGGAATGTGGGAAAGCCTTT
AGTTGTTCCAGTTACATTCAAAATCACATGAGAACACACAAAAGGCAGTCCTATGAATGTAAGGAGTGTG
GTAAGGTGTTCTCATATTCCAAAAGTCTTCGGAGACACATGACTACACATAGTTAATTAGAGAGGGATAG
TTNTAAGTATAATTTAAATATATAAAAGAGCTCTACACATTCTAGCTCCTCATTAAGAAACAAAAAATTT
CACACTGGAAAACGAGCCTATGAATGCAGTATGTGTGCCAAAGTCTCAGTACATGCCACAGT
>gi|3400733|gb|AI074089.1|AI074089 oq97c08.x1 NCI_CGAP_Co12 Homo sapiens cDNA clone
GAATCTTCTGGGTCCTCTTTATTAAGAGCCCTCTGCCTTCCCAGGGGAGGGAAGCAAATCCTTCAGGGCC
CCCAGAGTTCCTGCACCCCATATCATGGGTGAGTCCTACCAGCCACAGAGCCACCCGTCACCGTGGAGAG
GCTTAAGCTGCACTCAGAGCTCCCCCCGGGCATGCCGAATGTAGTGTTGATGCAGCCCTGCTTCCTGAGC
AAAGTCCTGACCGCACTCTGTGCAGGCGAAGGTGCCAGGAGGGGCACGGACCTCATGCATCTGGCGGTGC
CGCCTCAGAGAAACAGCCTGCCCAAAGGTCTTGCCACAGTCAGGACAAGGGAAGGTGGGCTGGGCAGTAG
TGGTTGCAACCGGCAGGGTGGGCTTGGCGGCTGGACCGTGGCTGCGCTGGTGGGTGATTAGGGCTTTGGA
...

If you use the standard <> operator, you get one line at a time and have to work out where one record ends and the next begins. However, if you set the input record separator ($/) to ">", then each "line" you read runs all the way to the next ">" symbol. Throw away the first record (which is empty) and keep the others.
#!/usr/local/bin/perl
# file: get_fasta_records.pl
$/ = '>';
<>;  # throw away the first record (will be empty)
while (<>) {
  chomp;
  # split up lines of the record. The first line
  # is the sequence ID. The second and subsequent lines
  # are the sequence
  my ($id,@sequence) = split "\n";
  my $sequence = join '',@sequence; # reassemble the sequence
}
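The record-at-a-time idea can also be packaged as a subroutine. What follows is only a minimal sketch, not part of the original script: the name read_fasta, the [id, sequence] return format, and the two-record sample string are all assumptions made for illustration. It uses local so that the change to $/ is undone when the subroutine returns, and it skips the empty first record inside the loop instead of discarding it beforehand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): read FASTA records
# from an open filehandle and return a list of [id, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    local $/ = '>';                    # restored when the sub returns
    my @records;
    while (my $record = <$fh>) {
        chomp $record;                 # strip the trailing '>'
        next unless length $record;    # skip the empty first record
        # First line is the sequence ID, the rest is the sequence.
        my ($id, @sequence) = split /\n/, $record;
        push @records, [ $id, join('', @sequence) ];
    }
    return @records;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = ">seq1 first test record\nACGT\nTTGA\n>seq2 second test record\nGGCC\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!\n";
for my $rec (read_fasta($fh)) {
    my ($id, $seq) = @$rec;
    print "$id: ", length($seq), " bases\n";
}
close $fh;
```

To run it against a real file, open that file instead of the in-memory string and pass the resulting filehandle to read_fasta.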
The input record separator has two special cases.
If the input record separator ($/) is set to the empty string (""), <> goes into paragraph mode. Each read returns everything up to the next blank line, and a run of several blank lines is treated as a single separator. This is useful for reading text that is divided into paragraphs.
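As a rough illustration of paragraph mode, the sketch below reads a small made-up text paragraph by paragraph. The subroutine name read_paragraphs and the sample text are assumptions for the example only. Note that in paragraph mode chomp removes all trailing newlines from the record, not just one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the paragraphs read from a filehandle.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';               # paragraph mode, inside this sub only
    my @paragraphs;
    while (my $para = <$fh>) {
        chomp $para;             # removes every trailing newline here
        push @paragraphs, $para;
    }
    return @paragraphs;
}

# Made-up sample text: two paragraphs separated by several blank lines.
my $text = "First paragraph,\nsecond line.\n\n\n\nSecond paragraph.\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";
my @paras = read_paragraphs($fh);
close $fh;

print scalar(@paras), " paragraphs\n";
print "[$_]\n" for @paras;
```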
If the input record separator is set to the undefined value (undef), <> goes into slurp mode and reads its entire input into a single scalar.
Here's how to read the entire file cosmids.fasta into a scalar variable:
open IN,"cosmids.fasta" or die "Can't open cosmids.fasta: $!\n";
$/ = undef;
$data = <IN>; # data slurp
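Assigning undef to $/ globally leaves every later read in the program in slurp mode too. One common way around this, sketched below, is to localize $/ inside a subroutine. The name slurp_file and the temporary file with made-up contents are assumptions for the example; they stand in for a real file such as cosmids.fasta.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the whole contents of a file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't open $path: $!\n";
    local $/;                    # undef inside this sub only: slurp mode
    my $data = <$in>;
    close $in;
    return $data;
}

# Demonstrate on a temporary file with made-up contents.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} ">seq1\nACGT\n>seq2\nGGCC\n";
close $tmp_fh;

my $data = slurp_file($tmp_name);
print length($data), " characters read\n";
```

Once slurp_file returns, $/ is back to its previous value, so ordinary line-by-line reads elsewhere in the program are unaffected.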