Christiansen and Torkington, Perl Cookbook, Chapter 11, "References and Records" (and, for the self-destructive and foolhardy, Section 13.13, "Coping with Circular Data Structures").
The perlref man page.
The perlobj man page.
ref: What Kind Of Value Does It Point To?
Sometimes you need a more complex data structure!
Examples:
| Spot_num | Ch1-BKGD | Ch1 | Ch2-BKGD | Ch2 |
|---|---|---|---|---|
| 000 | 0.124 | 43.2 | 0.102 | 80.4 |
| 001 | 0.113 | 60.7 | 0.091 | 22.6 |
| 002 | 0.084 | 112.2 | 0.144 | 35.3 |
| Accession | Ch1-BKGD | Ch1 | Ch2-BKGD | Ch2 |
|---|---|---|---|---|
| AW10021 | 0.124 | 43.2 | 0.102 | 80.4 |
| BE52002 | 0.113 | 60.7 | 0.091 | 22.6 |
| W20209 | 0.084 | 112.2 | 0.144 | 35.3 |
Well, first, what is a variable?
Think of a variable as a (named) box that holds a value. The name of the box is the name of the variable. After
$x = 1;
we have
+---+
$x: | 1 |
+---+
After
@y = (1, 'a', 23);
we have
+---------------+
@y: | (1, 'a', 23) |
+---------------+
$list_ref = \@array;
$map_ref = \%hash;
$c_ref = \$count;
Refs to subroutines
$sub_ref = \&subroutine;
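A minimal sketch of taking a reference to a subroutine and calling through it. The helper gc_fraction and the sample string are made up for illustration; they do not come from the course materials.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: fraction of G/C bases.
sub gc_fraction {
    my ($dna) = @_;
    return 0 unless length($dna);
    my $gc = ($dna =~ tr/gcGC//);    # tr returns the number of matches
    return $gc / length($dna);
}

my $sub_ref = \&gc_fraction;                # reference to the named subroutine

my $via_arrow = $sub_ref->('ggccaatt');     # call through the reference
my $via_amp   = &{$sub_ref}('ggccaatt');    # same call, older syntax

print "$via_arrow $via_amp\n";              # prints "0.5 0.5"
print ref($sub_ref), "\n";                  # prints "CODE"
```

Because $sub_ref is an ordinary scalar, it can be stored in an array or hash, or passed to another subroutine, like any other value.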
A reference is an additional, rather different way to name the variable. After
$ref_to_y = \@y;
we have
+---------------+
+-> @y: | (1, 'a', 23) |
| +---------------+
|
|
+-|-+
$ref_to_y: | * |
+---+
$ref_to_y contains a reference (pointer) to @y.
print @y yields 1a23
and print $ref_to_y yields ARRAY(0x80cd6ac).
@{array_reference}
%{hash_reference}
${scalar_reference}
print @{$ref_to_y} yields 1a23.
After
$y[3] = 'z';
print @{$ref_to_y}
yields 1a23z.
Why?
+--------------------+
+-> @y: | (1, 'a', 23, 'z') |
| +--------------------+
|
|
+-|-+
$ref_to_y: | * |
+---+
After
@y = (5, 6, 7);
we have
+----------+
+-> @y: | (5, 6, 7)|
| +----------+
|
|
+-|-+
$ref_to_y: | * |
+---+
print @{$ref_to_y}
yields 567.
After
$ref_to_y2 = $ref_to_y;
we have
+---+
$ref_to_y2: | * |
+-|-+
|
| +-----------+
+-> @y: | (5, 6, 7) |
+-> +-----------+
|
|
+-|-+
$ref_to_y: | * |
+---+
print @{$ref_to_y}
and
print @{$ref_to_y2}
both yield
567.
After
@z = @{$ref_to_y};
$ref_to_y->[0] = '2';
$ref_to_y2->[2] = '24';
we have
+---+
$ref_to_y2: | * |
+-|-+
|
| +------------+
+-> @y: | (2, 6, 24) |
+-> +------------+
|
|
+-|-+
$ref_to_y: | * |
+---+
+-----------+
@z: | (5, 6, 7) |
+-----------+
Given a reference called $hash_ref (to our favorite hash);
Given a reference called $my_cool_subroutine (to our favorite subroutine)
$y_gene_families = ['DAZ', 'TSPY', 'RBMY', 'CDY1', 'CDY2' ];
$y_gene_family_counts = { 'DAZ'  => 4,
                          'TSPY' => 20,
                          'RBMY' => 10,
                          'CDY2' => 2 };
$y_gene_families gets a reference
to an array, and $y_gene_family_counts
gets a reference to a hash. (See the book for subroutines and scalars.)
For example
for (keys %{$y_gene_family_counts}) { print "$_\n" }
my @a = @{$y_gene_families};
${$y_gene_families}[0];
${$y_gene_family_counts}{'DAZ'}
Arrow shorthand:
$y_gene_families->[0]; # yields 'DAZ'
$y_gene_family_counts->{'DAZ'} # yields '4'
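As a sketch, the block form, the short form, and the arrow form all reach the same element. The variable names $first_a, $first_b, $first_c, $how_many, and $last are introduced here only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $y_gene_families      = ['DAZ', 'TSPY', 'RBMY', 'CDY1', 'CDY2'];
my $y_gene_family_counts = { 'DAZ' => 4, 'TSPY' => 20, 'RBMY' => 10, 'CDY2' => 2 };

# Three ways to write the same element access.
my $first_a = ${$y_gene_families}[0];    # block form
my $first_b = $$y_gene_families[0];      # braces dropped
my $first_c = $y_gene_families->[0];     # arrow shorthand

my $how_many = scalar @{$y_gene_families};   # number of elements
my $last     = $y_gene_families->[-1];       # negative indexes count from the end

# Walk the hash through its reference; the arrow form interpolates in strings.
for my $family (sort keys %{$y_gene_family_counts}) {
    print "$family\t$y_gene_family_counts->{$family}\n";
}
```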
ref - What Kind Of Value Does This Reference Point To?
print ref($y_gene_families), "\n";         # ARRAY
print ref($y_gene_family_counts), "\n";    # HASH
$x = 1; print ref($x), "\n";               # (empty string)
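A sketch of ref applied to the other kinds of reference, followed by a small dispatcher that behaves differently for arrays and hashes. The subroutine describe is hypothetical and exists only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %kind_of = (
    array  => ref([1, 2, 3]),     # ARRAY
    hash   => ref({ a => 1 }),    # HASH
    code   => ref(sub { 1 }),     # CODE
    scalar => ref(\"text"),       # SCALAR
    plain  => ref(42),            # empty string: 42 is not a reference
);
print "$_: '$kind_of{$_}'\n" for sort keys %kind_of;

# Hypothetical helper: choose how to handle a value based on ref.
sub describe {
    my ($thing) = @_;
    my $type = ref $thing;
    return scalar(@$thing) . " elements"  if $type eq 'ARRAY';
    return scalar(keys %$thing) . " keys" if $type eq 'HASH';
    return "plain value $thing";
}

print describe([1, 2, 3]), "\n";
print describe({ DAZ => 4, TSPY => 20 }), "\n";
print describe('gattc'), "\n";
```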
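#!/usr/bin/perl
use strict;

# Default to the course data file when no file is named on the command line.
@ARGV = '/net/share/perl_refs/cosmids1.txt' unless @ARGV;

$/ = ">";    # read one ">"-delimited record at a time
my %DATA;

while (<>) {
    chomp;                               # remove the trailing ">"
    my ($id_line, @rest) = split "\n";
    $id_line =~ /^(\S+)/ or next;        # skip records with no identifier
    my $id = $1;

    my $sequence = join '', @rest;
    my $length   = length $sequence;
    next unless $length;                 # skip empty sequences (avoids dividing by zero)
    my $gc_count   = $sequence =~ tr/gcGC/gcGC/;
    my $gc_content = $gc_count / $length;

    # Store one anonymous hash (a "record") per sequence.
    $DATA{$id} = { sequence   => $sequence,
                   length     => $length,
                   gc_content => sprintf("%3.2f", $gc_content)
                 };
}

# Sort the identifiers by the gc_content field of their records.
my @ids = sort { $DATA{$a}->{gc_content} <=> $DATA{$b}->{gc_content}
               } keys %DATA;

foreach my $id (@ids) {
    print "$id\n";
    print "\tgc content = $DATA{$id}->{gc_content}\n";
    print "\tlength = $DATA{$id}->{length}\n";
    print "\n";
}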
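Use the x command in the Perl debugger (started with perl -d). Give it a reference, for example x \%DATA, to see the nested structure.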
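Outside the debugger, the core Data::Dumper module prints a similar view of a nested structure. The sketch below uses a single made-up record (the ID cosmid_A and its values are not from the course data) shaped like the entries of %DATA.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # print hash keys in sorted order
$Data::Dumper::Indent   = 1;    # compact, fixed indentation

# Made-up record, for illustration only.
my %DATA = (
    'cosmid_A' => { sequence => 'gattcc', length => 6, gc_content => '0.50' },
);

# Pass a reference so the hash is shown as one nested structure
# rather than as a flat list of keys and values.
my $dump = Dumper(\%DATA);
print $dump;
```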
Perl objects are special references that come bundled with a set of functions that know how to act on the contents of the reference. For example, in BioPerl, there is a Sequence object. Internally, the Sequence object is a hash reference that has keys that point to the DNA string, the name and source of the sequence, and other attributes. The object is bundled with functions that know how to manipulate the sequence, such as revcom(), translate(), subseq(), etc.
When talking about objects, the bundled functions are known as methods. This terminology derives from the granddaddy of all object-oriented languages, Smalltalk. You invoke a method using the -> operator, a syntax that looks a lot like getting at the value that a reference points to.
For example, if we have a Sequence object stored in the scalar variable $sequence, we can call its methods like this:
$reverse_complement = $sequence->revcom();
$first_10_bases     = $sequence->subseq(1,10);
$protein            = $sequence->translate;
We assume that you've created a Sequence object and stored it into $sequence at some previous point. We'll see how to do this later.
First we call the Sequence object's revcom() method, which creates the reverse complement of the sequence and stores it into the scalar variable $reverse_complement. Then we call subseq(1,10) to return the subsequence spanning bases one through ten. Finally we call the object's translate() method to turn the DNA into a protein. You will learn from the BioPerl lecture that revcom(), subseq() and translate() are all returning new Sequence objects that themselves know how to revcom(), translate() and so forth. So if you wanted to get the protein translation from the reverse complement, you could do this:
$reverse_complement = $sequence->revcom();
$protein            = $reverse_complement->translate;
Don't be put off by this syntax! $sequence is really just a hash reference, and you can get its keys using keys %$sequence, peek at the contents of the "_seq_length" key using $sequence->{_seq_length}, and so forth. Indeed, the syntax $sequence->translate is just a fancy way of writing translate($sequence), except that the object knows what module the translate() function is defined in.
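To make the "just a hash reference" point concrete, here is a toy class. TinySeq is hypothetical and is not how BioPerl implements its Sequence objects; it only sketches the mechanism of blessing a hash reference into a package and calling methods on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package TinySeq;    # hypothetical class, not part of BioPerl

sub new {
    my ($class, $dna) = @_;
    my $self = { _seq => $dna, _seq_length => length($dna) };
    return bless $self, $class;    # tie the hash reference to this package
}

sub seq {
    my ($self) = @_;
    return $self->{_seq};
}

sub revcom {
    my ($self) = @_;
    my $rc = reverse $self->{_seq};     # scalar context: reverses the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;       # complement each base
    return TinySeq->new($rc);           # return a new object, not a string
}

package main;

my $sequence = TinySeq->new('gattcc');
my $rc       = $sequence->revcom;

print $rc->seq(), "\n";                          # prints "ggaatc"
print join(',', sort keys %$sequence), "\n";     # prints "_seq,_seq_length"
print TinySeq::revcom($sequence)->seq(), "\n";   # same method, called as a plain function
```

The last line spells out what the arrow does: Perl looks up revcom in the package the object was blessed into and passes the object as the first argument.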
Before you can start using objects, you must load their definitions from the appropriate module(s). This is just like loading subroutines from modules, and you use the use statement in both cases. For example, if we want to load the BioPerl Sequence definitions, we load the appropriate module, which in this case is called Bio::PrimarySeq (you learn this from reading the BioPerl documentation):
#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
Now you'll probably want to create a new object. There are a variety of ways to do this, and details vary from module to module, but most modules, including Bio::PrimarySeq, do it using the new() method:
#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = Bio::PrimarySeq->new('gattcgattccaaggttccaaa');
The syntax here is ModuleName->new(@args), where ModuleName is the name of the module that contains the object definitions. The new() method will return an object that belongs to the ModuleName class. So in the example above, we get a Bio::PrimarySeq object, which is the simplest of BioPerl's various Sequence object types.
An alternative way to call new() puts the method name in front of the module name:
#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = new Bio::PrimarySeq('gattcgattccaaggttccaaa');
This is exactly equivalent to Bio::PrimarySeq->new(), but looks more natural to Java programmers.
When you call object methods, you can pass a list of arguments, just as you would to a regular function. We saw an example of this earlier when we called $sequence->subseq(1,10). As methods get more complex, argument lists can get quite long and have possibly dozens of optional arguments. To make this manageable, many object-oriented modules use a named parameter style of argument passing, that looks like this:
my $result = $object->method(-arg1=>$value1,-arg2=>$value2,-arg3=>$value3);
In this case "-arg1", "-arg2", and so on are the names of arguments, and $value1, $value2 are the values of those named arguments. The name/value pairs can occur in any order.
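On the receiving side, one common way to handle named parameters is to copy the argument list into a hash. The sketch below assumes a hypothetical function make_oligo with made-up defaults; it is not how Bio::PrimarySeq itself parses its arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical function, for illustration only.
sub make_oligo {
    my %args = (
        -alphabet    => 'dna',    # defaults come first...
        -is_circular => 0,
        @_,                       # ...so the caller's pairs override them
    );
    die "make_oligo: -seq is required\n" unless defined $args{-seq};

    return {
        seq         => $args{-seq},
        id          => $args{-id},
        alphabet    => $args{-alphabet},
        is_circular => $args{-is_circular},
    };
}

# The pairs can be given in any order, and optional ones can be left out.
my $oligo = make_oligo(-id => 'oligo23', -seq => 'gattcg');
print "$oligo->{id} $oligo->{seq} $oligo->{alphabet}\n";    # prints "oligo23 gattcg dna"
```

Because the arguments land in a hash, their order does not matter, and any parameter the caller omits keeps its default.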
As a practical example, Bio::PrimarySeq->new() actually takes multiple optional arguments that allow you to specify the alphabet, the source of the sequence, and so forth. Rather than create a humungous argument list which forces you to remember the correct position of each argument, Bio::PrimarySeq lets you create a new Sequence this way:
#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = Bio::PrimarySeq->new(-seq              => 'gattcgattccaaggttccaaa',
                                    -id               => 'oligo23',
                                    -alphabet         => 'dna',
                                    -is_circular      => 0,
                                    -accession_number => 'X123'
                                   );
Notice that we've broken the argument list across multiple lines. This makes it easier to read, but means nothing special to Perl.
@y,
$ref_to_y, etc.)
$e. What is the value of ref($DATA{$id})?
Write a subroutine unwrap that takes as its argument the name of a file and returns a reference to a hash that maps sequence identifiers to sequences, e.g.
$x = unwrap('/net/share/perl_refs/cosmids1.txt');
print $x->{'ZK1307.9'}, "\n";
print $x->{'ZK1248.6'}, "\n";
print $x->{'ZK1236.5'}, "\n";
produces
atgggagagcgtaaaggacaa...
atggcccaatccgtcccaccg...
tcagtcccatcgttttcttgc...
Hint (sketch only, not working code):
sub unwrap {
    # ... Get argument, $filepath
    my $result = {};
    # ...
    while (...) {
        # ... Get $sequence_id and $sequence for each entry;
        $result->{$sequence_id} = $sequence;
    }
    $result;
}
Write a subroutine, threeframes, that takes as input the data structure returned by the subroutine unwrap and returns a hash reference that maps sequence IDs to four-element hash reference "records". Each record contains the keys sequence, frame1, frame2, and frame3. For example (from cosmids.fasta):
{ 'ZK1307.9' => { 'sequence' => 'atgggagagcgtaaaggacaa...',
'frame1' => 'atg gga gag cgt aaa gga...',
'frame2' => 'tgg gag agc gta aag gac...',
'frame3' => 'ggg aga gcg taa agg aca...' },
'ZK1248.6' => { 'sequence' => 'atggcccaatccgtcccaccg...',
...
},
...
}
Running
$x = threeframes(unwrap('/net/share/perl_refs/cosmids1.fasta'));
print $x->{'ZK1307.9'}->{'frame1'}, "\n";
print $x->{'ZK1248.6'}->{'frame1'}, "\n";
print $x->{'ZK1248.6'}->{'frame3'}, "\n";
should produce
atg gga gag cgt aaa gga...
atg gcc caa tcc gtc cca...
ggc cca atc cgt ccc acc...
Hint (sketch only, not working code):
sub threeframes {
    # Get argument $inhash
    my $result = {};
    for (keys %{$inhash}) {
        my $seq = $inhash->{$_};
        $result->{$_}->{'sequence'} = $seq;
        # Do frame 1
        my @frames = # .. you know how to do this ..
        $result->{$_}->{'frame1'} = join(' ', @frames);
        # Do frame 2
        $seq = substr($seq, 1);
        @frames = # .. you know how to do this ..
        $result->{$_}->{'frame2'} = join(' ', @frames);
        # Do frame 3
        ...
    }
    $result;
}
Modify threeframes so that 'frame1' points to an array reference, each element of which is one codon.