Perl can act as a Web browser to retrieve URLs on your behalf. There is a simple way that works only for GET scripts, and a more complex way that works for both GET and POST scripts.
This program retrieves the GenBank flat file document given an
accession number. All the magic happens in get(), which
fetches the page pointed to by a URL and returns its contents as a
string.
Code:
#!/usr/bin/perl -w
# file: fetch_gb.pl
use LWP::Simple;

my $NCBI_URL = 'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&Dopt=g&html=no&uid=';
my $accession = shift or die "Provide an accession number on the command line\n";
my $url = $NCBI_URL . $accession;
my $record = get($url);
print $record;
Output:
(~) 69% fetch_gb.pl M12345
Entrez Reports
----------------
LOCUS MUSMYCN 1540 bp DNA ROD 27-APR-1993
DEFINITION Mouse (ST4) c-myc proto-oncogene, promoter region.
ACCESSION M12345
NID g199964
VERSION M12345.1 GI:199964
KEYWORDS myc proto-oncogene; proto-oncogene.
SOURCE Mouse (cell line ST4, from S-MuLV infected BALB/c mouse) DNA.
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 1502)
AUTHORS Corcoran,L.M., Cory,S. and Adams,J.M.
TITLE Transposition of the immunoglobulin heavy chain enhancer to the myc
oncogene in a murine plasmacytoma
JOURNAL Cell 40, 71-79 (1985)
MEDLINE 85099331
REFERENCE 2 (bases 1 to 1540)
AUTHORS Corcoran,L.M.
JOURNAL Unpublished (1986)
COMMENT A printed copy of the sequence in [1],[2] was kindly provided by
L.M.Corcoran 21-OCT-1985.
FEATURES Location/Qualifiers
source 1..1540
/organism="Mus musculus"
/db_xref="taxon:10090"
misc_feature 72..73
/note="ST4 proviral insertion site"
misc_feature 96..97
/note="Tikaut proviral insertion site"
misc_feature 876..877
/note="ST1 proviral insertion site"
allele 1124
/note="g in ST4; a in ABPC17"
misc_feature 1129..1130
/note="ABPC17 Ig H-chain enhancer insertion site"
mRNA 1491..>1540
/note="myc mRNA"
BASE COUNT 366 a 392 c 426 g 356 t
ORIGIN 1 bp upstream of XbaI site; chromosome 15.
1 tctagaacca atgcacagag caaaagactc atgtttctgg ttggttaata agctagatta
61 tcgtgtatat ataaagtgtg tatgtatacg tttggggatt gtacagaatg cacagcgtag
121 tattcaggaa aaaggaaact gggaaattaa tgtataaatt aaaatcagct tttaattagc
181 ttaacacaca catacgaagg caaaaatgta acgttacttt gatctgatca gggccgactt
241 ttttttttaa gtgcataatt acgattccag taataaaagg ggaaagcttg ggtttgtcct
301 gggaggaagg ggttaacggt tttctttatt ctagggtctc tgcaggctcc ccagatctgg
361 gttggcaatt cactcctccc cctttctggg aagtccgggt tttccccaac cccccaattc
421 atggcatatt ctcgcgtcta gccttgattt tccccacccc agctcctaaa ccagagtctg
481 ctgcaaactg gctccacagg ggcaaagagg atttgcctct tgtgaaaacc gactgtggcc
541 ctggaactgt gtggaggtgt atggggtgta gaccggcaga gactcctccc ggaggagccg
601 gtagagcgca cccgccgcca ctttactgga ctgcgcaggg agacctacag gggaaagagc
661 cgcctccaca ccacccgccg gtggaagtcc gaaccggagg tgctggagtg tgtgtgtggg
721 gggggggggg ggaatctgcc ttttggcagc aaattggggg gggggtcgtt ctggaaagaa
781 tgtgcccagt caacataact gtacgaccaa aggcaaaata cacaatgcct tccccgcgag
841 atggagtggc tgtttatccc taagtggctc tccaagtata cgtggcagtg agttgctgag
901 caattttaat aaaattccag acatcgtttt tcctgcatag acctcatctg cggttgatca
961 ccctctatca ctccacacac tgagcggggg ctcctagata actcattcgt tcgtccttcc
1021 ccctttctaa attctgtttt ccccagcctt agagagacgc ctggccgccc gggacgtgcg
1081 tgacgcggtc cagggtacat ggcgtattgt gtggagcgag gcagctgttc cacctgcggt
1141 gactgatata cgcagggcaa gaacacagtt cagccgagcg ctgcgcccga acaaccgtac
1201 agaaagggaa aggactagcg cgcgagaaga gaaaatggtc gggcgcgcag ttaattcatg
1261 ctgcgctatt actgtttaca ccccggagcc ggagtactgg actgcgggct gaggctcctc
1321 ctcctctttc cccggctccc cactagcccc ctcccgagtt cccaaagcag agggcgggga
1381 aacgagagga aggaaaaaaa tagagagagg tggggaaggg agaaagagag gttctctggc
1441 taatccccgc ccacccgccc tttatattcc gggggtctgc gcggccgagg acccctggct
1501 gcgctgctct cagctgccgg gtccgactcg cctcactcag
//
For GET scripts it's easy. Just navigate to the page you want to fetch on the remote site and then write down the URL you see in the browser. You may need to fetch a few different pages to figure out which URL arguments to change to get the results you want.
Figuring out the NCBI parameters was even easier: they tell you how to link to their pages in the document at http://www.ncbi.nlm.nih.gov/Entrez/linking.html.
The CGI parameters use a horrible encoding system for spaces and other funny characters, known as application/x-www-form-urlencoded. If you have punctuation in your URLs, you have to escape it by replacing each character with %XX, where XX is the hexadecimal code for the character.
Yuck.
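To make the %XX rule concrete, here is a minimal hand-rolled sketch of the escaping. The function name url_encode is made up for this illustration, and the code assumes plain ASCII input. It escapes a space as %20; form encoding also allows "+" for a space.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not part of LWP or CGI.pm): replace every character
# other than letters, digits, '-', '_', '.' and '~' with %XX, where XX is
# the character's hexadecimal code. Assumes ASCII input.
sub url_encode {
    my $text = shift;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

print url_encode('Mus musculus & friends'), "\n";
# prints: Mus%20musculus%20%26%20friends
```

An accession number such as M12345 contains only letters and digits, so it passes through unchanged.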
Here is an easy way to construct the query string using CGI.pm:
#!/usr/bin/perl -w
# file: fetch_gb.pl
use CGI qw(:standard);
use LWP::Simple;

param(-name=>'db', -value=>'n');
param(-name=>'form', -value=>6);
param(-name=>'Dopt', -value=>'g');
param(-name=>'html', -value=>'no');
param(-name=>'uid', -value=>'M12345');
my $query = query_string(); # yields "db=n&form=6&Dopt=g&html=no&uid=M12345"
# (many CGI.pm releases join the pairs with ';' instead of '&' unless you import -oldstyle_urls)
You can now use $query to construct the URL passed to LWP's get() function.
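As a rough sketch of that last step, the script below glues a query string onto the NCBI base URL from the first example and fetches the result. The helper names make_url and fetch_or_die are invented here, and the query string is written out literally so the script stands alone. get() returns undef when the fetch fails, so the sketch checks for that before printing.

```perl
#!/usr/bin/perl -w
use strict;

# Base URL from the first example, without the query string.
my $NCBI_BASE = 'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query';

# Hypothetical helper: join a base URL and a query string with '?'.
sub make_url {
    my ($base, $query) = @_;
    return $base . '?' . $query;
}

# Hypothetical helper: fetch a URL with LWP::Simple's get() and die
# with a message if nothing comes back (get() returns undef on failure).
sub fetch_or_die {
    my $url = shift;
    require LWP::Simple;    # loaded only when a fetch is actually requested
    my $content = LWP::Simple::get($url);
    defined $content or die "Could not fetch $url\n";
    return $content;
}

# Stand-in for the string produced by query_string() above.
my $query = 'db=n&form=6&Dopt=g&html=no&uid=M12345';
my $url   = make_url($NCBI_BASE, $query);
print "$url\n";

# Pass any argument on the command line to perform the fetch as well.
print fetch_or_die($url) if @ARGV;
```

Run without arguments, the script only prints the assembled URL, which is handy for checking the parameters before touching the network.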