Retrieving Data from GET Scripts

Perl can act as a Web browser to retrieve URLs on your behalf. There is a simple way that works only for GET scripts, and a more complex way that works for both GET and POST scripts.

Simple GETs with LWP::Simple

This program retrieves the GenBank flat file for a given accession number. All the magic happens in get(), which fetches the page a URL points to and returns its contents as a string, or undef if the fetch fails.

Code:


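#!/usr/bin/perl -w
# file: fetch_gb.pl

use strict;
use LWP::Simple;

# Entrez query URL; the accession number is appended as the uid argument
my $NCBI_URL =
  'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&Dopt=g&html=no&uid=';

my $accession = shift or die "Provide an accession number on the command line\n";
my $url = $NCBI_URL . $accession;

# get() returns undef if the page cannot be fetched
my $record = get($url);
die "Could not retrieve $url\n" unless defined $record;
print $record;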

Output:

(~) 69% fetch_gb.pl M12345
Entrez Reports
----------------
LOCUS       MUSMYCN      1540 bp    DNA             ROD       27-APR-1993
DEFINITION  Mouse (ST4) c-myc proto-oncogene, promoter region.
ACCESSION   M12345
NID         g199964
VERSION     M12345.1  GI:199964
KEYWORDS    myc proto-oncogene; proto-oncogene.
SOURCE      Mouse (cell line ST4, from S-MuLV infected BALB/c mouse) DNA.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1502)
  AUTHORS   Corcoran,L.M., Cory,S. and Adams,J.M.
  TITLE     Transposition of the immunoglobulin heavy chain enhancer to the myc
            oncogene in a murine plasmacytoma
  JOURNAL   Cell 40, 71-79 (1985)
  MEDLINE   85099331
REFERENCE   2  (bases 1 to 1540)
  AUTHORS   Corcoran,L.M.
  JOURNAL   Unpublished (1986)
COMMENT     A printed copy of the sequence in [1],[2] was kindly provided by
            L.M.Corcoran 21-OCT-1985.
FEATURES             Location/Qualifiers
     source          1..1540
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     misc_feature    72..73
                     /note="ST4 proviral insertion site"
     misc_feature    96..97
                     /note="Tikaut proviral insertion site"
     misc_feature    876..877
                     /note="ST1 proviral insertion site"
     allele          1124
                     /note="g in ST4; a in ABPC17"
     misc_feature    1129..1130
                     /note="ABPC17 Ig H-chain enhancer insertion site"
     mRNA            1491..>1540
                     /note="myc mRNA"
BASE COUNT      366 a    392 c    426 g    356 t
ORIGIN      1 bp upstream of XbaI site; chromosome 15.
        1 tctagaacca atgcacagag caaaagactc atgtttctgg ttggttaata agctagatta
       61 tcgtgtatat ataaagtgtg tatgtatacg tttggggatt gtacagaatg cacagcgtag
      121 tattcaggaa aaaggaaact gggaaattaa tgtataaatt aaaatcagct tttaattagc
      181 ttaacacaca catacgaagg caaaaatgta acgttacttt gatctgatca gggccgactt
      241 ttttttttaa gtgcataatt acgattccag taataaaagg ggaaagcttg ggtttgtcct
      301 gggaggaagg ggttaacggt tttctttatt ctagggtctc tgcaggctcc ccagatctgg
      361 gttggcaatt cactcctccc cctttctggg aagtccgggt tttccccaac cccccaattc
      421 atggcatatt ctcgcgtcta gccttgattt tccccacccc agctcctaaa ccagagtctg
      481 ctgcaaactg gctccacagg ggcaaagagg atttgcctct tgtgaaaacc gactgtggcc
      541 ctggaactgt gtggaggtgt atggggtgta gaccggcaga gactcctccc ggaggagccg
      601 gtagagcgca cccgccgcca ctttactgga ctgcgcaggg agacctacag gggaaagagc
      661 cgcctccaca ccacccgccg gtggaagtcc gaaccggagg tgctggagtg tgtgtgtggg
      721 gggggggggg ggaatctgcc ttttggcagc aaattggggg gggggtcgtt ctggaaagaa
      781 tgtgcccagt caacataact gtacgaccaa aggcaaaata cacaatgcct tccccgcgag
      841 atggagtggc tgtttatccc taagtggctc tccaagtata cgtggcagtg agttgctgag
      901 caattttaat aaaattccag acatcgtttt tcctgcatag acctcatctg cggttgatca
      961 ccctctatca ctccacacac tgagcggggg ctcctagata actcattcgt tcgtccttcc
     1021 ccctttctaa attctgtttt ccccagcctt agagagacgc ctggccgccc gggacgtgcg
     1081 tgacgcggtc cagggtacat ggcgtattgt gtggagcgag gcagctgttc cacctgcggt
     1141 gactgatata cgcagggcaa gaacacagtt cagccgagcg ctgcgcccga acaaccgtac
     1201 agaaagggaa aggactagcg cgcgagaaga gaaaatggtc gggcgcgcag ttaattcatg
     1261 ctgcgctatt actgtttaca ccccggagcc ggagtactgg actgcgggct gaggctcctc
     1321 ctcctctttc cccggctccc cactagcccc ctcccgagtt cccaaagcag agggcgggga
     1381 aacgagagga aggaaaaaaa tagagagagg tggggaaggg agaaagagag gttctctggc
     1441 taatccccgc ccacccgccc tttatattcc gggggtctgc gcggccgagg acccctggct
     1501 gcgctgctct cagctgccgg gtccgactcg cctcactcag 
//

How Do You Figure out the Magic URL?

For GET scripts it's easy. Just navigate to the page you want to fetch on the remote site and then write down the URL you see in the browser. You may need to fetch a few different pages to figure out which URL arguments to change to get the results you want.
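
When comparing the URLs of a few pages, it can help to split each one into its script address and its arguments. The following is only a minimal sketch: split_get_url() is a hypothetical helper written for this illustration, and it does not decode %XX escapes in the values (the URI module's query_form() method handles that case properly).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a GET URL into the script address and
# a list of [name, value] pairs. Escapes in the values are NOT decoded.
sub split_get_url {
    my ($url) = @_;
    my ($script, $query) = split /\?/, $url, 2;
    my @pairs;
    for my $pair (split /[&;]/, defined $query ? $query : '') {
        my ($name, $value) = split /=/, $pair, 2;
        push @pairs, [$name, defined $value ? $value : ''];
    }
    return ($script, @pairs);
}

my ($script, @pairs) = split_get_url(
    'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n&form=6&Dopt=g&html=no&uid=M12345'
);

print "script: $script\n";
printf "  %-5s = %s\n", @$_ for @pairs;
```

Running this on two URLs copied from the browser and comparing the printed pairs shows which argument changed between the pages.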

Figuring out the NCBI parameters was even easier: NCBI documents how to link to its pages at http://www.ncbi.nlm.nih.gov/Entrez/linking.html.

Using CGI.pm to Escape CGI Parameters for You

CGI parameters use a horrible encoding system for spaces and other funny characters, known as application/x-www-form-urlencoded. If your parameter values contain punctuation, you have to escape each such character by replacing it with %XX, where XX is the character's hexadecimal code.
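
To make the %XX rule concrete, here is a minimal hand-rolled sketch. escape_value() is a hypothetical name chosen for this illustration, and the code assumes plain ASCII input. It escapes spaces as %20, although form encoding also allows + for a space. In real programs the uri_escape() function from URI::Escape does the same job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character other than letters, digits,
# and - _ . ~ with % followed by its two-digit hexadecimal code.
# Assumes ASCII input.
sub escape_value {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

print escape_value('Mus musculus & c-myc'), "\n";
# prints: Mus%20musculus%20%26%20c-myc
```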

Yuck.

Here is an easy way to construct the query string using CGI.pm:


#!/usr/bin/perl -w
# file: fetch_gb.pl

# -oldstyle_urls joins the pairs with '&'; newer CGI.pm versions default to ';'
use CGI qw(:standard -oldstyle_urls);
use LWP::Simple;

param(-name=>'db',   -value=>'n');
param(-name=>'form', -value=>6);
param(-name=>'Dopt', -value=>'g');
param(-name=>'html', -value=>'no');
param(-name=>'uid',  -value=>'M12345');

my $query = query_string(); # yields "db=n&form=6&Dopt=g&html=no&uid=M12345"

You can now use $query to construct the URL passed to LWP's get() function.
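
As a rough sketch of that last step, the query string can be joined to the script address with a question mark. The address is the Entrez one from the first listing and may no longer respond, $query is written out literally here so that the example stands alone, and the --fetch switch is a hypothetical convenience that keeps the script off the network unless asked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Script address taken from the first listing, without its arguments
my $base  = 'http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query';

# Stand-in for the string returned by query_string()
my $query = 'db=n&form=6&Dopt=g&html=no&uid=M12345';

my $url = "$base?$query";
print "$url\n";

# Hypothetical --fetch switch: only contact the server when requested
if (@ARGV && $ARGV[0] eq '--fetch') {
    require LWP::Simple;
    my $record = LWP::Simple::get($url);   # undef if the fetch fails
    die "Could not retrieve $url\n" unless defined $record;
    print $record;
}
```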



Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Fri Oct 22 14:10:33 EDT 1999