
BoulderIO Applications

The following applications have been built on top of BoulderIO and are maintained at Cold Spring Harbor. They are all part of the main distribution and can be found in the eg subdirectory.

quickblast

Synopsis:

quickblast.pl [options] -source source_file search_file

Run BLAST on one or more sequences and summarize the results.  The
source database is an ordinary fasta file.  The program runs pressdb
to create a temporary blast database in /usr/tmp (or other location of
TMPDIR).

Options:
       -source  database  Source fasta database (no default)
       -dir     path      Directory in which to save intermediate results (don't save)
       -tmp     path      Scratch directory
       -program path      Variant of BLAST to run (blastn)
       -params  string    Parameters to pass to program
       -minlen  float     Minimum HSP length, as fraction of total search length (0.0)
       -cutoff  float     Minimum significance cutoff
       -tabular           Produce output in tabular format
       -boulder           Produce output in boulder format (default)

This program is a front end to the Washington University version of the BLAST program for comparing nucleotide and protein sequences. Given two FASTA-format files, it handles the details of invoking pressdb or setdb to create searchable databases. It then performs an M x N comparison of the contents of the two files, comparing each sequence in one file against each sequence in the other.

You can set the output from this script to be any of the following:

  1. A tabular summary, by specifying the -tabular option.
  2. An easily-processed BoulderIO stream by specifying the -boulder option (this is also the default).
  3. A directory of raw BLAST files. This can be specified in addition to either of the other two formats.
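
The -minlen option is described as a fraction of the total search length. As a minimal sketch of what such a filter could look like (in Python rather than the script's own Perl, and not taken from quickblast itself), the example below assumes that "total search length" means the length of the search sequence. The Hsp record and the passes_minlen function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """One high-scoring segment pair (hypothetical record, not quickblast's own type)."""
    subject: str
    length: int  # aligned length, in residues


def passes_minlen(hsp: Hsp, search_length: int, minlen: float = 0.0) -> bool:
    """Keep an HSP only if it covers at least `minlen` of the search sequence.

    Assumption: the fraction is measured against the search sequence length.
    """
    if search_length <= 0:
        raise ValueError("search_length must be positive")
    return hsp.length / search_length >= minlen


# Two made-up hits against a 200-residue search sequence.
hsps = [Hsp("clone_a", 180), Hsp("clone_b", 40)]
kept = [h.subject for h in hsps if passes_minlen(h, search_length=200, minlen=0.5)]
print(kept)
```

With the default of 0.0 every HSP passes, which matches the default shown in the options list.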

gb_search

Synopsis:

gb_search [options] query string

Query GenBank for a list of accession numbers.  The query
string should be in the form recognized by NCBI's term parser.
See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for examples.

Options:
       -db  database  Database to search (n)
       -max max      Max entries to return (100)
       -age days     Only fetch accessions entered days ago
       -verbose      Show brief description line
       -count        Just retrieve the count that would be retrieved

Database specifiers:
  m  MEDLINE
  p  Protein
  n  Nucleotide
  t  3-D structure
  c  Genome
Example search:
  gb_search -verbose -db n 'Oryza sativa[Organism] AND EST[Keyword]'
Some common field modifiers:
  [All Fields]
  [Accession]
  [Author Name]
  [Feature Key]
  [Gene Name]
  [Keyword]
  [Organism]

This program is a command-line front end to the GenBank/EMBL Entrez system. Given a query in Entrez's query format, this program will perform a search and return a list of accession numbers that satisfy the query on standard output. The list can be piped to the standard input of gb_get in order to retrieve the entries.

Command-line options allow you to set the database to search, indicate the maximum number of entries to retrieve, or limit the age of the entries to retrieve (for use in programs that retrieve "new" entries). You may also toggle verbose reporting, in which the brief description line of each retrieved entry is also displayed.
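
When gb_search is driven from another program, the query string has to be assembled and passed as a single argument. The sketch below (in Python, as an illustration only) builds a term from the field modifiers listed in the usage message and produces an argument list that follows the synopsis. The helper names entrez_term and gb_search_argv are hypothetical, and the sketch only constructs the command; it does not run it.

```python
import shlex


def entrez_term(*clauses, op="AND"):
    """Join clauses into one query string.

    A clause is either a plain string ("rRNA") or a (text, field) pair,
    which is rendered as text[field], e.g. "Oryza sativa[Organism]".
    """
    parts = []
    for clause in clauses:
        if isinstance(clause, tuple):
            text, field = clause
            parts.append(f"{text}[{field}]")
        else:
            parts.append(clause)
    return f" {op} ".join(parts)


def gb_search_argv(term, db="n", verbose=False):
    """Build an argument list: options first, then the query string."""
    argv = ["gb_search", "-db", db]
    if verbose:
        argv.append("-verbose")
    argv.append(term)
    return argv


term = entrez_term(("Oryza sativa", "Organism"), ("EST", "Keyword"))
# shlex.join quotes the query so it can be pasted into a shell.
print(shlex.join(gb_search_argv(term, verbose=True)))
```

Passing the list to a process launcher without a shell avoids quoting problems with the square brackets and spaces in the query.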

This program requires the LWP library, which can be found on the Comprehensive Perl Archive Network.

Example Session:

List all Tetrahymena ribosomal RNAs
% gb_search -v 'Tetrahymena pyriformis[Organism] AND rRNA'

AF010381	Tetrahymena pyriformis B 23S ribosomal RNA gene, partial sequence
AF010380	Tetrahymena pyriformis A 23S ribosomal RNA gene, partial sequence
AF010376	Tetrahymena pyriformis 23S ribosomal RNA gene, partial sequence
AF013937	Tetrahymena pyriformis 23S ribosomal RNA gene, partial sequence
J02668	T.pyriformis mitochondrial left large ribosomal RNA beta subunit (21S rRNA) gene, clone TmHi1.3
M14093	T.pyriformis mitochondrial right ribosmal RNA large (beta) subunit (rRNA) gene, clone TmHi0.8
Z22614	T.pyriformis polyubiquitin and 5S rRNA genes
X56171	T.pyriformis gene for small subunit ribosomal RNA (16S like rRNA)
V01413	Fragment of the Tetrahymena gene coding for 35S ribosomal RNA
X54004	Tetrahymena pyriformis gene for 26S large subunit ribosomal RNA
X01533	Tetrahymena pyriformis 5.8S ribosomal RNA
X04822	Tetrahymena mitochondrial DNA for 21S ribosomal RNA (rRNA)
X05203	Tetrahymena mitochondrial small 14S rDNA
M58010	T.pyriformis mitochondrial ribosomal RNA large subunit, Leu- and Tyr-tRNA genes
M58011	T.pyriformis mitochondrial ribosomal RNA large subunit, Leu- and Met-tRNA genes
M19225	T.pyriformis 28S large subunit rRNA, 5' end
...

gb_get

Synopsis:

gb_get [options] accession1  accession2  ...

Retrieves Genbank entries from a list of accession
numbers.  If no accession numbers are present on the
command line, or if the magic "-" argument is given,
will read accession numbers from standard input.

Outputs a series of Boulder records for each accession
number.

Options:
  -accessor Entrez|Yank  Use the Entrez|Yank methods
                            for retrieving records.
  -delay    seconds     Seconds to sleep between retrievals (10)

Options may be abbreviated, e.g. -acc E

This program retrieves a series of GenBank/EMBL records from either a local or a remote database. The records are parsed and output in BoulderIO format as a long stream, making them convenient to pass into scripts that will further process the records.

If you have a copy of GenBank/EMBL indexed with the Yank program, you can fetch records very quickly from your local filesystem. Otherwise the program will contact Entrez to retrieve the information directly from NCBI. Choose the access mode with the -accessor argument. Entrez is the default if not otherwise specified.

Because gb_get may put a strain on NCBI's servers, there is a default 10 second wait between fetches when the Entrez method is used. You can adjust this value with the -delay option.
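
The same throttling idea applies to any script that fetches records one at a time. The following is a minimal Python sketch of a pause between retrievals, not gb_get's actual code: the function name fetch_with_delay and the fetch callback are hypothetical, and it assumes the pause falls between fetches rather than before the first one.

```python
import time
from typing import Callable, Iterable, Iterator


def fetch_with_delay(
    accessions: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield fetch(acc) for each accession, pausing `delay` seconds between calls."""
    for i, acc in enumerate(accessions):
        if i > 0:
            sleep(delay)  # no pause before the first retrieval
        yield fetch(acc)


# Demonstration with a stand-in fetcher and a recorded (not real) sleep.
pauses = []
records = list(
    fetch_with_delay(
        ["AF010381", "AF010380", "AF010376"],
        fetch=lambda acc: f"Accession={acc}",
        delay=10,
        sleep=pauses.append,
    )
)
print(records, pauses)
```

Injecting the sleep function keeps the example fast to run; in real use the default time.sleep would apply.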

Example Session:

Fetch all Tetrahymena ribosomal RNAs
% gb_search 'Tetrahymena pyriformis[Organism] AND rRNA' \
     | gb_get

Organism=Tetrahymena pyriformis Eukaryotae; Alveolata; Ciliophora; Oligohymenophorea; Hymenostomatida; Tetrahymenina; Tetrahymena.
Source=Tetrahymena pyriformis.
Title=Comparison of Sequence Differences in a Variable 23S rRNA Domain among Sets of Cryptic Species of Ciliated Protozoa
Title=Direct Submission
Basecount={
  a=64
  c=35
  t=29
  g=62
}
Authors=Nanney,D.L., Park,C., Preparata,R. and Simon,E.M.
Authors=Nanney,D.L., Park,C., Preparata,R. and Simon,E.M.
Locus=AF010381      190 bp    DNA             INV       12-OCT-1997
Accession=AF010381
Keywords=.
Sequence=agcgggaatccggggagtcaggtcagacatcaaagggaaaactagaccaaactgggggttagagtccactgaggaagttagacttgagtaaaacagaagactggccgcatgcttcaagacacaggaaaggaatgagtagctggaaagcatagctgaggcgtcactcattgcgaagggggaatacgcggca
Definition=Tetrahymena pyriformis B 23S ribosomal RNA gene, partial sequence.
Journal=J. Eukaryot. Microbiol. (1997) In press
Journal=Submitted (27-JUN-1997) Ecology, Ethology, and Evolution, University of Illinois, 505 S. Goodwin Ave., Urbana, IL 61801, USA
Nid=g2507650
Features={
  Source={
    Organism=Tetrahymena pyriformis
    Strain=FL 20o
    Db_xref=taxon:5908
    Position=1..190 
    Note=coded in article as 137
  }
  Rrna={
    Product=23S ribosomal RNA
    Position=<1..>190 
    Note=D2 domain
  }
}
Reference=1  (bases 1 to 190)
Reference=2  (bases 1 to 190)
=
Organism=Tetrahymena pyriformis Eukaryotae; Alveolata; Ciliophora; Oligohymenophorea; Hymenostomatida; Tetrahymenina; Tetrahymena.
Source=Tetrahymena pyriformis.
Title=Comparison of Sequence Differences in a Variable 23S rRNA Domain among Sets of Cryptic Species of Ciliated Protozoa
Title=Direct Submission
Basecount={
  a=63
...
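
The stream shown above has a simple shape: tag=value lines, nested sub-records opened with tag={ and closed with }, and a lone = ending each record. The distribution's own Perl modules are the intended way to read it; as a rough illustration of the structure only, here is a minimal Python sketch inferred from the sample output. The function name parse_boulder is hypothetical, and the sample data is abbreviated from the session above with a stub second record.

```python
def parse_boulder(lines):
    """Yield one dict per record; every tag maps to a list of values.

    Values are strings, or nested dicts for tag={ ... } sub-records.
    Lists are used because tags such as Title may repeat.
    """
    root = {}
    stack = [root]  # innermost open (sub-)record is stack[-1]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "=":  # end of record
            yield root
            root = {}
            stack = [root]
        elif line == "}":  # end of sub-record
            if len(stack) > 1:
                stack.pop()
        else:
            tag, sep, value = line.partition("=")
            if not sep:
                continue  # skip anything that is not tag=value
            if value == "{":
                child = {}
                stack[-1].setdefault(tag, []).append(child)
                stack.append(child)
            else:
                stack[-1].setdefault(tag, []).append(value)
    if root:  # trailing record without a terminator
        yield root


sample = """\
Accession=AF010381
Title=Comparison of Sequence Differences
Title=Direct Submission
Basecount={
  a=64
  c=35
}
Features={
  Source={
    Strain=FL 20o
    Position=1..190
  }
}
=
Accession=AF010380
=
"""
records = list(parse_boulder(sample.splitlines()))
print(records[0]["Accession"], records[0]["Features"][0]["Source"][0]["Strain"])
```

Storing every value in a list keeps repeated tags (two Title lines, two Journal lines) without special cases, at the cost of an extra [0] when a tag occurs once.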

Installation

These programs were developed for use in my laboratory, and have not yet been sufficiently generalized. It may be necessary to make small adjustments in the source code in order to get these scripts to work in your hands. Here are some suggestions:

quickblast
The constant WUBLAST is hard-coded to point to the location of the blast, blastp, and tblastx binaries (and others). You will need to adjust this for your environment.

gb_get
The Boulder/Genbank.pm file contains hard-coded path names for the location of your site's yank index and GenBank/EMBL flat files. The constants in question are YANK and DEFAULT_GB_PATH. These will need to be adjusted if you are using a yank-indexed database. This is not necessary for the default Entrez mode.

Aside from these customizations, it may be necessary to change the path to the perl interpreter if it is installed in an unusual location. Currently the scripts expect Perl to be located at /usr/local/bin/perl.


Bug Reports

These tools have been in use for approximately one year in my laboratory but have not been used outside. Therefore they probably do contain bugs. Please send bug reports, feature requests and other comments to lstein@cshl.org.

Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Wed Nov 18 15:08:40 EST 1998