The following applications have been built on top of BoulderIO and are maintained at Cold Spring Harbor. They are all part of the main distribution and can be found in the eg subdirectory.
Synopsis:
quickblast.pl [options] -source source_file search_file
Run BLAST on one or more sequences and summarize the results. The
source database is an ordinary fasta file. The program runs pressdb
to create a temporary blast database in /usr/tmp (or the location
given by TMPDIR).
Options:
-source database Source fasta database (no default)
-dir path Directory in which to save intermediate results (don't save)
-tmp path Scratch directory
-program path Variant of BLAST to run (blastn)
-params string Parameters to pass to program
-minlen float Minimum HSP length, as fraction of total search length (0.0)
-cutoff float Minimum significance cutoff
-tabular Produce output in tabular format
-boulder Produce output in boulder format (default)
This program is a front end to the Washington University version of
the BLAST program for comparing nucleotide
and protein sequences. Given two FASTA-format files, it handles the
details of invoking pressdb or setdb to
create searchable databases. It then performs an M x N
comparison of the contents of the two files.
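As a rough illustration of what an M x N comparison means, the sketch below reads two FASTA inputs and lists every search/source pairing. It is a standalone Python sketch, not code from quickblast.pl; the names read_fasta and all_pairs and the sample sequences are hypothetical.

```python
from itertools import product
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is taken to be the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def all_pairs(source_lines, search_lines):
    """Return every (search_id, source_id) combination, M x N in total."""
    source_ids = [name for name, _ in read_fasta(source_lines)]
    search_ids = [name for name, _ in read_fasta(search_lines)]
    return list(product(search_ids, source_ids))


# Hypothetical inputs: 2 source entries and 3 search entries.
source = [">db1 first entry", "ACGT", "ACGT", ">db2", "GGCC"]
search = [">q1", "ACG", ">q2", "TTT", ">q3", "CCC"]
pairs = all_pairs(source, search)
print(len(pairs))  # 6
```

With M search sequences and N source sequences, the number of comparisons grows as M times N, which is why the script builds a searchable database once rather than comparing the files directly.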
You can set the output from this script to be any of the following:
Synopsis:
gb_search [options] query string
Query GenBank for a list of accession numbers. The query
string should be in the form recognized by NCBI's term parser.
See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for examples.
Options:
-db database Database to search (n)
-max max Max entries to return (100)
-age days Only fetch accessions entered up to this many days ago
-verbose Show brief description line
-count Just report the number of entries that would be retrieved
Database specifiers:
m MEDLINE
p Protein
n Nucleotide
t 3-D structure
c Genome
Example search:
gb_search -verbose -db n 'Oryza sativa[Organism] AND EST[Keyword]'
Some common field modifiers:
[All Fields]
[Accession]
[Author Name]
[Feature Key]
[Gene Name]
[Keyword]
[Organism]
This program is a command-line front end to the Genbank/EMBL Entrez system. Given a query in Entrez's query format, this program will perform a search and write a list of accession numbers that satisfy the query to standard output. The list can be piped to the standard input of gb_get in order to retrieve the entries.
Command-line options allow you to set the database to search, indicate the maximum number of entries to retrieve, or limit the age of the entries to retrieve (for use in programs that retrieve "new" entries). You may also toggle verbose reporting, in which the brief description line of each retrieved entry is also displayed.
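When queries are generated by another script, it can help to assemble the term and the command line programmatically. The following Python sketch shows one way to do that; entrez_term is a hypothetical helper, and only the options and field modifiers listed above are used.

```python
import shlex


def entrez_term(*clauses, operator="AND"):
    """Join clauses into one Entrez query string.

    A clause is either a plain string or a (text, field) tuple; the field
    is appended in square brackets, as in 'Oryza sativa[Organism]'.
    """
    parts = []
    for clause in clauses:
        if isinstance(clause, tuple):
            text, field = clause
            parts.append(f"{text}[{field}]")
        else:
            parts.append(clause)
    return f" {operator} ".join(parts)


term = entrez_term(("Oryza sativa", "Organism"), ("EST", "Keyword"))
# Build the argument list, then quote it for display in a shell.
argv = ["gb_search", "-verbose", "-db", "n", term]
command = shlex.join(argv)
print(command)
# gb_search -verbose -db n 'Oryza sativa[Organism] AND EST[Keyword]'
```

Keeping the arguments in a list and quoting only for display avoids problems with the spaces and square brackets in the query.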
This program requires the LWP library, which can be found on the Comprehensive Perl Archive Network.
% gb_search -v 'Tetrahymena pyriformis[Organism] AND rRNA'
AF010381 Tetrahymena pyriformis B 23S ribosomal RNA gene, partial sequence
AF010380 Tetrahymena pyriformis A 23S ribosomal RNA gene, partial sequence
AF010376 Tetrahymena pyriformis 23S ribosomal RNA gene, partial sequence
AF013937 Tetrahymena pyriformis 23S ribosomal RNA gene, partial sequence
J02668 T.pyriformis mitochondrial left large ribosomal RNA beta subunit (21S rRNA) gene, clone TmHi1.3
M14093 T.pyriformis mitochondrial right ribosmal RNA large (beta) subunit (rRNA) gene, clone TmHi0.8
Z22614 T.pyriformis polyubiquitin and 5S rRNA genes
X56171 T.pyriformis gene for small subunit ribosomal RNA (16S like rRNA)
V01413 Fragment of the Tetrahymena gene coding for 35S ribosomal RNA
X54004 Tetrahymena pyriformis gene for 26S large subunit ribosomal RNA
X01533 Tetrahymena pyriformis 5.8S ribosomal RNA
X04822 Tetrahymena mitochondrial DNA for 21S ribosomal RNA (rRNA)
X05203 Tetrahymena mitochondrial small 14S rDNA
M58010 T.pyriformis mitochondrial ribosomal RNA large subunit, Leu- and Tyr-tRNA genes
M58011 T.pyriformis mitochondrial ribosomal RNA large subunit, Leu- and Met-tRNA genes
M19225 T.pyriformis 28S large subunit rRNA, 5' end
...
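A downstream script may want the accession numbers and descriptions separately. The Python sketch below splits verbose output into pairs, assuming one entry per line with the accession number as the first whitespace-delimited field; parse_verbose is a hypothetical name, not part of the distribution.

```python
def parse_verbose(lines):
    """Split 'ACCESSION description' lines into (accession, description)."""
    entries = []
    for line in lines:
        line = line.strip()
        # Skip blanks, an echoed shell prompt line, and a trailing ellipsis.
        if not line or line.startswith("%") or line == "...":
            continue
        fields = line.split(None, 1)
        accession = fields[0]
        description = fields[1] if len(fields) > 1 else ""
        entries.append((accession, description))
    return entries


# Two lines taken from the sample output above.
output = [
    "AF010381 Tetrahymena pyriformis B 23S ribosomal RNA gene, partial sequence",
    "X01533 Tetrahymena pyriformis 5.8S ribosomal RNA",
    "...",
]
entries = parse_verbose(output)
print([accession for accession, _ in entries])  # ['AF010381', 'X01533']
```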
Synopsis:
gb_get [options] accession1 accession2 ...
Retrieves Genbank entries from a list of accession
numbers. If no accession numbers are present on the
command line, or if the magic "-" argument is given,
will read accession numbers from standard input.
Outputs a series of Boulder records for each accession
number.
Options:
-accessor Entrez|Yank Use the Entrez|Yank methods
for retrieving records.
-delay seconds Seconds to sleep between retrievals (10)
Options may be abbreviated, e.g. -acc E
This program retrieves a series of Genbank/EMBL records from either a local or a remote database. The records are parsed and output in BoulderIO format as a long stream, making them convenient to pass into scripts that will further process the records.
If you have a copy of Genbank/EMBL indexed with the Yank program, you can fetch records very quickly from your local filesystem. Otherwise the program will contact Entrez to retrieve the information directly from NCBI. Choose the access mode with the -accessor argument. Entrez is the default if not otherwise specified.
Because gb_get may put a strain on NCBI's servers, there is a default 10-second wait between fetches when the Entrez method is used. You can adjust this value with the -delay option.
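The idea of pausing between fetches can be sketched in a few lines of Python. This is an illustration of the technique only, not gb_get's own code: fetch_all, its fetch callback, and the choice to pause between calls rather than before the first one are all assumptions.

```python
import time


def fetch_all(accessions, fetch, delay=10.0, sleep=time.sleep):
    """Call fetch(accession) for each accession, pausing between calls."""
    results = []
    for index, accession in enumerate(accessions):
        if index > 0:
            sleep(delay)  # wait between fetches, not before the first one
        results.append(fetch(accession))
    return results


# Demonstration with a stand-in fetch function and a recorded "sleep",
# so the example runs instantly and contacts no server.
pauses = []
records = fetch_all(
    ["AF010381", "AF010380", "AF010376"],
    fetch=lambda accession: f"record for {accession}",
    delay=10,
    sleep=pauses.append,
)
print(len(records), pauses)  # 3 [10, 10]
```

Passing the sleep function as a parameter makes the delay easy to test without actually waiting.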
% gb_search 'Tetrahymena pyriformis[Organism] AND rRNA' \
| gb_get
Organism=Tetrahymena pyriformis Eukaryotae; Alveolata; Ciliophora; Oligohymenophorea; Hymenostomatida; Tetrahymenina; Tetrahymena.
Source=Tetrahymena pyriformis.
Title=Comparison of Sequence Differences in a Variable 23S rRNA Domain among Sets of Cryptic Species of Ciliated Protozoa
Title=Direct Submission
Basecount={
a=64
c=35
t=29
g=62
}
Authors=Nanney,D.L., Park,C., Preparata,R. and Simon,E.M.
Authors=Nanney,D.L., Park,C., Preparata,R. and Simon,E.M.
Locus=AF010381 190 bp DNA INV 12-OCT-1997
Accession=AF010381
Keywords=.
Sequence=agcgggaatccggggagtcaggtcagacatcaaagggaaaactagaccaaactgggggttagagtccactgaggaagttagacttgagtaaaacagaagactggccgcatgcttcaagacacaggaaaggaatgagtagctggaaagcatagctgaggcgtcactcattgcgaagggggaatacgcggca
Definition=Tetrahymena pyriformis B 23S ribosomal RNA gene, partial sequence.
Journal=J. Eukaryot. Microbiol. (1997) In press
Journal=Submitted (27-JUN-1997) Ecology, Ethology, and Evolution, University of Illinois, 505 S. Goodwin Ave., Urbana, IL 61801, USA
Nid=g2507650
Features={
Source={
Organism=Tetrahymena pyriformis
Strain=FL 20o
Db_xref=taxon:5908
Position=1..190
Note=coded in article as 137
}
Rrna={
Product=23S ribosomal RNA
Position=<1..>190
Note=D2 domain
}
}
Reference=1 (bases 1 to 190)
Reference=2 (bases 1 to 190)
=
Organism=Tetrahymena pyriformis Eukaryotae; Alveolata; Ciliophora; Oligohymenophorea; Hymenostomatida; Tetrahymenina; Tetrahymena.
Source=Tetrahymena pyriformis.
Title=Comparison of Sequence Differences in a Variable 23S rRNA Domain among Sets of Cryptic Species of Ciliated Protozoa
Title=Direct Submission
Basecount={
a=63
...
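The stream shown above has a simple shape: Tag=value lines, nested blocks opened by Tag={ and closed by }, and a line containing only = between records. The following Python sketch parses that shape into nested dictionaries. It is inferred from the sample output alone and is not the BoulderIO library's parser; parse_boulder is a hypothetical name, and the sample record is abbreviated.

```python
def parse_boulder(lines):
    """Yield one dict per record from Boulder-format lines.

    Every tag maps to a list of values, because tags such as Title or
    Authors may repeat. Nested 'Tag={ ... }' blocks become nested dicts.
    """
    record = {}
    stack = [record]            # the innermost open block is stack[-1]
    for raw in lines:
        line = raw.strip()
        if line == "=":         # record delimiter
            yield record
            record = {}
            stack = [record]
        elif line == "}":       # close the innermost nested block
            if len(stack) > 1:
                stack.pop()
        elif "=" in line:
            tag, _, value = line.partition("=")
            if value == "{":    # open a nested block under this tag
                child = {}
                stack[-1].setdefault(tag, []).append(child)
                stack.append(child)
            else:
                stack[-1].setdefault(tag, []).append(value)
        # Lines without '=' (blank lines, a trailing '...') are ignored.
    if record:                  # final record without a trailing '='
        yield record


sample = """\
Accession=AF010381
Title=Comparison of Sequence Differences
Title=Direct Submission
Features={
Source={
Strain=FL 20o
Position=1..190
}
Rrna={
Product=23S ribosomal RNA
}
}
=
Accession=AF010380
=
""".splitlines()

records = list(parse_boulder(sample))
print(len(records))                                          # 2
print(records[0]["Features"][0]["Rrna"][0]["Product"][0])    # 23S ribosomal RNA
```

Storing every value in a list keeps repeated tags, such as the two Title lines, from overwriting each other.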
These programs were developed for use in my laboratory, and have not yet been sufficiently generalized. It may be necessary to make small adjustments in the source code in order to get these scripts to work in your hands. Here are some suggestions:
WUBLAST is hard-coded to point to the location of the blast, blastp, and tblastx binaries (and others). You will need to adjust this for your environment.
YANK and DEFAULT_GB_PATH will need to be adjusted if you are using a yank-indexed database. This is not necessary for the default Entrez mode.
Aside from these customizations, it may be necessary to change the path to the perl interpreter if it is installed in an unusual location. Currently the scripts expect Perl to be located at /usr/local/bin/perl.
These tools have been in use for approximately one year in my laboratory but have not been used outside. Therefore they probably do contain bugs. Please send bug reports, feature requests and other comments to lstein@cshl.org.