
Introduction to Human Genome Computing via the World Wide Web

Lincoln D. Stein
Whitehead Institute/MIT Center for Genome Research
Cambridge, MA, USA
TEL: 617 252-1900
FAX: 617 252-1902
Email: lstein@genome.wi.mit.edu

Since the first edition of this book was published, the genome community has seen an explosion in the number and variety of resources available over the Internet. In large part this explosion is due to the invention of the World Wide Web, a system of document linking and integration that has gone from obscurity to commonplace in a mere five years.

The genome community was an early adopter of the Web, finding in it a way to publish its vast accumulation of data, and to express the rich interconnectedness of biological information. The Web is the home of primary data, of genome maps, of expression data, of DNA and protein sequences, of X-ray crystallographic structures, and of the genome project's huge outpouring of publications. This data, spread out among thousands of individual laboratories and community databases, is hotlinked throughout. Researchers who wish to learn more about a particular gene can (with a bit of patience) move from physical map to clone to sequence to disease linkage to literature references and back again, all without leaving the comfort of their Web browser application.

However, the Web is much more than a static repository of information. It is increasingly being used as a front end for sophisticated analytic software. Sequence similarity search engines, protein structural motif finders, exon identifiers, and even mapping programs have all been integrated into the Web. Java applets are adding rapidly to Web browsers' capabilities, enabling pages to be far more interactive than the original click-fetch-click interface. It may soon be possible for biologists to do all their computational work with no more than a browser on their desktop computers.

This chapter is an illustrated tour of the World Wide Web from the genome biologist's perspective. It doesn't pretend to be a technical discussion of Web protocols or to explain how things work. Nor does it attempt to be an exhaustive listing of all the myriad Web resources available, which would be exhausting for the reader and the author both! Instead I have attempted to "touch all the bases," showing you the range of resources available and giving guidance on how to learn more. Other chapters in this book delve more deeply into selected topics of Web and genome.

URLs (Uniform Resource Locators) are great for interactive Web browsing but terrible when they appear on the printed page. To spare the typesetter many lost nights of sleep, I have gathered all the URLs up and placed them in a table at the end of this chapter. Within the body of the text I refer to them by descriptive names such as "Pedro's Home Page" rather than by their less-friendly addresses.

Equipment for the Tour

Web Browser

The Web was designed to run with any browser software, on any combination of hardware and operating system. However, the pace of change has outstripped many software developers. Although some Web sites can still be viewed with older browsers (such as the venerable NCSA Mosaic or the Windows Cello browser), many require advanced features found only in recent browsers from the Microsoft and Netscape companies. For most effective genome browsing, I recommend one of the following browsers:
  1. Netscape Navigator, 3.02 or higher
  2. Netscape Communicator, 4.01 or higher
  3. Microsoft Internet Explorer, 3.01 or higher

These browsers can be downloaded for free from Netscape's and Microsoft's home pages. They are also available in shrink-wrapped form from most computer stores and mail order outfits.

Although it's good to have a recent version of the browser software installed, it may not be such a good idea to use the most recent, as these versions often contain bugs that cause frustrating crashes. Be particularly wary of "prerelease," "preview," and "beta" browser versions.

Internet Connection

A direct connection to the Internet is a necessity for Web browsing. Most academic centers, government labs and private companies have a fast Internet connection of at least 56 Kbps (56,000 bits per second). This will be more than adequate for Web browsing purposes.

Home users usually dial into the Internet via an Internet service provider using a PPP (Point-to-Point Protocol) or SLIP (Serial Line Internet Protocol) connection. For such users a 28.8 Kbps modem or better is strongly recommended.
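
To put these line speeds in perspective, the arithmetic below estimates how long a file takes to transfer at each rate. It is only a rough sketch: the 100,000-byte page size is an assumed figure chosen for illustration, and protocol overhead, compression and network latency are ignored.

```python
def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealized transfer time: total bits divided by line speed."""
    return size_bytes * 8 / bits_per_second

PAGE_BYTES = 100_000  # assumed size of a Web page with a few images

for label, bps in [("28.8 Kbps modem", 28_800), ("56 Kbps line", 56_000)]:
    print(f"{label}: {transfer_seconds(PAGE_BYTES, bps):.1f} s")
```

Real transfers will be slower than these idealized figures, which is one reason image-heavy genome map pages can feel sluggish over a modem.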

Genome Databases

The WWW Virtual Library: Genetics

[Figure 1] Although we could start our tour anywhere, a good place to begin is the genetics division of the WWW Virtual Library, a distributed topic-oriented collection of Web resources (Figure 1). This page contains links to several hundred sites around the world, organized by organism.

The list of organisms in the left-hand frame provides a quick way to jump to the relevant section. Click on the link labeled "Human" to see sites under this heading. Subheadings direct you to a variety of U.S. and international sites, as well as to chromosome-specific Web pages and search services.

Entrez

[Figure 2] We select the link for GenBank, which takes us to the home page of the National Center for Biotechnology Information (Figure 2). NCBI administers GenBank, the main repository for all published nucleotide sequencing information. Links from its home page will take you to GenBank, as well as to SwissProt, the protein sequence repository; OMIM, the Online Mendelian Inheritance in Man collection of genetic disorders; MMDB, a database of crystallographic structures; and several other important resources.

While there are several ways to access GenBank and the other databases, the most useful interface is the Entrez search engine, an integrated Web front end to many of the databases that NCBI supports. To access Entrez, we click on the labeled button in the navigation bar at the top of the window.

[Figure 3] This takes us to the Entrez welcome page shown in Figure 3. The links on this page point to Entrez's six main divisions:

1. PubMed Division
This is an interface to the MEDLINE bibliographic citation service. Some 9 million citations of papers in the biological and biomedical literature are available, going back as far as 1966. Most citations are accompanied by full abstracts.

2. Nucleotide Database
This is the GenBank collection of nucleotide sequences, now merged with the EMBL database.

3. Protein Database
This database combines primary protein sequencing data from SwissProt and other protein database collections, with protein sequences derived from translated GenBank entries.

4. 3-D Structures Database
This contains protein 3D structural information derived from X-ray crystallography and NMR. The source of the structures is the MMDB (Molecular Modeling Database) maintained at Brookhaven National Laboratories.

5. Genomes Database
This is a compilation of genetic and physical maps from a variety of species. Maps of similar regions are integrated to allow for comparisons among them.

6. Taxonomy
This is the phylogenetic taxonomy used throughout GenBank. Its primary purpose is as a consultation guide to obscure species.

[Figure 4] A search on the nucleotide division will illustrate how Entrez works. For this example, we'll say that we're interested in information on the "sushi" family of repeats found in many serine proteases. Selecting the link labeled "Search the NCBI nucleotide database" displays a page similar to the one shown in Figure 4. There is a single large text field in which to type keyword search terms, as well as a popup menu that allows us to limit the search to certain database fields. Available fields depend on which database we're searching. In the case of the nucleotide database, there are nearly two dozen fields covering everything from the name of the author who submitted the sequence entry to the sequence length. In this case we accept the default, which is "All Fields." We type "sushi" into the text field and press the "Search" button.

[Figure 5] Searches rarely take longer than a few seconds to complete. The page that appears now (Figure 5) indicates that eight entries matched our search. This is a small enough number that we can display the entire list. In cases where too many matches are found, Entrez allows us to add new search terms, progressively narrowing down the search until the number of hits is manageable. We press the button labeled "Retrieve 8 documents."
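
The idea of a field-limited keyword search followed by progressive narrowing can be sketched in a few lines. This is a toy model, not how Entrez is implemented: the records, their field names and the `search` helper are all invented for illustration.

```python
# Toy records standing in for database entries; the field names and
# contents are invented for illustration only.
RECORDS = [
    {"id": "rec1", "title": "sushi repeat protein precursor", "organism": "human"},
    {"id": "rec2", "title": "serine protease with sushi domain", "organism": "mouse"},
    {"id": "rec3", "title": "zinc finger protein", "organism": "human"},
]

def search(records, term, field=None):
    """Return records whose chosen field (or any field) contains the term."""
    term = term.lower()
    hits = []
    for rec in records:
        values = [rec[field]] if field else list(rec.values())
        if any(term in str(v).lower() for v in values):
            hits.append(rec)
    return hits

# First pass: search every field, like the "All Fields" default.
hits = search(RECORDS, "sushi")
# Narrowing: apply a second term to the previous hits only.
narrowed = search(hits, "human", field="organism")
print([r["id"] for r in hits], [r["id"] for r in narrowed])
```

Each added term is applied to the previous result set, so the list of hits can only shrink, which is what makes the narrowing approach manageable for large result sets.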

[Figure 6] A page listing a series of GenBank entries that match the search now appears (Figure 6). This is a complex page with multiple options. Each entry is associated with a checkbox to its left. You may select all or a subset of the entries on the list and click the "Display" button at the top of the page. This will generate a summary page that reports on each of the selected entries. The popup menu at the top of the page allows you to choose the format of the report. Choices include the standard GenBank format, a list of bibliographic references for the selected entries, the list of protein "links", and the list of nucleotide "neighbors" (more on "links" and "neighbors" later).

You may also retrieve information about a single entry. Following each entry's description is a series of hypertext links, each leading to a page that gives more information about the entry. Depending on the entry, certain links may or may not be present. A brief description of these links is as follows:

GenBank report
This shows the raw GenBank entry in the form that most biologists are familiar with.

Sequence report
This is the GenBank entry in a friendlier text format.

FASTA report
Just the nucleotide sequence in the format accepted by the FASTA similarity searching program.

ASN.1 format
A structured format used by the NCBI databases (and almost no one else).

Graphical view
For sequences derived from cosmids, BACs and other contigs, this shows a graphical representation of the sequencing strategy.

XX genome links
If the entry corresponds to a sequence that has been placed on one or more physical or genetic maps, this link appears. Selecting it will jump to the Entrez Genomes division (see below). "XX" will be replaced by the number of maps the entry appears on.

XX MEDLINE links
If a published paper refers to the entry, this link appears. Selecting it will jump to the list of paper(s) in the Entrez bibliographic division.

XX structural links
As above, but for 3D structures (see below).

XX protein links
This corresponds to protein sequences related to the entry. If the entry contains an open reading frame (real or predicted) there will be at least one protein link.

XX nucleotide neighbors
Each nucleotide entry added to GenBank is routinely BLASTed against all previous entries (see below) to create precalculated sets of "neighbors" that share sequence similarity. If a nucleotide entry has any sequence similarity neighbors, this link will appear.

[Figure 7] To continue our example, we decide to investigate GenBank entry U78093, described as a human "sushi-repeat-containing protein precursor." Selecting "1 MEDLINE links" takes us to a page that lists the one paper that refers to this entry (not shown), and prompts us to select the citation format in which to display the article. Selecting the default format displays the citation shown in Figure 7. This article indicates that the gene in question is deleted in some retinitis pigmentosa patients, and offers us links (in the form of buttons) to related articles, other relevant DNA and protein sequences, and to entries in OMIM that deal with retinitis pigmentosa.

Returning to our original list of sushi sequences, we can now select the link labeled "9 nucleotide neighbors." This takes us to a list of all the sequence entries in GenBank that have significant BLAST homologies to U78093. Here we find several EST (expressed sequence tag) entries produced by the Washington University/Merck cDNA sequencing project. It is possible that some of them represent previously undescribed members of the sushi family of serine proteases.
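
The reason the neighbor list appears instantly is that it was computed ahead of time. The sketch below illustrates that precalculation idea only. It substitutes a crude shared k-mer count for BLAST, and the sequences, IDs and thresholds are invented, so it should not be read as the method NCBI uses.

```python
from itertools import combinations

def kmers(seq, k=4):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def precompute_neighbors(entries, k=4, min_shared=3):
    """Map each entry ID to the IDs sharing at least min_shared k-mers.

    Shared k-mer counting is a crude stand-in for BLAST, used here only
    to show the precalculation idea.
    """
    profiles = {name: kmers(seq, k) for name, seq in entries.items()}
    neighbors = {name: [] for name in entries}
    for a, b in combinations(entries, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            neighbors[a].append(b)
            neighbors[b].append(a)
    return neighbors

# Invented sequences; the IDs are placeholders, not real accessions.
entries = {
    "seqA": "ATGGCCATTGTAATGGGCC",
    "seqB": "GGCCATTGTAATGCA",
    "seqC": "TTTTTTTTTTTTTTT",
}
table = precompute_neighbors(entries)
print(table["seqA"])  # a dictionary lookup, no search at query time
```

The expensive all-against-all comparison is paid once when the table is built; following a "neighbors" link then costs no more than a lookup.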

[Figure 8] The other Entrez divisions provide a similar search-link-follow interface. The exception is the genomes division, which, because it has fewer entries than the others, is entered through a straightforward listing of prominent organisms and the genome maps available for them. From the Entrez welcome page, we select "Search the NCBI genomes database" and then "Homo sapiens" from the list of prominent organisms (not shown). This leads us to a list of 26 maps (22 autosomes, one sex chromosome and three mitochondrial maps) from which we select human chromosome 14. This leads us to a page (Figure 8) that displays a single prominent image in the center. The image shows a series of genetic and physical maps published from a variety of sources, roughly aligned, with diagonal lines connecting common features. The image is "live", meaning that we can click on it to magnify areas or to view information about individual maps. When the magnification is large enough to see individually mapped objects (sequences, genetic loci and STSs), clicking on them will take us to a page showing the object's GenBank record, where we can learn more about it in the manner described above.

If you are interested in a known physical or genetic region and wish to view it directly, the genomes division interface allows you to type in the names of the two mapped loci that define the region. The map will be expanded and scrolled to the proper area. You can then examine the map for interesting candidate genes near the region of interest.

Entrez's 3D-structure division contains entries for several thousand proteins and other macromolecules whose structures have been determined by X-ray crystallography and/or NMR. The entries are fully linked to related entries in the nucleotide, protein, citation, and genome divisions.

In order to get the most out of the 3D structures, you will need to install a "helper application" to view and explore the MMDB structure files. Two different helpers are supported by Entrez: RasMol and Kinemage. Both are available in versions that run on Macintosh, Windows, and Unix systems. You will need to obtain and install one of these software packages, then configure your browser to launch it automatically to view a structure file. Full instructions can be found at Entrez's MMDB FAQ (frequently asked questions) page.

The search interface to the 3D structures division is nearly identical to the one used for nucleotide and protein sequences. You enter one or more keywords into a text field and press the "Search" button, optionally limiting the scope of the search to a particular field. However, the retrieved entries will contain two links that we haven't seen before: "Structure Summary" and "XX structure neighbors". The first link retrieves a page that describes the entry's structure in a standardized format. The second link indicates the presence of one or more entries that are structurally "similar" to the entry. "Similarity", in the case of 3D structures, is determined by an algorithm that measures the two molecules' volume of overlap.
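
To give a feel for what "volume of overlap" can mean, here is a deliberately simplified sketch. It is not the algorithm Entrez uses: it assumes the two structures have already been superimposed, approximates each molecule by the grid cells its atoms fall in, and scores the fraction of cells they share. The coordinates, cell size and function names are all hypothetical.

```python
def occupied_cells(atoms, cell=1.0):
    """Map atom coordinates to the set of grid cells they fall in."""
    return {(int(x // cell), int(y // cell), int(z // cell)) for x, y, z in atoms}

def overlap_score(atoms_a, atoms_b, cell=1.0):
    """Shared cells divided by all cells covered by either molecule."""
    a, b = occupied_cells(atoms_a, cell), occupied_cells(atoms_b, cell)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented coordinates for two tiny "molecules", already superimposed.
mol1 = [(0.2, 0.1, 0.3), (1.5, 0.2, 0.1), (2.4, 0.3, 0.2)]
mol2 = [(0.6, 0.4, 0.5), (1.1, 0.9, 0.2), (1.3, 1.2, 0.1)]
print(overlap_score(mol1, mol2))
```

A score near 1 means the two shapes occupy nearly the same space, while a score near 0 means they barely coincide.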

[Figure 9] Searching for the term "sushi" in this case was ineffective, but searching for "serine protease" was more productive, recovering 136 entries with structural information. Selecting the "Structure Summary" link for any of the matching entries retrieves a page that gives information on the structural determination method and its citation. A series of popup menus and push buttons allow you to retrieve the 3D structure in a variety of formats. Selecting "RasMol" format (assuming that the RasMol viewer is installed) and pressing the "View" button launches the helper application (Figure 9). You are now free to rotate the image with the mouse, magnify it, adjust various display options, and save the structure to local disk for further exploration.

"Gene Map" and UniGene Databases

No tour of the NCBI's Web site is complete without a side trip to the "Gene Map of the Human Genome," a compendium of approximately 16,000 expressed sequences from the UniGene set that have been localized by radiation hybrid mapping (see Chapter 6). These maps were published in late 1996 by a consortium of research groups. Although the maps are already somewhat out of date, it is expected that these pages will be updated at regular intervals.

[Figure 10] From the NCBI home page, select the link labeled "Gene Map of the Human Genome." This leads you to a brightly colored page that offers a series of ideograms of human chromosomes. There are several ways to search this database. If the region you are interested in is defined cytogenetically, just click on the ideogram in the desired region. A page like that shown in Figure 10 will appear showing a list of all mapped expressed sequences in the area. Selecting the GenBank accession numbers of the retrieved sequences will bring up pages with further information about the sequences and how they were mapped. Alternatively, if the region of interest is defined by markers on the Genethon genetic map of the human genome, you can search for all expressed sequences located between any pair of Genethon markers.
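
A search for everything lying between two markers amounts to an interval query on an ordered map. The sketch below shows the idea on an invented chromosome map; the entry names, positions and the `between_markers` helper are hypothetical and are not taken from the Gene Map itself.

```python
def between_markers(map_entries, marker_a, marker_b):
    """Return names of entries located strictly between two named markers.

    map_entries is a list of (name, position) pairs on one chromosome.
    The markers may be given in either order.
    """
    positions = dict(map_entries)
    lo, hi = sorted((positions[marker_a], positions[marker_b]))
    ordered = sorted(map_entries, key=lambda entry: entry[1])
    return [name for name, pos in ordered if lo < pos < hi]

# Invented map: two framework markers and three expressed sequences.
chr_map = [
    ("markerA", 10.0),
    ("est1", 12.5),
    ("est2", 18.0),
    ("markerB", 20.0),
    ("est3", 25.0),
]
print(between_markers(chr_map, "markerB", "markerA"))
```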

There is also a more flexible, but less obvious interface to the gene map database. Look for an inconspicuous link labeled "Research Tools Page" at the bottom of the gene map home page. This will lead you to a page that links to various types of searches, including text-based searches and sequence searches. The latter search is of particular interest. It prompts you to paste a new unknown sequence into a text field. When you press the search button, the NCBI server performs a BLAST sequence similarity search against all the expressed sequences on the gene map. This is a rapid way to find sequence similarities to previously mapped expressed sequences, and may be helpful in certain positional cloning strategies.

Closely related to the gene map is the NCBI UniGene set, a collection of human transcripts that have been clustered in an attempt to create a set of unique expressed sequences (Chapter 9). The UniGene set was compiled from two primary sources: random cDNA sequencing efforts from Washington University and elsewhere (dbEST), and published genes from GenBank. UniGene can be browsed from the NCBI home page by following the link labeled "UniGene: Unique Human Gene Sequence Collection". There is both a map-oriented search facility, which lists expressed sequences that have been placed on a particular chromosome, and a keyword search facility. The difference between the map search facilities offered in the Gene Map and UniGene pages is that the latter includes expressed sequences that have been assigned to chromosomes but not otherwise ordered.

Both search interfaces will eventually generate a listing of matching UniGene clusters, along with a short phrase that describes each one. Select the clusters of interest in order to see the individual dbEST and GenBank entries that comprise the set. You can then browse individual sequence entries.
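
Clustering of this kind can be pictured as grouping sequences that are connected by chains of pairwise similarity. The following is a minimal single-linkage sketch using a union-find structure. It assumes the list of similar pairs has already been computed by some comparison step, the IDs are invented, and it is not a description of how UniGene is actually built.

```python
def cluster(ids, similar_pairs):
    """Group IDs into clusters: any two linked by a chain of similar
    pairs end up in the same cluster (single linkage, via union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Invented IDs: three ESTs and one published gene join one cluster
# only where a similarity link connects them.
ids = ["est1", "est2", "est3", "gene1", "est4"]
pairs = [("est1", "est2"), ("est2", "gene1")]
print(cluster(ids, pairs))
```

Note that est1 and gene1 land in the same cluster even though they were never compared directly; the shared neighbor est2 links them.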

SRS

Although Entrez might appear to be the be-all and end-all source of protein and nucleotide sequence information, this is not quite true. There are many smaller but well-curated databases of biological information that are not included among the databases that Entrez serves. These include the Prosite and Blocks databases of protein structural motifs, transcription factor databases, species-specific databases, and databases devoted to certain pathogens.

The SRS ("Sequence Retrieval System") is a Web-based system for searching among multiple sequence databases supported by the European Molecular Biology Laboratory (EMBL). In addition to the large EMBL sequence database, it cross references sequence information from approximately 40 other sequence databases (the precise number is slowly increasing). Among these databases are ones that hold protein and nucleotide sequence information, 3D structure, disease and phenotype information, and functional information.

[Figure 11] The SRS system is replicated among multiple sites in order to distribute network load. You can access it directly from its home page in Heidelberg, Germany, or follow a link from this page to locate the site nearest you (there are servers in Europe, Asia, the Pacific, and South America; North America is conspicuously absent). Several of the replicated sites can also be found via links in the WWW Virtual Library. To begin a search, connect to an SRS site (Figure 11); closer sites may have better response times.

The SRS home page contains links that you can follow to learn more about the SRS service. Click "Start" to begin a new SRS searching session.

You'll be asked to select the databases to search (not shown). There are 40-odd checkboxes on this page, each corresponding to a different source database. This may seem formidable at first, but fortunately the databases have been grouped by category into nucleotide sequence-related, protein-related, and so on. Check off the databases you're interested in searching. If you aren't familiar with a particular database, click its name to obtain a brief description. In our example, we'll search the motif databases for proteins containing the zinc finger motif. We select the Prosite and Blocks databases and press "Continue" to move to the next page.

[Figure 12] This takes us to a page similar to the one shown in Figure 12. The most important elements are a set of text fields at the bottom of the page, each one corresponding to a different field in the database(s) selected. The number and nature of the fields will depend on which databases are selected. If multiple databases are selected, SRS only displays the fields that are shared in common among them all. This can be a trap for the unwary: if you select too many databases to search at once, you may find that the only field displayed is the (usually unhelpful) entry ID field. Page back and unselect some databases, or uncheck the option labeled "Show only fields that selected databanks have in common."
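
The shrinking list of fields is simply a set intersection: every database you add can only remove fields from the common set. A small sketch makes the trap visible. The field lists below are invented for illustration and do not reflect the real field definitions of these databases.

```python
def common_fields(fields_by_db, selected):
    """Fields present in every selected database (set intersection)."""
    field_sets = [set(fields_by_db[db]) for db in selected]
    return set.intersection(*field_sets) if field_sets else set()

# Field lists are invented; real databases define their own.
FIELDS = {
    "PROSITE": ["ID", "AccNumber", "Description", "Pattern"],
    "BLOCKS": ["ID", "AccNumber", "Description"],
    "EMBL": ["ID", "AccNumber", "Organism", "Sequence"],
}

print(sorted(common_fields(FIELDS, ["PROSITE", "BLOCKS"])))
print(sorted(common_fields(FIELDS, ["PROSITE", "BLOCKS", "EMBL"])))
```

With two databases selected the "Description" field is still available; adding a third that lacks it makes the field disappear from the form.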

Our example shows three fields labeled "ID", "AccNumber" and "Description". We can click on the name of each field in order to learn more about it, but in this case, it seems obvious that "Description" is the one we want. We type in "zinc" and press the "Do Query" button.

[Figure 13] This search results in a list of 65 matches (Figure 13) to entries in one or more of the selected databases. The format of each match is the name of the database, e.g. PROSITE, followed by a colon and the ID of the entry. We can now:

  1. Select an entry by clicking on its name, fetching information about it.
  2. Expand the search by selecting the checkboxes to the left of one or more entries and then choosing one of the buttons labeled "link", "save" or "view".

If we click on an entry's name, we'll be taken to its database record. What exactly is displayed will depend on the structure of the database. For members of the Prosite database, the record consists of a description of the structural motif and a list of all the entries in GenBank/EMBL that are known to contain this motif.

The more interesting operation is to select the checkboxes of a series of entries, then press the "link" button. This performs a search of the other databases in the SRS system and returns all entries that are cross-referenced with the selected entries. In our example, we select the C2H2 and C3HC4 zinc finger domains and press "link", taking us to a page that prompts us to select the databases to link to (not shown). We select the "Swissprot", "EMBL" and "Genbank" sequence databases and press "continue".

[Figure 14] The resulting page (Figure 14) lists 861 matches. We can scan through individual entries, or repeat the linking process to expand the scope of the search still further.

Other options on the search results page allow you to create and download reports on the selected matches in a variety of formats. Select the preferred format from the page's popup menu, and click either the "save" button to download the report to your local disk, or "view" to see the report in the browser. The report options available to you depend on the databases you have selected. Some databases offer only a simple text-only report; others offer more options. For example, the protein databases offer a fancy on-screen Java hydrophobicity chart. Other reports offer the ability to search the databases by sequence similarity using either the FASTA or Smith-Waterman algorithms.

GDB

GDB, the Genome Database, is the main repository for all published mapping information generated by the Human Genome Project. It is a species-specific database: only Homo sapiens maps are represented. Among the information stored in GDB is:

[Figure 15] To access GDB, connect to its home page (Figure 15). GDB offers several different ways to search the maps:

Simple search
This search, accessible from GDB's home page, allows you to perform an unstructured search of the database by keyword or the ID of the record. For example, a keyword search for "insulin" retrieves a list of clones and STSs that have something to do either with the insulin gene or with diabetes mellitus.

Structured searches
A variety of structured searches available via the link labeled "Other Search Options" allow you to search the database in a more deliberate manner. You may search for maps containing a particular region of interest (defined cytogenetically, by chromosome, or by proximity to a known marker) or for individual map markers based on a particular attribute (e.g. map position and marker type). GDB also offers a wizzy "Find a gene" interface that searches through the various aliases to find the gene you're searching for.

[Figure 16] Searches that recover individual map markers and clones will display them in a list of hypertext links similar to those displayed by Entrez and PDB. When you select an entry you'll be shown a page similar to Figure 16. Links on the page lead to citation information, information on maps this reagent has been assigned to, and cross-references to the GenBank sequence for the marker or clone. GDB holds no primary sequence information, but the Web's ability to interconnect databases makes this almost unnoticeable.

[Figure 17] A more interesting interface appears when a search recovers a map. In this case, GDB launches a Java applet to display it. If multiple maps are retrieved by the search, the maps are aligned and displayed side-by-side (Figure 17). A variety of settings allow you to adjust the appearance of the map, as well as to turn certain maps on and off. Double-clicking on any map element will display its GDB entry in a separate window.

Species-Specific Databases

In addition to the large community databases like GenBank, EMBL and GDB, there are hundreds of smaller species-specific databases available on the Web. Although not offering the comprehensive range of the big databases, they are a good source of unfiltered primary data. In addition they may be more timely than the community databases because of the inevitable lag between data production and publication.

Notable sites in this category include:

Whitehead Institute/MIT Center for Genome Research
The data available at this Web site include genome-wide genetic and physical maps of the mouse, physical maps of the human, a genetic map of the rat, and human Chromosome 17 DNA sequence.

MGD (Mouse Genome Database)
This database, based at the Jackson Laboratory, contains mouse physical and genetic mapping information, DNA sequencing data, and a rich collection of mouse strains and mutants.

Stanford Human Genome Center
This is the site of an ongoing project to produce a high-resolution radiation hybrid map of the human genome.

FlyBase
This Web database, hosted at Indiana University, is a repository of maps, reagents, strains and citations for Drosophila melanogaster.

ACEDB
The ACEDB database stores mapping, sequencing, citational and developmental information on C. elegans and other organisms. The Genome Informatics Group at the University of Maryland maintains a Web site at the URL given below that provides interfaces both to the Caenorhabditis elegans database and to a variety of plant, fungal and prokaryotic genomes.

SGD (Saccharomyces Genome Database)
The Stanford Genome Center hosts the Saccharomyces Genome Database, a repository of everything that's worth knowing about yeast (now including the complete DNA sequence).

TIGR (The Institute for Genomic Research)
The TIGR site contains partial and complete genomic sequences of a large number of prokaryotic, fungal and protozoal organisms. Its "Human Gene Index" is a search interface to the large number of human expressed sequences that have been produced by TIGR and other groups.

Washington University Genome Sequencing Center
This is the home of several large-scale genome sequencing projects, including human and mouse EST sequencing, Caenorhabditis elegans genomic sequencing, and human genomic sequencing (primarily chromosomes 2 and 7).

The Sanger Center
The Sanger Center is another source of extensive DNA sequencing information. Its projects include the genomic sequence of C. elegans, and human chromosomes 1, 6, 20, 22 and X. In addition to its sequencing efforts, the Sanger Center also produces chromosome-specific human radiation hybrid maps.

University of Washington
The University of Washington Genome Center is sequencing human chromosome 7 (in collaboration with Washington University), as well as the human HLA class I region and the mouse T cell receptor region.

Analytic Tools

We now turn our attention to the analytic tools available on the Web. A few years ago, even the simplest type of sequence analysis was hindered by the need to find the right software for your computer, install it, learn the ins and outs of the interface, and format the data according to the program's needs. This has become much simpler recently as more and more of the standard computational tools in the molecular biologist's armamentarium have been put on-line with easy-to-use Web front ends. This section tours the tools that are currently available for the most frequent types of analyses.

BLAST Searches

The most basic computational tool is the BLAST search, a rapid comparison of a search sequence to a database of known sequences. This search is used routinely to determine whether a newly sequenced DNA has already been published in the literature, and, if not, to give some hint of its putative function by searching for related sequences.

Many Web sites offer BLAST search interfaces, including SRS (discussed above) and the NCSA Biologist's Workbench (discussed later). Probably the most widely-used interface is the one offered by the NCBI. To use this interface, connect to the NCBI BLAST search page and select the type of search to perform. NCBI offers both "basic" and "advanced" searches. The first uses sensible default parameters for the search. The latter allows you to fine tune the BLAST search parameters, something only recommended if you fully understand what you're doing (extensive on-line documentation on the BLAST algorithm is available at the NCBI site). Because of the recent release of large amounts of TIGR-specific EST data at the time this was written, NCBI was also offering BLAST searches restricted to the TIGR data set. This may no longer be available by the time you read this.

[Figure 18] For the purposes of our tour, we'll select the link pointing to the "basic" search, arriving at the page shown in Figure 18. The BLAST interface allows us to search for sequences using one of several different algorithms, selected from a pop-up menu near the top of the page. The algorithms are described in more detail in the on-line documentation, but I summarize them here for convenience:

blastn
The search sequence is compared directly to the database sequences, using parameters appropriate for nucleotides.

blastp
The search sequence is compared directly to the database sequences, using parameters appropriate for protein sequence.

blastx
The search nucleotide sequence is first translated into protein sequence in all six reading frames, then compared against a database of protein sequences.

tblastn
The search protein sequence is compared against a database of nucleotide sequences after translating each database entry into protein using all six reading frames.

tblastx
The search nucleotide sequence is compared against a database of nucleotide sequences, after translating both the search sequence and the database sequences into protein in all six reading frames.
The advantage of using one of the blastx, tblastn or tblastx search methods is that it allows you to find matches to distantly related sequences. The disadvantage is that the searches become computationally intensive and may take an inordinate length of time.
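
The five programs differ only in the kind of sequence on each side of the comparison and in which side is translated. As a memory aid, the summary above can be restated as a small lookup table. The Python sketch below is purely illustrative: the names `BLAST_PROGRAMS` and `programs_for` are invented for this example and are not part of any NCBI software.

```python
# Illustrative restatement of the program summary above.
# program name -> (query type, database type, what gets translated)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", "nucleotide", "nothing"),
    "blastp":  ("protein",    "protein",    "nothing"),
    "blastx":  ("nucleotide", "protein",    "query"),
    "tblastn": ("protein",    "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}


def programs_for(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database, _translated) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )


if __name__ == "__main__":
    # A nucleotide query against a nucleotide database can be searched
    # directly (blastn) or through six-frame translation (tblastx).
    print(programs_for("nucleotide", "nucleotide"))
    print(programs_for("protein", "nucleotide"))
```

Laid out this way, the table makes the trade-off visible: the three programs whose third column is not "nothing" are the ones that pay the extra cost of six-frame translation.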

In addition to selecting the algorithm, you'll also be asked to select the database to search. You may search the default "nr" database, which contains a list of all non-redundant nucleotide sequences known to GenBank, or restrict the search to various species-specific collections, ESTs, or to new entries submitted during the past month. We leave the default at "nr".

The next task is to enter the sequence itself. The BLAST interface offers two ways to specify a sequence. You may cut and paste the sequence directly into the large text field in the center of the page. Alternatively, if the search sequence is already a part of GenBank, you can select "Accession or GI" from the pop-up menu above the text field and enter the sequence's GenBank accession number in the text area.

If you choose to enter the raw sequence, you must be careful to use FASTA format. This format begins with the name of the sequence on the top line, preceded by a ">" sign. Following this is the sequence itself, which should contain no line numbers or spaces. The figure shows an example of a valid search sequence.
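
If your sequence comes from a word processor or a printed listing, it may carry line numbers and spaces that need to be stripped first. The following Python sketch shows one way to do this. The function name `to_fasta` and the 60-character line width are assumptions made for this example; the text above requires only the ">" name line followed by the bare sequence.

```python
def to_fasta(name, sequence, width=60):
    """Format a raw sequence as a FASTA record.

    Digits (line numbers), spaces and line breaks are discarded, so
    only the letters of the sequence itself are kept.
    """
    cleaned = "".join(ch for ch in sequence if ch.isalpha())
    if not name or not cleaned:
        raise ValueError("a FASTA record needs both a name and a sequence")
    # Break the sequence into lines of equal width (an assumed convention).
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    raw = """
        1 TATCGCATGC AGCCCCCAGT
       21 CACGCATCCC CTGCTTGTTC
    """
    print(to_fasta("blunderglobin", raw), end="")
```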

[Figure 19] We fill in the text field with our search sequence ("blunderglobin") and press the "Submit Query" button located above the text area. A few seconds later our query returns with a list of possible matches (Figure 19), ordered so that the most similar sequences are located at the top. We can now click on the links for the matches in order to view their GenBank entries.

The BLAST server may be slow during periods of heavy usage. At the bottom of the search page a pair of checkboxes and an additional "Submit Query" button allow you to have the BLAST server send the results of the search to your e-mail address. This allows you to launch several searches without waiting for each one to complete.

Primer Picking

Another bread-and-butter task is to pick a PCR primer pair from a DNA sequence in order to create an STS (sequence tagged site). The Whitehead Institute/MIT Center for Genome Research provides a handy on-line primer picking tool, an interface to its freeware PRIMER program.

[Figure 20] To use this resource, connect to the Whitehead's home page and select the link labeled "WWW Primer Picking." The interface is straightforward (Figure 20). Paste the DNA sequence into the large text field at the top of the form and press the button labeled "Pick Primers." The sequence should contain only the characters AGCTN and white space. Case is ignored. Although the program can handle large sequences, it is wise not to paste in sequences much longer than you need. There's no need to enter 20K of sequence in order to generate an STS 200 bp long.
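
Before pasting, it can be worth checking that the sequence really does contain only the permitted characters. Here is a minimal Python sketch of such a check, based solely on the rule stated above (AGCTN plus white space, case ignored). The function name `clean_primer_input` is hypothetical and has nothing to do with the PRIMER program itself.

```python
ALLOWED = set("AGCTN")


def clean_primer_input(raw):
    """Remove white space, fold to upper case, and reject other characters."""
    sequence = "".join(raw.split()).upper()
    unexpected = sorted(set(sequence) - ALLOWED)
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    return sequence


if __name__ == "__main__":
    print(clean_primer_input("tatcgcatgc agcccccagt\ncacgcatccc nnnn"))
```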

In this example, we've pasted in the sequence for our unknown "blunderglobin" gene. After pressing the primer picking button, the program offers us five sets of primer pairs that define STSs ranging in size from 117 to 256 bp. We can accept these or page back to the previous page to change the primer picking parameters. The default parameters pick primers that satisfy PCR conditions used at the Whitehead and many other laboratories. However, all parameters are adjustable. You may adjust the PCR conditions, the preferred PCR product size, and the stringency of the primer picking. You may also designate regions that the program shall exclude from primer picking, or which it will attempt to include in the PCR product. Another option allows you to pick a third oligonucleotide within the PCR product for the purpose of certain protocols that use hybridization to detect the product.

Exon Prediction

As the rate of genomic DNA sequencing has increased, it has become ever more important to have tools that can predict the presence of genes in genomic sequence. While far from perfect, these exon prediction tools (also known as "gene finders" and "sequence annotators") can give you a first start at finding the location of potential genes.

The Baylor College of Medicine's Gene Finder program is the most straightforward of the exon prediction programs. To use it, connect to the Baylor Molecular Biology Computational Resources page and follow the links labeled "Services on the Web" and "The BCM Genefinder" (not shown). Paste the sequence into the large text field at the top of the page, and enter its name in the small text field where indicated. For certain long-running algorithms you are also asked to enter your e-mail address so that the results can be sent to you off-line. Sequences of up to 7 kb are accepted. For longer sequences, the Gene Finder page instructs you to use an e-mail interface instead.

This is the easy part. The hard part is choosing the exon prediction algorithm and parameter set from among the 20-odd possibilities that Gene Finder's page offers. While specific instructions are given in the search page, the default of "FGENEH" is most suitable for human genomic sequences. Other algorithms are better tuned for invertebrates, prokaryotes and fungi.

To test Gene Finder, we paste the first 4K of the human apolipoprotein CI sequence into the field and press "Perform Search." The result is shown below:

Name: ApoCI
First three lines of sequence:
TATCGCATGCAGCCCCCAGTCACGCATCCCCTGCTTGTTCAATCGATCACGACCCTCTCACGTGCACCCACTTAG
AGTTGTGAGCCCTTAAAAGGAACAGGGATTGCTCACTCGGGGAGCTCGGCTCTTGAGACAGGAATCTTGCCCATT
CCCCGAACGAATAAACCCCTTCCTTCGTTAACTCAGCGTCTGAGGAATTTTGTCTGCGGCTCCTCCTGCTACATT


fgeneh  Fri Jul 11 11:14:04 CDT 1997   ApoCI
 Nucleotides which are not A,C,G,T,R or Y were removed from your sequence.
 length of sequence -   2401
 number of predicted exons -  3
 positions of predicted exons:
    468 -     529 w=   7.68
    664 -     741 w=  13.76
   1986 -    2296 w=   5.94
 Length of Coding region-    451bp           Amino acid sequence -    149aa
LIKVLRAGQDLPTKPSSKDSECPSGLAMRLFLSLPVLVVVLSIVLEGPAPAQGTPDVSSA
LDKLKEFGNTLEDKARELISRIKQSELSAKMRLEPFPGHGRAGVCFWVEPWQMVQDEQIE
KKTSPGEADNIPLVTQLDLKVLRLQGQFP*
In this case Gene Finder identified three potential exons in this sequence. The second two correspond to known exons in the ApoCI gene. The first, however, is a false hit. It spans an area overlapping an untranscribed area and the 5' untranslated region. A 60% accuracy rate is typical of the current generation of exon identification tools.
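
If you run many sequences through Gene Finder, you may want to pull the exon coordinates out of the text report. The Python sketch below does this for the report shown above. The regular expression is an assumption based only on this one example; other Gene Finder algorithms may lay out their reports differently.

```python
import re

# Matches lines such as "    468 -     529 w=   7.68" (layout assumed
# from the single fgeneh report shown above).
EXON_LINE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s+w=\s*([\d.]+)\s*$")


def parse_exons(report):
    """Return a list of (start, end, weight) tuples from a report."""
    exons = []
    for line in report.splitlines():
        match = EXON_LINE.match(line)
        if match:
            start, end, weight = match.groups()
            exons.append((int(start), int(end), float(weight)))
    return exons


def coding_length(exons):
    """Total length of the predicted exons; coordinates are inclusive."""
    return sum(end - start + 1 for start, end, _weight in exons)


REPORT = """\
 number of predicted exons -  3
 positions of predicted exons:
    468 -     529 w=   7.68
    664 -     741 w=  13.76
   1986 -    2296 w=   5.94
"""

if __name__ == "__main__":
    exons = parse_exons(REPORT)
    print(len(exons), "exons,", coding_length(exons), "bp")
```

The three exons are 62, 78 and 311 bp long, which adds up to the 451 bp that the report gives as the length of the coding region.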

In addition to the exon prediction service, Baylor offers a number of on-line tools for molecular biologists, including protein secondary structure prediction, sequence alignment, and a service that launches sequence similarity searches on a number of databases.

[Figure 21] Another exon predictor is GRAIL (Gene Recognition and Assembly Internet Link), a service provided by the Oak Ridge National Laboratory. The GRAIL engine can detect other features in addition to exons, including polyadenylation sites, repeat sequences, and CpG islands. The interface is relatively simple (Figure 21). Choose the feature you wish to search for, and paste the sequence into the text field at the bottom of the page (scrolled out of view in the screenshot). Alternatively, a "file upload" button allows you to load the sequence directly from a text-only file on your local disk. The exact nature of each feature that GRAIL can search for is described in detail in the program's on-line manual.

Selecting "Grail 2 Exons" from the list of features and repeating our experiment with ApoCI gave the following results:

[grail2exons -> Exons]

      St Fr Start     End ORFstart ORFend     Score      Quality
   1-  f 2    664     741     636     821   100.000    excellent
   2-  f 1   1986    2121    1814    2296    98.000    excellent

[grail2exons -> Exon Translations]

3- SPEPLPLPPECPSGLAMRLFLSLPVLVVVLSIVLEGKSGMGELGS

4- FEPLPIFLAGPAPAQGTPDVSSALDKLKEFGNTLEDKARELISRIKQSEL
   SAKMRLEPFPGHGR
In this case, the two correct exons were identified.
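
Because the two services report exons as coordinate ranges, their predictions are easy to compare by computing how many bases each pair of ranges shares. The Python sketch below uses the coordinates copied from the two outputs above. The function names are invented for this illustration, and the comparison says nothing about which prediction is closer to the true exon boundaries.

```python
# Exon coordinates copied from the Gene Finder and GRAIL outputs above.
GENE_FINDER = [(468, 529), (664, 741), (1986, 2296)]
GRAIL = [(664, 741), (1986, 2121)]


def overlap(a, b):
    """Number of bases shared by two inclusive (start, end) ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def compare(predictions_a, predictions_b):
    """For each exon in A, report the best-overlapping exon in B (or None)."""
    rows = []
    for exon in predictions_a:
        best = max(predictions_b, key=lambda other: overlap(exon, other))
        shared = overlap(exon, best)
        rows.append((exon, best if shared else None, shared))
    return rows


if __name__ == "__main__":
    for exon, partner, shared in compare(GENE_FINDER, GRAIL):
        print(exon, "->", partner, "shared bases:", shared)
```

Run on these numbers, the comparison finds no GRAIL counterpart for Gene Finder's first exon, an exact match for the second, and a 136-base overlap for the third, where the two programs agree on the start but not the end.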

NCSA Biology Workbench

The NCSA (National Center for Supercomputing Applications) was responsible for Mosaic, the graphical Web browser that set off the explosion of interest in the Web. It is also responsible for the Biology Workbench, an integrated package of several dozen protein and nucleotide sequence search and analysis tools.

In order to use Workbench you will need to register an account with the Workbench server. This is because Workbench allows you to save personal project data on the server itself. Your account name and password provide a way to return to the data and ensure a degree of privacy. To create an account, go to the Biology Workbench home page (not shown), and select the link labeled "Account Set-Up." You will be prompted for a login name and password.

[Figure 22] After the account is created, you will be able to enter the service by following the link labeled "Welcome to the NCSA Biology Workbench." Figure 22 shows the main Biology Workbench screen. The menu bar at the top of the page contains five subdivisions labeled "Session Tools", "Protein Tools", "Nucleic Tools", "Alignment Tools" and "Report Bugs." The Protein, Nucleic acid and Alignment tool buttons lead to pages that run various analytical programs. "Session tools" allows you to save your work to a named "session", log your actions, and restore an old session at some later date. The meaning of "Report Bugs" should be obvious.

Workbench is confusing at first because of its many options. Once you understand its style, however, it is easy to use. The general strategy for a Workbench session is as follows:

  1. Import sequences into Workbench.
  2. Select one or more sequences to analyze.
  3. Select the analysis to perform.
  4. Run the analysis.
Importing Sequences. Select the type of analysis you wish to perform from the main Protein, Nucleic Acid and Alignment divisions. A scrolling list of possible analyses will appear (Figure 22), among which are options to add a new sequence and perform an SRS database search. The first option allows you to import a new sequence into the workbench by cutting and pasting into your browser. The second is an interface to the SRS search engine (see the section above). Sequences recovered from the SRS search engine can be imported into Workbench without an intermediate cut-and-paste step. The interfaces for these two options are similar to ones we've already seen. Once a sequence has been imported into Workbench, you can view it, edit it, or delete it.

[Figure 23] Selecting one or more sequences to analyze. Once sequences have been imported into Workbench, they will appear in a list below the menu bar (Figure 23). To the left of each sequence is a checkbox. To select the sequence(s) to analyze, just check the appropriate box.

Selecting the analysis to perform. The list of analyses spans the spectrum from sequence similarity searches to alignments to protein secondary structure prediction. What is displayed in the scrolling list depends on which of the major subdivisions you've selected. Only one analysis can be selected at a time. Some analyses require one sequence only to be selected, while others require two or more. In the example shown in Figure 23, we've imported and selected both the mRNA and genomic sequences for the human ApoCI gene and will be using the CLUSTALW algorithm to obtain a sequence alignment.

Run the analysis. Press the button labeled "Perform Selected Operation." Depending on the analysis, you may now be asked to view and adjust some of its parameters.

[Figure 24] The format of the output depends on the analysis. In the case of the attempted alignment between the genomic and mRNA ApoCI sequences, Workbench produces a large text file showing the expected alignment between the two sequences. A button at the bottom of the output prompts us to import this alignment file back into Workbench. Doing so gives us an "alignment" object which we can then view with one of Workbench's alignment display tools (Figure 24).

Interestingly (but not too surprisingly), the coding sequences that GenBank's two entries give for this gene are not quite the same. Regrettably there is no existing on-line artificial intelligence service that will help sort out this type of problem!

Physical Mapping Tools

Several Web sites offer you the ability to map new STSs to one or more of the existing physical maps of the human genome. Two of the more useful services are those of the Whitehead Institute/MIT Center for Genome Research, which offers mapping services for its radiation hybrid map, and the Stanford Human Genome Center, which will place STSs on its high resolution radiation hybrid map. Used in conjunction with the primer picking service described above, both services provide you with a rapid way to map new clones.

To use either of the radiation hybrid mapping services, you must obtain DNAs from the same radiation hybrid screening panel that was used to construct the map (see Chapter 6 for full details on radiation hybrid mapping). The Whitehead map was constructed from the Genebridge 4 RH panel, while the Stanford map used the higher-resolution G3 mapping panel. DNAs are available from a number of biotech supply houses, including Research Genetics of Huntsville, Alabama. Each STS to be mapped must be amplified on the DNAs from the hybrid panel, then scored on agarose or acrylamide gel. For best results, all amplifications should be done in duplicate; results that are discrepant should be repeated or treated as unknown.
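
The rule for duplicate amplifications is simple enough to automate. The Python sketch below merges two scoring passes into one result, using the digit codes of the Whitehead vector format described below ("0" negative, "1" positive, "2" unknown). The function name `consensus` is hypothetical, and treating any disagreement as unknown is the more conservative of the two options mentioned above; repeating the PCR is the other.

```python
def consensus(first_pass, second_pass):
    """Merge duplicate PCR scorings of the same STS.

    Positions where both passes agree on "0" or "1" keep that score.
    Everything else (a discrepancy, or a missing result in either
    pass) becomes "2", meaning unknown.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("the two passes must score the same number of hybrids")
    merged = []
    for a, b in zip(first_pass, second_pass):
        merged.append(a if a == b and a in "01" else "2")
    return "".join(merged)


if __name__ == "__main__":
    # Shortened vectors for illustration; a real panel has many more hybrids.
    print(consensus("0110", "0100"))
```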

To place STSs on the Whitehead map, reformat the hybrid panel screening results in standard "radiation hybrid vector" format. The format looks like this:

sts_name1 0010010110000010000000110100011011100111001010012110011101010101001010001010001100011000011
sts_name2 0000011110000010000000110100000011100111001010012110011101010101001000001010001100011000011
...
Each digit is the result of the PCR on one of the radiation hybrid cell lines. "0" indicates that the PCR was negative (no reaction product), "1" indicates that it was positive, and "2" is used for "unknown" or "not done". The order of digits in the vector is important, and must correspond to the official order of the Genebridge 4 radiation hybrid panel. The correct order is given in the help page of the Whitehead server (see below), and is identical to the order in which the DNAs are packaged when they are shipped by Research Genetics. You can place spaces within the vector in order to increase readability. The STS name should be separated from the screening data with one or more spaces or tabs.
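
A malformed vector is easy to produce by hand, so it may be worth checking your data before you submit it. The following Python sketch parses one line of the format just described. The function name `parse_rh_line` and the optional `panel_size` argument are assumptions made for this example; consult the server's help page for the number of hybrids and their official order.

```python
def parse_rh_line(line, panel_size=None):
    """Split one line into (sts_name, vector) and validate the vector.

    The name is the first white-space-delimited field. The remaining
    fields are joined, since spaces inside the vector are permitted.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected an STS name followed by screening data")
    name = fields[0]
    vector = "".join(fields[1:])
    unexpected = sorted(set(vector) - set("012"))
    if unexpected:
        raise ValueError(name + ": unexpected characters " + "".join(unexpected))
    if panel_size is not None and len(vector) != panel_size:
        raise ValueError(
            "%s: %d scores, expected %d" % (name, len(vector), panel_size)
        )
    return name, vector


if __name__ == "__main__":
    # Shortened vector for illustration only.
    print(parse_rh_line("sts_name1\t00100 10110 00002", panel_size=15))
```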

[Figure 25] From the Whitehead home page, follow the link labeled "Map STSs relative to the human radiation hybrid map" (Figure 25). Enter your e-mail address where indicated, and cut and paste the PCR scores into the large textfield at the top of the page. It is important that you enter the correct e-mail address, as this is the only way in which you can be informed of the mapping results.

By default, the mapping results are returned in text form. If you wish to generate graphical pictures of the STSs placed on the Whitehead map, you must select the desired graphics format. Currently the PICT and GIF formats are available. The former is appropriate if you are using a Macintosh system. The latter is appropriate for Windows and other systems. Select the graphics format by choosing the appropriate radio button from the labeled set (scrolled out of sight in the figure).

When you are satisfied with the settings, press the "Submit" button. You will receive a confirmation that the data has been submitted for mapping. The results will be returned to you via e-mail shortly (if the server is loaded, however, it may take several hours). If the STS was successfully mapped, the e-mail will list the chromosome it linked to, and its position relative to other markers on the Whitehead map. If requested, you will also receive a picture of the map (with the location of the newly mapped STSs marked in red) as an e-mail enclosure.

The Whitehead also offers access to its STS content-based physical map of the human genome. If you have screened one or more STSs against the CEPH mega-YAC library (see Chapter 2), you can use a search page located at the Whitehead site to determine which YAC contigs contain the YACs hit by your STSs. From this you can infer the position of the STSs relative to the Whitehead map. You can access this service by connecting to the Whitehead home page. Then follow the links labeled "Human Physical Mapping Project" and "Search for a YAC by its address".

To place STSs on the Stanford RH map, prepare your data in a similar way, but using the G3 mapping panel. The other important difference is that the PCR result vectors should use an "R" rather than a "2" to indicate missing or discrepant data. For the Stanford service, data vectors should not contain white space.
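
If you are used to scoring in the Whitehead notation, the two differences in notation can be handled mechanically. The Python sketch below rewrites a vector scored with "0", "1" and "2" into the Stanford style. The function name `to_stanford_vector` is invented for this example. Note that this changes the notation only: the scores themselves must still come from the G3 panel, since results from the Genebridge 4 panel cannot be placed on the Stanford map.

```python
def to_stanford_vector(vector):
    """Rewrite a 0/1/2 score vector in the Stanford notation.

    White space is removed and "2" (unknown) is replaced by "R".
    """
    compact = "".join(vector.split())
    unexpected = sorted(set(compact) - set("012"))
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    return compact.replace("2", "R")


if __name__ == "__main__":
    # Shortened vector for illustration only.
    print(to_stanford_vector("0102 1120"))
```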

Connect to the Stanford Genome Center's home page and follow the links to "RH Server" and then to "RHServer Web Submission." Enter your e-mail address and a reference number in the indicated fields. The e-mail address is vital to receive the mapping results. The reference number is an optional field that will be returned to you with the results and is intended to help you keep the results organized. If known, also enter the STS's chromosomal assignment into the field labeled Chromosome number. This information increases the ability of the mapping software to detect a valid linkage.

Now cut and paste the screening results into the large textfield and press the "Submit" button. Mapping results are typically returned via e-mail within a few minutes. The Stanford server returns the mapping results as a series of placements relative to genetic markers. For each STS, the server reports the closest genetic marker, its chromosome, and the distance, in centiRays, from the marker to the STS. Although no graphical display is provided for the mapping results, the retrieved information can be used in conjunction with the browsable maps available at the Stanford site in order to infer the location of the newly mapped STS relative to other STSs on the Stanford radiation hybrid map.

Other Analysis Tools

There are many more resources for computational biology than we could possibly fit into this tour. Fortunately the Web makes it easy to find them. They're never more than one or two jumps away from the following pages:
Pedro's BioMolecular Research Tools
A vast compendium of computational biology tools. Here you'll find both the on-line sort and the traditional sort that must be downloaded and installed. One caveat: this page hasn't been updated in over a year and may be out of date by the time you read this.

Baylor College of Medicine Computational Resources
The BCM site that we visited earlier contains a large number of links to on-line analytic tools around the world.

Whitehead Institute Biocomputing Links
The Whitehead Institute maintains a well-organized and up-to-date listing of analytic tools for molecular biology.

Dan Jacobsen's Archive of Molecular Biology Software
This page lists links to nearly a hundred repositories of molecular biology software. If you can't find it here, it probably doesn't exist.

Conclusion

The Web has already revolutionized the way that biologists work and, to some extent, think. The next few years will see even more dramatic changes as the databases, analytic tools, and perhaps even the software used to acquire and manage primary data merge together. As large scale genomic sequencing swings into full gear, there will be a need for tools to allow physically distant collaborators to edit and annotate sequences, run analyses, share their results with the community, and take issue with other laboratories' findings. The Web will provide the essential infrastructure for those tools.

We can look forward to some interesting times ahead.

URLs

Web Browsers

Netscape Corporation
http://www.netscape.com/

Microsoft Corporation
http://www.microsoft.com/

Genome Databases

WWW Virtual Library, Genetics Division
http://www.ornl.gov/TechResources/Human_Genome/genetics.html

GenBank / National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov/

Entrez Home Page
http://www3.ncbi.nlm.nih.gov/Entrez/

Entrez MMDB FAQ Page
http://www3.ncbi.nlm.nih.gov/Entrez/struchelp.html

NCBI Gene Map of the Human Genome
http://www.ncbi.nlm.nih.gov/SCIENCE96/

SRS Home Page
http://www.embl-heidelberg.de/srs5/

GDB Home Page
http://gdbwww.gdb.org/

Whitehead Institute/MIT Center for Genome Research
http://www.genome.wi.mit.edu/

MGD Home Page
http://www.jax.org/

Stanford Human Genome Center
http://shgc.stanford.edu/

FlyBase
http://flybase.bio.indiana.edu/

ACEDB
http://probe.nalusda.gov:8300/other/

Saccharomyces Genome Database
http://genome-www.stanford.edu/Saccharomyces/

The Institute for Genomic Research
http://www.tigr.org/

Washington University Genome Sequencing Center
http://genome.wustl.edu/gsc/gschmpg.html

The Sanger Center
http://www.sanger.ac.uk/

University of Washington
http://chimera.biotech.washington.edu/uwgc/

Analytic Tools

NCBI BLAST Searches
http://www3.ncbi.nlm.nih.gov/BLAST/

Whitehead PCR Primer Picking
http://www.genome.wi.mit.edu/

Baylor College of Medicine Computational Resources
http://condor.bcm.tmc.edu/home.html

GRAIL
http://avalon.epm.ornl.gov/Grail-1.3/

NCSA Biology Workbench
http://biology.ncsa.uiuc.edu/

Direct Link to Whitehead Radiation Hybrid Mapping Service
http://www.genome.wi.mit.edu/cgi-bin/contig/rhmapper.pl

Pedro's BioMolecular Research Tools
http://www.public.iastate.edu/~pedro/research_tools.html

Whitehead Institute Biology Resources
http://www.wi.mit.edu/bio/biology.html

Dan Jacobsen's Archive of Molecular Biology Software
http://www.gdb.org/Dan/softsearch/biol-links.html

Captions

Figure 1: The WWW Virtual Library, a good jumping-off point for genome resources on the Web.

Figure 2: The NCBI home page provides access to the huge GenBank sequence database.

Figure 3: The Entrez search engine provides access to GenBank's bibliographic, nucleotide, protein, structural and genome divisions.

Figure 4: Searching the nucleotide database for entries that refer to "sushi".

Figure 5: The "sushi" search finds 8 documents. We can either view them, or refine the search further.

Figure 6: Entrez presents search results as a list of hotlinks to GenBank entries.

Figure 7: The GenBank entry for accession number U78093.

Figure 8: The genomes division of Entrez has a graphical interface based on alignments among multiple maps.

Figure 9: Entrez's structural division uses external viewers to display and rotate 3D protein models.

Figure 10: The NCBI gene map allows you to search for expressed genes by name or position.

Figure 11: The SRS sequence search system links 40 different molecular biology databases.

Figure 12: SRS search pages allow you to perform structured (field-based) queries on one or more databases.

Figure 13: SRS displays search results as a series of hypertext links. Clicking on the buttons at top broadens the search to other databases by bringing in cross-references.

Figure 14: After broadening the SRS search shown in the previous figure, SRS now brings in entries from SwissProt and other databases.

Figure 15: The GDB home page provides access to the main repository for human genome mapping information.

Figure 16: GDB displays most entries using a text format like that shown here.

Figure 17: GDB maps are displayed using an interactive Java applet.

Figure 18: NCBI's BLAST page provides rapid sequence similarity searches for both protein and nucleotide sequences.

Figure 19: Searching for a match to the imaginary "blunderglobin" sequence using BLAST.

Figure 20: Primer picking with the Whitehead Institute's PRIMER tool.

Figure 21: The GRAIL site provides on-line nucleotide sequence feature finding services. The checkboxes allow you to select which features to search for.

Figure 22: The NCSA Biology Workbench has four main analytic subdivisions, selected among using the menu buttons at the top.

Figure 23: Performing an analysis with Biology Workbench is a matter of selecting the sequences to analyze and the analytic program to run.

Figure 24: A sequence alignment produced by Biology Workbench.

Figure 25: The Whitehead radiation hybrid mapping service allows you to place new STSs on the Whitehead radiation hybrid map by pasting in PCR amplification data.



Lincoln D. Stein, lstein@genome.wi.mit.edu
Whitehead Institute/MIT Center for Genome Research
Last modified: Wed Mar 25 13:15:15 EST 1998