Since the first edition of this book was published, the genome community has seen an explosion in the number and variety of resources available over the Internet. In large part this explosion is due to the invention of the World Wide Web, a system of document linking and integration that has gone from obscurity to commonplace in a mere five years.
The genome community was an early adopter of the Web, finding in it a way to publish its vast accumulation of data, and to express the rich interconnectedness of biological information. The Web is the home of primary data, of genome maps, of expression data, of DNA and protein sequences, of X-ray crystallographic structures, and of the genome project's huge outpouring of publications. This data, spread out among thousands of individual laboratories and community databases, is hotlinked throughout. Researchers who wish to learn more about a particular gene can (with a bit of patience) move from physical map to clone to sequence to disease linkage to literature references and back again, all without leaving the comfort of their Web browser application.
However, the Web is much more than a static repository of information. It is increasingly being used as a front end for sophisticated analytic software. Sequence similarity search engines, protein structural motif finders, exon identifiers, and even mapping programs have all been integrated into the Web. Java applets are adding rapidly to Web browsers' capabilities, enabling pages to be far more interactive than the original click-fetch-click interface. It may soon be possible for biologists to do all their computational work with no more than a browser on their desktop computers.
This chapter is an illustrated tour of the World Wide Web from the genome biologist's perspective. It doesn't pretend to be a technical discussion of Web protocols or to explain how things work. Nor does it attempt to be an exhaustive listing of all the myriad Web resources available. This would be exhausting for the reader and the author both! Instead I have attempted to "touch all the bases," showing you the range of resources available and giving guidance on how to learn more. Other chapters in this book delve more deeply into selected topics concerning the Web and the genome.
URLs (Uniform Resource Locators) are great for interactive Web browsing but terrible when they appear on the printed page. To spare the typesetter many lost nights of sleep, I have gathered all the URLs up and placed them in a table at the end of this chapter. Within the body of the text I refer to them by descriptive names such as "Pedro's Home Page" rather than by their less-friendly addresses.
Although it's good to have a recent version of the browser software installed, it may not be such a good idea to use the most recent, as these versions often contain bugs that cause frustrating crashes. Be particularly wary of "prerelease," "preview," and "beta" browser versions.
Home users usually dial into the Internet via an Internet service provider using a PPP (Point-to-Point Protocol) or SLIP (Serial Line Internet Protocol) connection. For such users a modem rated at 28.8 Kbps or better is strongly recommended.
Although we could start our tour anywhere, a good place to begin is
the genetics division of the WWW Virtual Library, a
distributed topic-oriented collection of Web resources (Figure 1). This page
contains links to several hundred sites around the world, organized by
organism.
The list of organisms in the left-hand frame provides a quick way to jump to the relevant section. Click on the link labeled "Human" to see sites under this heading. Subheadings direct you to a variety of U.S. and international sites, as well as to chromosome-specific Web pages and search services.
We select the link for GenBank, which takes us to the home page
of the National Center for Biotechnology Information (Figure 2). NCBI administers
GenBank, the main repository for all published nucleotide sequencing
information. Links from its home page will take you to GenBank, as
well as to SwissProt, the protein sequence repository; OMIM, the
Online Mendelian Inheritance in Man collection of genetic disorders;
MMDB, a database of crystallographic structures; and several other
important resources.
While there are several ways to access GenBank and the other databases, the most useful interface is the Entrez search engine, an integrated Web front end to many of the databases that NCBI supports. To access Entrez, we click on the labeled button in the navigation bar at the top of the window.
This takes us to the Entrez welcome page shown in Figure 3. The links on this
page point to Entrez's four main divisions:
A search on the nucleotide division will illustrate how Entrez works.
For this example, we'll say that we're interested in information on
the "sushi" family of repeats found in many serine proteases.
Selecting the link labeled "Search the NCBI nucleotide database"
displays a page similar to the one shown in Figure 4. There is a single
large text field in which to type keyword search terms, as well as a
popup menu that allows us to limit the search to certain database
fields. Available fields depend on which database we're searching.
In the case of the nucleotide database, there are nearly two dozen
fields covering everything from the name of the author that submitted
the sequence entry to the sequence length. In this case we accept the
default, which is "All Fields." We type "sushi" into the text field
and press the "Search" button.
Searches rarely take longer than a few seconds to complete. The page
that appears now (Figure
5) indicates that eight entries matched our search. This is a
small enough number that we can display the entire list. In cases
where too many matches are found, Entrez allows us to add new search
terms, progressively narrowing down the search until the number of
hits is manageable. We press the button labeled "Retrieve 8
documents."
A page listing a series of GenBank entries that match the search now
appears (Figure 6). This
is a complex page with multiple options. Each entry is associated
with a checkbox to its left. You may select all or a subset of the
entries on the list and click the "Display" button at the top of the
page. This will generate a summary page that reports on each of the
selected entries. The popup menu at the top of the page allows you to
choose the format of the report. Choices include the standard GenBank
format, a list of bibliographic references for the selected entries,
the list of protein "links", and the list of nucleotide "neighbors"
(more on "links" and "neighbors" later).
You may also retrieve information about a single entry. Following each entry's description is a series of hypertext links, each leading to a page that gives more information about the entry. Depending on the entry, certain links may or may not be present. A brief description of these links is as follows:
To continue our example, we decide to investigate GenBank entry
U78093, described as a Human "sushi-repeat-containing protein
precursor." Selecting "1 MEDLINE links" takes us to a page that lists
the one paper that refers to this entry (not shown), and prompts us to
select the citation format to display the article in. Selecting the
default format displays the citation shown in Figure 7. This article
indicates that the gene in question is deleted in some retinitis
pigmentosa patients, and offers us links (in the form of buttons) to
related articles, other relevant DNA and protein sequences, and to
entries in OMIM that deal with retinitis pigmentosa.
Returning to our original list of sushi sequences, we can now select the link labeled "9 nucleotide neighbors." This takes us to a list of all the sequence entries in GenBank that have significant BLAST homologies to U78093. Here we find several EST (expressed sequence tag) entries produced by the Washington University/Merck cDNA sequencing project. It is possible that some of them represent previously undescribed members of the sushi family of serine proteases.
The user interface for other Entrez divisions provides a similar
search-link-follow interface. The exception is the genomes division,
which, because it has fewer entries than the others, is entered
through a straightforward listing of prominent organisms and the
genome maps available for them. From the Entrez welcome page, we
select "Search the NCBI genomes database" and then "Homo sapiens" from
the list of prominent organisms (not shown). This leads us to a list
of 26 maps (22 autosomes, one sex chromosome and three mitochondrial
maps) from which we select human chromosome 14. This leads us to a
page (Figure 8) that displays a
single prominent image in the center. The image shows a series of
genetic and physical maps published from a variety of sources, roughly
aligned, with diagonal lines connecting common features. The image is
"live", meaning that we can click on it to magnify areas or to view
information about individual maps. When the magnification is large
enough to see individually mapped objects (sequences, genetic loci and
STSs), clicking on them will take us to a page showing the object's
GenBank record, where we can learn more about it in the manner
described above.
If you are interested in a known physical or genetic region and wish to view it directly, the genomes division interface allows you to type in the names of the two mapped loci that define the region. The map will be expanded and scrolled to the proper area. You can then examine the map for interesting candidate genes near the region of interest.
Entrez's 3D-structure division contains entries for several thousand proteins and other macromolecules whose structures have been determined by X-ray crystallography and/or NMR. The entries are fully linked to related entries in the nucleotide, protein, citation, and genome divisions.
In order to get the most out of the 3D structures, you will need to install a "helper application" to view and explore the MMDB structure files. Entrez supports two different helpers, RasMol and Kinemage. Both are available in versions that run on Macintosh, Windows, and Unix systems. You will need to obtain and install one of these software packages, then configure your browser to launch it automatically to view a structure file. Full instructions can be found at Entrez's MMDB FAQ (frequently asked questions) page.
The search interface to the 3D structures division is nearly identical to the one used for nucleotide and protein sequences. You enter one or more keywords into a text field and press the "Search" button, optionally limiting the scope of the search to a particular field. However, the retrieved entries will contain two links that we haven't seen before, Structure Summary and XX structure neighbors. The first link retrieves a page that describes the entry's structure in a standardized format. The second link indicates the presence of one or more entries that are structurally "similar" to the entry. "Similarity", in the case of 3D structures, is determined by an algorithm that measures the two molecules' volume of overlap.
Searching for the term "sushi" in this case was ineffective, but
searching for "serine protease" was more productive, recovering 136
entries with structural information. Selecting the "Structure
Summary" link for any of the matching entries retrieves a page that
gives information on the structural determination method and its
citation. A series of popup menus and push buttons allow you to
retrieve the 3D structure in a variety of formats. Selecting "RasMol"
format (assuming that the RasMol viewer is installed) and pressing the
"View" button launches the helper application (Figure 9). You are now
free to rotate the image with the mouse, magnify it, adjust various
display options, and save the structure to local disk for further
exploration.
From the NCBI home page, select the link labeled "Gene Map of the
Human Genome." This leads you to a brightly colored page that offers
a series of ideograms of human chromosomes. There are several ways
to search this database. If the region you are interested in is
defined cytogenetically, just click on the ideogram in the desired
region. A page like that shown in Figure 10 will appear showing a list of all mapped
expressed sequences in the area. Selecting the GenBank accession
numbers of the retrieved sequences will bring up pages with further
information about the sequences and how they were mapped.
Alternatively, if the region of interest is defined by markers on the
Genethon genetic map of the human genome, you can search for all
expressed sequences located between any pair of Genethon markers.
There is also a more flexible, but less obvious interface to the gene map database. Look for an inconspicuous link labeled "Research Tools Page" at the bottom of the gene map home page. This will lead you to a page that links to various types of searches, including text-based searches and sequence searches. The latter search is of particular interest. It prompts you to paste a new unknown sequence into a text field. When you press the search button, the NCBI server performs a BLAST sequence similarity search against all the expressed sequences on the gene map. This is a rapid way to find sequence similarities to previously mapped expressed sequences, and may be helpful in certain positional cloning strategies.
Closely related to the gene map is the NCBI UniGene set, a collection of human transcripts that have been clustered in an attempt to create a set of unique expressed sequences (Chapter 9). The UniGene set was compiled from two primary sources, random cDNA sequencing efforts from Washington University and elsewhere (dbEST), and published genes from GenBank. UniGene can be browsed from the NCBI home page by following the link labeled "UniGene: Unique Human Gene Sequence Collection". There is both a map-oriented search facility that allows you to list expressed sequences that have been placed on a particular chromosome and a keyword search facility. The difference between the map search facilities offered in the Gene Map and UniGene pages is that the latter includes expressed sequences that have been assigned to chromosomes but not otherwise ordered.
Both search interfaces will eventually generate a listing of matching UniGene clusters, along with a short phrase that describes each one. Select the clusters of interest in order to see the individual dbEST and GenBank entries that comprise the set. You can then browse individual sequence entries.
The SRS ("Sequence Retrieval System") is a Web-based system for searching among multiple sequence databases supported by the European Molecular Biology Laboratory (EMBL). In addition to the large EMBL sequence database, it cross references sequence information from approximately 40 other sequence databases (the precise number is slowly increasing). Among these databases are ones that hold protein and nucleotide sequence information, 3D structure, disease and phenotype information, and functional information.
The SRS system is replicated among multiple sites in order to
distribute network load. You can access it directly from its home
page in Heidelberg, Germany, or follow a link from this page to locate
the site nearest you (there are servers in Europe, Asia, the Pacific,
and South America; North America is conspicuously absent). To begin a
search, connect to an SRS site (Figure 11). You can use the Heidelberg server or one
of its replicated sites, several of which you'll find via links in the
WWW Virtual Library. Closer sites may have better response times.
The SRS home page contains links that you can follow to learn more about the SRS service. Click "Start" to begin a new SRS searching session.
You'll be asked to select the databases to search (not shown). There are 40-odd checkboxes on this page, each corresponding to a different source database. This may seem formidable at first, but fortunately the databases have been grouped by category into nucleotide sequence-related, protein-related, and so on. Check off the databases you're interested in searching. If you aren't familiar with a particular database, click its name to obtain a brief description. In our example, we'll search the motif databases for proteins containing the zinc finger motif. We select the Prosite and Blocks databases and press "Continue" to move to the next page.
This takes us to a page similar to the one shown in Figure 12. The most
important elements are a set of text fields at the bottom of the page,
each one corresponding to a different field in the database(s) selected.
The number and nature of the fields will depend on which databases are
selected. If multiple databases are selected, SRS only displays the
fields that are shared in common among them all. This can be a trap
for the unwary: if you select too many databases to search at once,
you may find that the only field displayed is the (usually unhelpful)
entry ID field. Page back and unselect some databases, or uncheck the
option labeled "Show only fields that selected databanks have in
common."
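The "fields in common" behavior amounts to a set intersection over the selected databases. The following is only a sketch of that idea, not SRS's own code: the three shared field names come from the example in this chapter, while the extra fields and the third database entry are hypothetical placeholders.

```python
# Hypothetical field catalog. Only "ID", "AccNumber" and "Description"
# are taken from the chapter's example; the rest are made up for illustration.
fields_by_database = {
    "PROSITE": {"ID", "AccNumber", "Description", "Pattern"},
    "BLOCKS": {"ID", "AccNumber", "Description", "Width"},
    "SOME_OTHER_DB": {"ID", "Organism"},
}


def common_fields(selected, catalog):
    """Return the fields shared by every selected database."""
    field_sets = [catalog[name] for name in selected]
    if not field_sets:
        return set()
    return set.intersection(*field_sets)


print(sorted(common_fields(["PROSITE", "BLOCKS"], fields_by_database)))
print(sorted(common_fields(["PROSITE", "BLOCKS", "SOME_OTHER_DB"], fields_by_database)))
```

With two closely related databases three fields survive; adding a dissimilar third database leaves only the entry ID, which is the trap described above.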
Our example shows three fields labeled "ID", "AccNumber" and "Description". We can click on the name of each field in order to learn more about it, but in this case, it seems obvious that "Description" is the one we want. We type in "zinc" and press the "Do Query" button.
This search results in a list of 65 matches (Figure 13) to entries in
one or more of the selected databases. The format of each match is
the name of the database, e.g. PROSITE, followed by a
colon and the ID of the entry. We can now:
The more interesting operation is to select the checkboxes of a series of entries, then press the "link" button. This performs a search of the other databases in the SRS system and returns all entries that are cross-referenced with the selected entries. In our example, we select the C2H2 and C3HC4 zinc finger domains and press "link", taking us to a page that prompts us to select the databases to link to (not shown). We select the "Swissprot", "EMBL" and "Genbank" sequence databases and press "continue".
The resulting page (Figure
14) lists 861 matches. We can scan through individual entries, or
repeat the linking process to expand the scope of the search still
further.
Other options on the search results page allow you to create and download reports on the selected matches in a variety of formats. Select the preferred format from the page's popup menu, and click either the "save" button to download the report to your local disk, or "view" to see the report in the browser. The report options available to you depend on the databases you have selected. Some databases offer only a simple text-only report; others offer more options. For example, the protein databases offer a fancy on-screen Java hydrophobicity chart. Other reports offer the ability to search the databases by sequence similarity using either the FASTA or Smith-Waterman algorithms.
To access GDB, connect to its home page (Figure 15). GDB offers several different ways to
search the maps:
Searches that recover individual map markers and clones will display
them in a list of hypertext links similar to those displayed by Entrez
and PDB. When you select an entry you'll be shown a page similar to
Figure 16. Links on the
page lead to citation information, information on maps this reagent
has been assigned to, and cross-references to the GenBank sequence for
the marker or clone. GDB holds no primary sequence information, but
the Web's ability to interconnect databases makes this almost
unnoticeable.
A more interesting interface appears when a search recovers a map. In
this case, GDB launches a Java applet to display it. If multiple maps
are retrieved by the search, the maps are aligned and displayed
side-by-side (Figure
17). A variety of settings allow you to adjust the appearance of
the map, as well as to turn certain maps on and off. Double-clicking
on any map element will display its GDB entry in a separate window.
Notable sites in this category include:
Many Web sites offer BLAST search interfaces, including SRS (discussed above) and the NCSA Biologist's Workbench (discussed later). Probably the most widely used interface is the one offered by the NCBI. To use this interface, connect to the NCBI BLAST search page and select the type of search to perform. NCBI offers both "basic" and "advanced" searches. The first uses sensible default parameters for the search. The latter allows you to fine-tune the BLAST search parameters, something only recommended if you fully understand what you're doing (extensive on-line documentation on the BLAST algorithm is available at the NCBI site). Because of the recent release of large amounts of TIGR-specific EST data at the time this was written, NCBI was also offering BLAST searches restricted to the TIGR data set. This may no longer be available by the time you read this.
For the purposes of our tour, we'll select the link pointing to the
"basic" search, arriving at the page shown in Figure 18. The BLAST
interface allows us to search for sequences using one of several
different algorithms, selected from a pop-up menu near the top of the
page. The algorithms are described in more detail in the on-line
documentation, but I summarize them here for convenience:
In addition to selecting the algorithm, you'll also be asked to select the database to search. You may search the default "nr" database, which contains a list of all non-redundant nucleotide sequences known to GenBank, or restrict the search to various species-specific collections, ESTs, or to new entries submitted during the past month. We leave the default at "nr".
The next task is to enter the sequence itself. The BLAST interface offers two ways to specify a sequence. You may cut and paste the sequence directly into the large text field in the center of the page. Alternatively, if the search sequence is already a part of GenBank, you can select "Accession or GI" from the pop-up menu above the text field and enter the sequence's GenBank accession number in the text area.
If you choose to enter the raw sequence, you must be careful to use FASTA format. This format begins with the name of the sequence on the top line, preceded by a ">" sign. Following this is the sequence itself, which should contain no line numbers or spaces. The figure shows an example of a valid search sequence.
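If your sequence comes from a source that adds line numbers or spacing, a few lines of script can put it into FASTA form before you paste it. The sketch below is one way to do this under stated assumptions: the function name `to_fasta`, the 60-character line width, and the short placeholder sequence are illustrative choices, not requirements of the BLAST server.

```python
import re


def to_fasta(name, raw_sequence, width=60):
    """Return a FASTA record: a '>' header line followed by the bare sequence."""
    # Remove digits (line numbers) and all white space from the pasted text.
    residues = re.sub(r"[\d\s]", "", raw_sequence).upper()
    if not residues.isalpha():
        raise ValueError("sequence is empty or contains non-letter characters")
    # Break the sequence into fixed-width lines for readability (assumed width).
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines)


# Placeholder input with line numbers and spaces, as copied from a flat file.
record = to_fasta("blunderglobin", "  1 atggtgcacc tgactcctga\n 21 ggagaagtct")
print(record)
```

The header line carries the sequence name after the ">" sign, and the remaining lines hold only sequence letters.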
We fill in the text field with our search sequence ("blunderglobin")
and press the "Submit Query" button located above the text area. A
few seconds later our query returns with a list of possible matches
(Figure 19), ordered so
that the most similar sequences are located at the top. We can now
click on the links for the matches in order to view their GenBank
entries.
The BLAST server may be slow during periods of heavy usage. At the bottom of the search page a pair of checkboxes and an additional "Submit Query" button allow you to have the BLAST server send the results of the search to your e-mail address. This allows you to launch several searches without waiting for each one to complete.
To use this resource, connect to the Whitehead's home page and select
the link labeled "WWW Primer Picking." The interface is
straightforward (Figure
20). Paste the DNA sequence into the large text field at the top
of the form and press the button labeled "Pick Primers." The sequence
should contain only the characters AGCTN and white space. Case is
ignored. Although the program can handle large sequences, it is wise
not to paste in sequences much longer than you need. There's no need
to enter 20K of sequence in order to generate an STS 200 bp long.
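The input rules above (only A, G, C, T, N and white space, case ignored, no more sequence than needed) can be checked before pasting. This is a sketch under stated assumptions: both function names and the 500-base default flank are hypothetical and are not part of the Whitehead program.

```python
def clean_for_primer_picking(text):
    """Strip white space, upper-case, and reject characters outside AGCTN."""
    sequence = "".join(text.split()).upper()
    unexpected = sorted(set(sequence) - set("AGCTN"))
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    return sequence


def trim_around(sequence, center, flank=500):
    """Keep a window of up to `flank` bases on each side of a 0-based position."""
    start = max(0, center - flank)
    return sequence[start:center + flank]


cleaned = clean_for_primer_picking("acg tn\nAG")
print(cleaned)
```

Trimming a long clone sequence to a window around the region of interest keeps the submission small without changing which STS can be picked inside that window.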
In this example, we've pasted in the sequence for our unknown "blunderglobin" gene. After pressing the primer picking button, the program offers us five sets of primer pairs that define STSs ranging in size from 117 to 256 bp. We can accept these or page back to the previous page to change the primer picking parameters. The default parameters pick primers that satisfy PCR conditions used at the Whitehead and many other laboratories. However, all parameters are adjustable. You may adjust the PCR conditions, the preferred PCR product size, and the stringency of the primer picking. You may also designate regions that the program shall exclude from primer picking, or which it will attempt to include in the PCR product. Another option allows you to pick a third oligonucleotide within the PCR product for the purpose of certain protocols that use hybridization to detect the product.
The Baylor College of Medicine's Gene Finder program is the most straightforward of the exon prediction programs. To use it, connect to the Baylor Molecular Biology Computational Resources page and follow the links labeled "Services on the Web" and "The BCM Genefinder" (not shown). Paste the sequence into the large text field at the top of the page, and enter its name in the small text field where indicated. For certain long-running algorithms you are also asked to enter your e-mail address so that the results can be sent to you off-line. Sequences of up to 7 kb are accepted. For longer sequences, the Gene Finder page instructs you to use an e-mail interface instead.
This is the easy part. The hard part is choosing the exon prediction algorithm and parameter set from among the 20-odd possibilities that Gene Finder's page offers. While specific instructions are given on the search page, the default of "FGENEH" is most suitable for human genomic sequences. Other algorithms are better tuned for invertebrates, prokaryotes and fungi.
To test Gene Finder, we paste the first 4K of the human apolipoprotein CI sequence into the field and press "Perform Search." The result is shown below:
Name: ApoCI
First three lines of sequence:
TATCGCATGCAGCCCCCAGTCACGCATCCCCTGCTTGTTCAATCGATCACGACCCTCTCACGTGCACCCACTTAG
AGTTGTGAGCCCTTAAAAGGAACAGGGATTGCTCACTCGGGGAGCTCGGCTCTTGAGACAGGAATCTTGCCCATT
CCCCGAACGAATAAACCCCTTCCTTCGTTAACTCAGCGTCTGAGGAATTTTGTCTGCGGCTCCTCCTGCTACATT
fgeneh Fri Jul 11 11:14:04 CDT 1997 ApoCI
Nucleotides which are not A,C,G,T,R or Y were removed from your sequence.
length of sequence - 2401
number of predicted exons - 3
positions of predicted exons:
468 - 529 w= 7.68
664 - 741 w= 13.76
1986 - 2296 w= 5.94
Length of Coding region- 451bp Amino acid sequence - 149aa
LIKVLRAGQDLPTKPSSKDSECPSGLAMRLFLSLPVLVVVLSIVLEGPAPAQGTPDVSSA
LDKLKEFGNTLEDKARELISRIKQSELSAKMRLEPFPGHGRAGVCFWVEPWQMVQDEQIE
KKTSPGEADNIPLVTQLDLKVLRLQGQFP*
In this case Gene Finder identified three potential exons in this
sequence. The second two correspond to known exons in the ApoCI gene.
The first, however, is a false hit. It spans an area overlapping an
untranscribed area and the 5' untranslated region. A 60% accuracy
rate is typical of the current generation of exon identification
tools.
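The numbers in the report can be cross-checked with a short script. The sketch below parses the three exon lines quoted above and sums their lengths; the regular expression is an assumption about the layout of those lines, not a documented Gene Finder output format.

```python
import re

# The exon lines as they appear in the report above.
report = """\
468 - 529 w= 7.68
664 - 741 w= 13.76
1986 - 2296 w= 5.94
"""


def parse_exons(text):
    """Extract (start, end, weight) tuples from 'start - end w= weight' lines."""
    exons = []
    for match in re.finditer(r"(\d+)\s*-\s*(\d+)\s+w=\s*([\d.]+)", text):
        exons.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))
    return exons


exons = parse_exons(report)
# Coordinates are inclusive, so each exon spans end - start + 1 bases.
coding_length = sum(end - start + 1 for start, end, _ in exons)
codons = coding_length // 3
print(coding_length, codons)  # 451 150
```

The three exons add up to the 451 bp coding region in the report, and 150 complete codons correspond to the 149 amino acids plus the terminating stop shown at the end of the translation.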
In addition to the exon prediction service, Baylor offers a number of on-line tools for molecular biologists, including protein secondary structure prediction, sequence alignment, and a service that launches sequence similarity searches on a number of databases.
Another exon predictor is GRAIL (Gene Recognition and Assembly
Internet Link), a service provided by the Oak Ridge National
Laboratory. The GRAIL engine can detect other features in addition to
exons, including polyadenylation sites, repeat sequences, and CpG
islands. The interface is relatively simple (Figure 21). Choose the
feature you wish to search for, and paste the sequence into the text
field at the bottom of the page (scrolled out of view in the
screenshot). Alternatively, a "file upload" button allows you to load
the sequence directly from a text-only file on your local disk. The
exact nature of each feature that GRAIL can search for is described in
detail in the program's on-line manual.
Selecting "Grail 2 Exons" from the list of features and repeating our experiment with ApoCI gave the following results:
[grail2exons -> Exons]
St Fr Start End ORFstart ORFend Score Quality
1- f 2 664 741 636 821 100.000 excellent
2- f 1 1986 2121 1814 2296 98.000 excellent
[grail2exons -> Exon Translations]
3- SPEPLPLPPECPSGLAMRLFLSLPVLVVVLSIVLEGKSGMGELGS
4- FEPLPIFLAGPAPAQGTPDVSSALDKLKEFGNTLEDKARELISRIKQSEL
SAKMRLEPFPGHGR
In this case, the two correct exons were identified.
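When two predictors have been run on the same sequence, comparing their coordinates is a quick way to see where they agree. The sketch below uses the exon positions from the two reports above; the helper names are hypothetical, and treating any overlap as agreement is a deliberately simple criterion chosen for illustration.

```python
# Inclusive exon coordinates taken from the Gene Finder and GRAIL reports above.
gene_finder = [(468, 529), (664, 741), (1986, 2296)]
grail = [(664, 741), (1986, 2121)]


def overlap(a, b):
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def supported(predictions, reference):
    """Predictions that overlap at least one interval in the reference set."""
    return [p for p in predictions if any(overlap(p, r) > 0 for r in reference)]


agreed = supported(gene_finder, grail)
gene_finder_only = [p for p in gene_finder if p not in agreed]
print(agreed)
print(gene_finder_only)
```

The exon reported only by Gene Finder is the one at 468-529, which is the false hit discussed earlier; the other two predictions are supported by both programs, although the boundaries of the last exon differ.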
In order to use Workbench you will need to register an account with the Workbench server. This is because Workbench allows you to save personal project data on the server itself. Your account name and password provide a way to return to the data and ensure a degree of privacy. To create an account, go to the Biology Workbench home page (not shown), and select the link labeled "Account Set-Up." You will be prompted for a login name and password.
After the account is created, you will be able to enter the service by
following the link labeled "Welcome to the NCSA Biology Workbench."
Figure 22 shows the main
Biology Workbench screen. The menu bar at the top of the page
contains five subdivisions labeled "Session Tools", "Protein Tools",
"Nucleic Tools", "Alignment Tools" and "Report Bugs." The Protein,
Nucleic acid and Alignment tool buttons lead to pages that run various
analytical programs. "Session tools" allows you to save your work to
a named "session", log your actions, and restore an old session at
some later date. The meaning of "Report Bugs" should be obvious.
Workbench is confusing at first because of its many options. Once you understand its style, however, it is easy to use. The general strategy for a Workbench session is as follows:
Selecting one or more sequences to analyze. Once
sequences have been imported into Workbench, they will appear in a
list below the menu bar (Figure 24). To the left of each sequence is a
checkbox. To select the sequence(s) to analyze, just check the
appropriate box.
Selecting the analysis to perform. The list of analyses spans the spectrum from sequence similarity searches to alignments to protein secondary structure prediction. What is displayed in the scrolling list depends on which of the major subdivisions you've selected. Only one analysis can be selected at a time. Some analyses require one sequence only to be selected, while others require two or more. In the example shown in Figure 23, we've imported and selected both the mRNA and genomic sequences for the human ApoCI gene and will be using the CLUSTALW algorithm to obtain a sequence alignment.
Run the analysis. Press the button labeled "Perform Selected Operation." Depending on the analysis, you may now be asked to view and adjust some of its parameters.
The format of the output depends on the analysis. In the case of the
attempted alignment between the genomic and mRNA ApoCI sequences,
Workbench produces a large text file showing the expected alignment
between the two sequences. A button at the bottom of the output
prompts us to import this alignment file back into Workbench. Doing
so gives us an "alignment" object which we can then view with one of
Workbench's alignment display tools (Figure 24).
Interestingly (but not too surprisingly), the coding sequences that GenBank's two entries give for this gene are not quite the same. Regrettably there is no existing on-line artificial intelligence service that will help sort out this type of problem!
To use either of the radiation hybrid mapping services, you must obtain DNAs from the same radiation hybrid screening panel that was used to construct the map (see Chapter 6 for full details on radiation hybrid mapping). The Whitehead map was constructed from the Genebridge 4 RH panel, while the Stanford map used the higher-resolution G3 mapping panel. DNAs are available from a number of biotech supply houses, including Research Genetics of Huntsville, Alabama. Each STS to be mapped must be amplified on the DNAs from the hybrid panel, then scored on agarose or acrylamide gel. For best results, all amplifications should be done in duplicate; results that are discrepant should be repeated or treated as unknown.
To place STSs on the Whitehead map, reformat the hybrid panel screening results in standard "radiation hybrid vector" format. The format looks like this:
sts_name1 0010010110000010000000110100011011100111001010012110011101010101001010001010001100011000011
sts_name2 0000011110000010000000110100000011100111001010012110011101010101001000001010001100011000011
...

Each digit is the result of the PCR on one of the radiation hybrid cell lines. "0" indicates that the PCR was negative (no reaction product), "1" indicates that it was positive, and "2" is used for "unknown" or "not done". The order of digits in the vector is important, and must correspond to the official order of the Genebridge 4 radiation hybrid panel. The correct order is given in the help page of the Whitehead server (see below), and is identical to the order in which the DNAs are packaged when they are shipped by Research Genetics. You can place spaces within the vector in order to increase readability. The STS name should be separated from the screening data with one or more spaces or tabs.
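Before pasting data into the server, it can be worth checking that every line follows the format just described. The following Python sketch is one way to do that; it is not part of the Whitehead service. The function name `parse_rh_vectors` and the optional `panel_size` argument are assumptions made for illustration. The expected vector length is deliberately not hard-coded, so supply it yourself from the panel documentation if you want lengths checked.

```python
from typing import Dict, Optional

VALID_CODES = set("012")  # 0 = negative, 1 = positive, 2 = unknown / not done


def parse_rh_vectors(text: str, panel_size: Optional[int] = None) -> Dict[str, str]:
    """Parse 'name vector' lines and return {sts_name: vector_without_spaces}."""
    vectors = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: no screening data after STS name")
        name = fields[0]
        # Spaces inside the vector are allowed for readability, so rejoin them.
        vector = "".join(fields[1:])
        bad = set(vector) - VALID_CODES
        if bad:
            raise ValueError(f"line {line_number}: unexpected characters {sorted(bad)}")
        if panel_size is not None and len(vector) != panel_size:
            raise ValueError(
                f"line {line_number}: {len(vector)} scores, expected {panel_size}"
            )
        vectors[name] = vector
    return vectors


if __name__ == "__main__":
    # Toy eight-hybrid example; real vectors have one digit per panel cell line.
    sample = "sts_a 0010 0121\nsts_b\t00000000\n"
    print(parse_rh_vectors(sample, panel_size=8))
```

A check like this catches stray characters and truncated vectors, but it cannot tell whether the digits are in the official panel order; that still has to be verified against the server's help page.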
From the Whitehead home page, follow the link labeled "Map STSs
relative to the human radiation hybrid map" (Figure 25). Enter your
e-mail address where indicated, and cut and paste the PCR scores into
the large textfield at the top of the page. It is important that you
enter the correct e-mail address, as this is the only way in which you
can be informed of the mapping results.
By default, the mapping results are returned in text form. If you wish to generate graphical pictures of the STSs placed on the Whitehead map, you must select the desired graphics format. Currently the PICT and GIF formats are available. The former is appropriate if you are using a Macintosh system. The latter is appropriate for Windows and other systems. Select the graphics format by choosing the appropriate radio button from the labeled set (scrolled out of sight in the figure).
When you are satisfied with the settings, press the "Submit" button. You will receive a confirmation that the data has been submitted for mapping. The results will be returned to you via e-mail shortly (if the server is loaded, however, it may take several hours). If the STS was successfully mapped, the e-mail will list the chromosome it linked to, and its position relative to other markers on the Whitehead map. If requested, you will also receive a picture of the map (with the location of the newly mapped STSs marked in red) as an e-mail enclosure.
The Whitehead also offers access to its STS content-based physical map of the human genome. If you have screened one or more STSs against the CEPH mega-YAC library (see Chapter 2), you can use a search page located at the Whitehead site to determine which YAC contigs contain the YACs hit by your STSs. From this you can infer the position of the STSs relative to the Whitehead map. You can access this service by connecting to the Whitehead home page. Then follow the links labeled "Human Physical Mapping Project" and "Search for a YAC by its address".
To place STSs on the Stanford RH map, prepare your data in a similar way, but using the G3 mapping panel. The other important difference is that the PCR result vectors should use an "R" rather than a "2" to indicate missing or discrepant data. For the Stanford service, data vectors should not contain white space.
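If your scoring sheets already follow the Whitehead conventions, the two notational differences can be handled mechanically. The Python sketch below uses a hypothetical helper, `to_stanford_vector`, that strips white space and rewrites "2" as "R". It changes only the notation: the scores themselves must still come from the G3 panel, since results from one hybrid panel cannot be converted into results for another.

```python
def to_stanford_vector(vector: str) -> str:
    """Rewrite a 0/1/2 score vector using the Stanford conventions.

    Hypothetical helper: removes all white space and replaces the
    unknown code "2" with "R". It does not reorder or rescore anything.
    """
    compact = "".join(vector.split())  # drop spaces, tabs and newlines
    return compact.replace("2", "R")


if __name__ == "__main__":
    # Toy ten-hybrid example with readability spaces and one unknown score.
    print(to_stanford_vector("0010 0121 10"))
```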
Connect to the Stanford Genome Center's home page and follow the links to "RH Server" and then to "RHServer Web Submission." Enter your e-mail address and a reference number in the indicated fields. The e-mail address is vital to receive the mapping results. The reference number is an optional field that will be returned to you with the results and is intended to help you keep the results organized. If known, also enter the STS's chromosomal assignment into the field labeled Chromosome number. This information increases the ability of the mapping software to detect a valid linkage.
Now cut and paste the screening results into the large textfield and press the "Submit" button. Mapping results are typically returned via e-mail within a few minutes. The Stanford server returns the mapping results as a series of placements relative to genetic markers. For each STS, the server reports the closest genetic marker, its chromosome, and the distance, in centiRays, from the marker to the STS. Although no graphical display is provided for the mapping results, the retrieved information can be used in conjunction with the browsable maps available at the Stanford site in order to infer the location of the newly mapped STS relative to other STSs on the Stanford radiation hybrid map.
We can look forward to some interesting times ahead.
Figure 2: The NCBI home page provides access to the huge GenBank sequence database.
Figure 3: The Entrez search engine provides access to GenBank's bibliographic, nucleotide, protein, structural and genome divisions.
Figure 4: Searching the nucleotide database for entries that refer to "sushi".
Figure 5: The "sushi" search finds 8 documents. We can either view them, or refine the search further.
Figure 6: Entrez presents search results as a list of hotlinks to GenBank entries.
Figure 7: The GenBank entry for accession number U78093.
Figure 8: The genomes division of Entrez has a graphical interface based on alignments among multiple maps.
Figure 9: Entrez's structural division uses external viewers to display and rotate 3D protein models.
Figure 10: The NCBI gene map allows you to search for expressed genes by name or position.
Figure 11: The SRS sequence search system links 40 different molecular biology databases.
Figure 12: SRS search pages allow you to perform structured (field-based) queries on one or more databases.
Figure 13: SRS displays search results as a series of hypertext links. Clicking on the buttons at top broadens the search to other databases by bringing in cross-references.
Figure 14: After broadening the SRS search shown in the previous figure, SRS now brings in entries from SwissProt and other databases.
Figure 15: The GDB home page provides access to the main repository for human genome mapping information.
Figure 16: GDB displays most entries using a text format like that shown here.
Figure 17: GDB maps are displayed using an interactive Java applet.
Figure 18: NCBI's BLAST page provides rapid sequence similarity searches for both protein and nucleotide sequences.
Figure 19: Searching for a match to the imaginary "blunderglobin" sequence using BLAST.
Figure 20: Primer picking with the Whitehead Institute's PRIMER tool.
Figure 21: The GRAIL site provides on-line nucleotide sequence feature finding services. The checkboxes allow you to select which features to search for.
Figure 22: The NCSA Biology Workbench has four main analytic subdivisions, selected among using the menu buttons at the top.
Figure 23: Performing an analysis with Biology Workbench is a matter of selecting the sequences to analyze and the analytic program to run.
Figure 24: A sequence alignment produced by Biology Workbench.
Figure 25: The Whitehead radiation hybrid mapping service allows you to place new STSs on the Whitehead radiation hybrid map by pasting in PCR amplification data.