Distributed Sequence Annotation System (DAS)

Version 0.97

Lincoln D. Stein, Sean Eddy, Robin Dowell

July 26, 2000

This is a working document describing the rationale and protocol for a distributed sequence annotation system.

News

May 25, 2000 - Mailing List Created
A mailing list for DAS has been created at EBI. To subscribe, send mail to majordomo@alpha1.ebi.ac.uk with the command subscribe das in the body (not the subject line) of the mail message.

Rationale

The pace of human genomic sequencing has outstripped the ability of sequencing centers to annotate and understand the sequence prior to submitting it to the archival databases. Multiple third-party groups have stepped into the breach and are currently annotating the human sequence with a combination of computational and experimental methods. Their analytic tools, data models, and visualization methods are diverse, and it is self-evident that this diversity enhances, rather than diminishes, the value of their work.

The main risk of third-party annotation is that it may fracture knowledge about the genome. Instead of having a convenient one-stop source for genomic annotation, such as Entrez, researchers may have to check multiple Web sites for information about a particular region of interest, download the data in several different formats, and perform a manual integration in order to get the whole picture. Clearly, this is undesirable.

There are several possible approaches to this problem. One is for each of the annotation centers to submit their annotations to a centralized database, such as GenBank. However, this option raises a number of political and technical problems, not the least of which is the long-held tradition of GenBank and its sister databases of allowing only the sequence submitter to modify or comment on a GenBank entry. Another option would be a system which uses Web links to point from the GenBank entry to one or more annotation Web sites. Such a system is available now in the form of the NCBI LinkOut service. However, while this makes it easier for researchers to find third-party annotation sites, it does not solve the problem of data integration.

The solution that we advocate allows sequence annotation to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. A single server is designated the "reference server." It serves essential structural information about the genome: the physical map which relates one entry to another (where an "entry" is an arbitrary segment of the sequence, such as a sequenced BAC or a contig), the DNA sequence for each entry, and the standard authorship information. Multiple sites then act as third-party "annotation servers." Using a web browser-like application, researchers can interrogate one or more annotation servers to retrieve features in a region of interest. The servers return the results using a standard data format, allowing the sequence browser to integrate the annotations and display them in graphical or tabular form. No attempt is made to automatically resolve contradictions between different third-party annotations. Indeed, it is the ability to facilitate comparison among different centers' annotations that distinguishes this proposal. We currently have a working prototype of this system based on ACeDB servers and CGI scripts, and are now generalizing this architecture to support other client and server combinations.

The key development that is necessary for a successful distributed annotation system is the adoption of a standard format to describe sequence features. While almost any one of the existing standards could be adapted for this purpose, certain characteristics are very desirable:

  1. Handling of multiple levels of relative coordinates

    In the ideal world, the genome would be finished to the base pair, and we would be able to unambiguously refer to an annotation based on its position from the top of the chromosome. This will not happen for a very long time. For the conceivable future, the genome will consist of multiple segments of high confidence, related to one another by mapping information of lower confidence. In order to deal with annotations in this dynamic and changeable environment, the format must be able to deal with relative coordinates in which annotations are related to arbitrary hierarchical landmarks. For example, a "clone end" annotation may be related to the start of a contig, an "mRNA" annotation may be related to the clone end, and an "exon" annotation may be related to the start of the mRNA.

  2. Easily generated and parsed

    Experience has shown that it is difficult to convince groups to adopt complex and sophisticated data formats. For this reason, a "lowest common denominator" format is desirable, even if it sacrifices some of the expressiveness of the more sophisticated formats. A human-readable format, such as tab-delimited tables, XML, or even ".ace" format, is also desirable.

  3. Extensibility

    Any format must be extensible to allow for new types of annotations. Specifically, we feel that it is desirable to create a category of annotation that has to do with the availability of experimental data concerning the region of interest. For example, the format should allow a researcher to note the presence of RNAi results overlapping the region of interest. The format should also provide a mechanism for pointing the researcher to a location where he or she can get more information about a selected annotation. In the ACeDB-based system, each annotation contains a pointer into an ACeDB entry somewhere on the Internet. This entry is in turn linked to related biological and experimental information.

  4. Functional groupings of annotations

    To further enhance the extensibility of the format, it is desirable to group specific annotations into functional categories rather than maintaining an unsorted "laundry list" of feature types. For example, splice sites, polyA signals, introns and exons are all annotations having to do with a generic "mRNA" category, while clone ends, primer pairs, and hybridization probes are "structural" features. Grouping annotations into conceptual categories makes the data more manageable, and facilitates formulating biologically relevant queries on the annotation servers.

The remainder of this document describes a simple client/server system that satisfies many of these requirements. Information on the ACEDB-prototype and a text-only demo can be found here.

Description of the System

This section provides a high-level view of the system architecture.

The Reference Sequence

The distributed annotation system (DAS) relies on there being a common "reference sequence" on which to base annotations. The reference sequence consists of a set of "entry points" into the sequence, and the lengths of each entry point. The identity of an entry point will vary from genome to genome. For some genome projects, entry points correspond to entire chromosomes. For others, entry points may be a series of contigs.

It is possible for each entry point to have a substructure, basically a series of subsequences and their starting and ending points. This structure is recursive. Annotations take the form of an assertion about a region of the reference sequence. Each annotation is unambiguously located by providing its position as the start and stop positions relative to a "reference sequence." The reference sequence can be one of the entry points, or any of the subsequences within the entry point.

To give a concrete example, the C. elegans reference map consists of six chromosome-length entry points. Each chromosome is formed from several contigs called "superlinks", and each superlink contains one or more smaller contigs called "links". Links in turn are composed of one or more fully-sequenced clones. One could refer to an annotation by specifying its start or stop positions in clone, link, superlink, or chromosome coordinates. The distributed annotation system automatically converts any coordinate system into any other. Because coordinates within clones are more stable to revisions than coordinates within links or chromosomes, it is recommended that annotation coordinates be stored relative to the smallest sequencing unit.
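
The conversion between coordinate systems can be pictured as walking up the hierarchy and adding offsets. The following Python sketch assumes a hypothetical in-memory map in which each subsequence records its parent and its 1-based start position within that parent. The names PARENT and to_ancestor, and all IDs and offsets, are invented for illustration and are not part of DAS. The sketch also ignores subsequences that lie in reverse orientation.

```python
# Hypothetical map: child ID -> (parent ID, 1-based start of child within parent).
# All IDs and offsets below are made up for illustration.
PARENT = {
    "CLONE_A": ("LINK_1", 5001),
    "LINK_1": ("SUPERLINK_X", 20001),
    "SUPERLINK_X": ("CHROMOSOME_I", 100001),
}


def to_ancestor(seq_id, start, stop, ancestor):
    """Convert 1-based (start, stop) on seq_id into coordinates on ancestor."""
    while seq_id != ancestor:
        if seq_id not in PARENT:
            raise ValueError(f"{ancestor} is not an ancestor of the given sequence")
        parent, offset = PARENT[seq_id]
        # Position 1 of the child sits at position `offset` of the parent.
        start += offset - 1
        stop += offset - 1
        seq_id = parent
    return start, stop


print(to_ancestor("CLONE_A", 10, 20, "LINK_1"))
print(to_ancestor("CLONE_A", 10, 20, "CHROMOSOME_I"))
```

Storing only the offset to the immediate parent mirrors the recommendation above: when a link or chromosome is revised, only the offsets of its direct children change, while clone-relative annotation coordinates stay put.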

The hierarchy is extensible. If the C. elegans gene predictions were stable, it would make sense to store certain annotations, such as the positions of exons, relative to the transcriptional unit.

Reference and Annotation Servers

The DAS consists of a reference sequence server, and one or more annotation servers.

The reference sequence server is specialized to provide the reference sequence map and the underlying DNA. The server can provide a list of sequence entry points, and given an entry point can return its structure. The reference server can provide arbitrarily long stretches of DNA given a reference subsequence, start position and stop position, and is capable of translating from one coordinate system to another.

Annotation servers are specialized for returning lists of annotations across a certain region of the genome. Each annotation is anchored to the genome map by way of a start and stop position relative to one of the reference subsequences. Annotations have an ID that is unique to the server and a structured description that describes its nature and attributes. Annotations may also be associated with Web URLs that provide additional human readable information about the annotation.

Annotations have types, methods and categories. The annotation type is selected from a list of types that have biological significance, and correspond roughly to EMBL/GenBank feature table tags. Examples of annotation types include "exon", "intron", "CDS" and "splice3." The annotation method is intended to describe how the annotated feature was discovered, and may include a reference to a software program. The annotation category is a broad functional category that can be used to filter, group and sort annotations. "Homology", "variation" and "transcribed" are all valid categories. The existence of these categories allows researchers to add new annotation types if the existing list is inadequate without entirely losing all semantic value. The Annotation Categories section contains a list of the annotation types in use in the C. elegans project.

It is intended that larger annotation servers provide pointers to human-readable data that describe their types, methods and categories in more detail. Another optional feature of annotation servers is the ability to provide hints to clients on how the annotations should be rendered visually. This is done by returning an XML "stylesheet".

Although the servers are conceptually divided between reference servers and annotation servers, there is in fact no key difference between them. A single server can provide both reference sequence information and annotation information. The main functional difference is that the reference sequence server is required to serve the DNA itself, while annotation servers have no such requirement.

Client/Server Interactions

The DAS is Web-based. Clients query the reference and annotation servers by sending a formatted URL request to the server. This request must follow the conventions of the HTTP/1.0 protocol (see RFC2616). Servers process the request and return a response in the form of a formatted XML document (see W3C Extensible Markup Language).

The Request

All DAS requests take the form of a URL. Each URL has a site-specific prefix, followed by a standardized path and query string. The standardized path begins with the string /das. This is followed by URL components containing the data source name and a command. For example:

http://stein.cshl.org/das/elegans/features?ref=CHROMOSOME_I;start=1000;stop=2000
^^^^^^^^^^^^^^^^^^^^^     ^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 site-specific prefix    data src  command             arguments
In this case, the site-specific prefix is http://stein.cshl.org/. The request begins with the standardized path /das, and the data source, in this case /elegans. This is followed by the command /features, which requests a list of features, and a query string providing named arguments to the /features command.

The data source component allows a single server to provide information on several genomes.
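
As an illustration of the URL layout described above, the following Python sketch assembles a request from its parts. The helper name das_url is an assumption made for this example and is not defined by DAS. Arguments are joined with semicolons, as in the example request.

```python
from urllib.parse import quote


def das_url(prefix, *path, **args):
    """Assemble PREFIX/das/<path components>?name=value;name=value."""
    url = "/".join([prefix.rstrip("/"), "das", *path])
    if args:
        # Named arguments are separated by semicolons, as in the example request.
        url += "?" + ";".join(
            f"{name}={quote(str(value))}" for name, value in args.items()
        )
    return url


print(das_url("http://stein.cshl.org/", "elegans", "features",
              ref="CHROMOSOME_I", start=1000, stop=2000))
print(das_url("http://stein.cshl.org/", "dsn"))
```

The first call reproduces the example request shown above. The second builds a request that has no data source component.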

More information on the format of the request and the various available commands is given below.

The Response

The response from the server to the client consists of a standard HTTP header with DAS status information within that header followed optionally by an XML file that contains the answer to the query. The DAS status portion of the header consists of two lines. The first is X-DAS-Version and gives the current protocol version number, currently DAS/0.95. The second line is X-DAS-Status and contains a three digit status code which indicates the outcome of the request.

Here is an example HTTP header: (provided by Web server)

HTTP/1.1 200 OK                          
Date: Sun, 12 Mar 2000 16:13:51 GMT          
Server: Apache/1.3.6 (Unix) mod_perl/1.19    
Last-Modified: Fri, 18 Feb 2000 20:57:52 GMT 
Connection: close                            
Content-Type: text/plain                     
X-DAS-Version: DAS/0.95
X-DAS-Status: 200
data follows...

The defined status codes are listed in Table 1.

Table 1: DAS response codes
200 OK, data follows
400 Bad command (command not recognized)
401 Bad data source (data source unknown)
402 Bad command arguments (arguments invalid)
403 Bad reference object (reference sequence unknown)
404 Bad stylesheet (requested stylesheet unknown)
405 Coordinate error (sequence coordinate is out of bounds/invalid)
500 Server error, not otherwise specified
501 Unimplemented feature
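
A client would typically inspect the two DAS header lines before attempting to parse the body. The Python sketch below assumes the HTTP headers have already been collected into a dictionary by whatever HTTP library is in use. The function name check_das_headers is hypothetical. The status table is copied from Table 1.

```python
# Status codes and meanings, copied from Table 1.
DAS_STATUS = {
    200: "OK, data follows",
    400: "Bad command (command not recognized)",
    401: "Bad data source (data source unknown)",
    402: "Bad command arguments (arguments invalid)",
    403: "Bad reference object (reference sequence unknown)",
    404: "Bad stylesheet (requested stylesheet unknown)",
    405: "Coordinate error (sequence coordinate is out of bounds/invalid)",
    500: "Server error, not otherwise specified",
    501: "Unimplemented feature",
}


def check_das_headers(headers):
    """Return (version, code, meaning) from a mapping of header names to values."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    version = lowered.get("x-das-version")
    code = int(lowered["x-das-status"])
    return version, code, DAS_STATUS.get(code, "Unknown status code")


example = {
    "Content-Type": "text/plain",
    "X-DAS-Version": "DAS/0.95",
    "X-DAS-Status": "200",
}
print(check_das_headers(example))
```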

The Queries

This section lists the queries recognized by sequence and annotation servers. Each of these queries begins with some site-specific prefix, denoted here as PREFIX. The other meta-variable used in these examples is DSN, which is a symbolic data source. Data sources are standardized across DAS servers in such a way that a data source name has a one-to-one correspondence with a reference sequence.


Retrieve the List of Data Sources

Scope: Sequence and annotation servers.

Command: dsn

Format:

PREFIX/das/dsn
This query returns the list of data sources that are available from this server. Return value: See here.

Retrieve the List of Entry Points for a Data Source

Scope: Sequence and annotation servers.

Command: entry_points

Format:

PREFIX/das/DSN/entry_points[?ref=REF]
This query returns the list of sequence entry points available and their sizes in base pairs.

Arguments:

ref (optional)
If a sequence reference ID is provided in the ref argument, the query will return the components of the sequence (its subsequences) rather than the list of top-level entry point sequences.

type (optional)
For ACEDB servers, the type parameter provides the class of the reference sequence, sequence by default.
Return Value: See here.

Retrieve the DNA Associated with a Subsequence

Scope: Sequence servers.

Command: dna

Format:

PREFIX/das/DSN/dna?ref=REF[;start=X;stop=Y]
This query returns the DNA corresponding to the indicated segment.

Arguments:

ref (required)
The ID of a sequence landmark (an entry point or subsequence).

start (optional)
The start position of the segment, where 1 is the first base pair in the sequence. If this argument is provided stop must also be provided. If not provided, then the entire length of the entry point or subsequence is returned. Zero and negative numbers are acceptable. If the coordinate is off the end of the reference sequence (but not off the end of the genome), the server performs the necessary join.

stop (optional)
The end position of the segment; mandatory if start is provided. If start > stop then the request returns the reverse complement of the segment.

Return Value: See here.
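
The coordinate conventions for this query can be sketched from the server's side as follows. This Python fragment is a simplified illustration, assuming that a start greater than stop requests the reverse complement. It works on a single in-memory sequence, so it rejects coordinates outside that sequence instead of joining neighboring sequences as a full server would. The name fetch_segment is hypothetical.

```python
# Complement table for lower-case bases; n (any base) maps to itself.
COMPLEMENT = str.maketrans("acgtn", "tgcan")


def fetch_segment(sequence, start=None, stop=None):
    """Return the 1-based, inclusive segment; reverse-complement if start > stop."""
    if start is None and stop is None:
        return sequence  # no range given: the whole sequence is returned
    if start is None or stop is None:
        raise ValueError("start and stop must be given together")
    low, high = min(start, stop), max(start, stop)
    if low < 1 or high > len(sequence):
        # A real server would join neighbors here, or answer with status 405.
        raise ValueError("coordinate outside this sequence")
    segment = sequence[low - 1:high]
    if start > stop:
        segment = segment.translate(COMPLEMENT)[::-1]
    return segment


print(fetch_segment("aacgtt", 2, 4))
print(fetch_segment("aacgtt", 4, 2))
```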

Resolve Coordinates

Scope: Sequence and annotation servers.

Command: resolve

Format:

PREFIX/das/DSN/resolve?segment=segmentID;ref=REF
                                     [;start=X;stop=Y]
This query transforms the coordinates of DNA segment segmentID into coordinates relative to the reference sequence REF.

Arguments:

segment (required)
The ID of a sequence landmark or annotation to be transformed.

ref (required)
The reference sequence that defines the coordinate system to be used.

start (optional)
The start position of the segment relative to segmentID.

stop (optional)
The end position of the segment relative to segmentID. This argument is mandatory if start is provided. If start > stop then the request applies to the reverse complement of the segment.
Return Value: See here.

Retrieve the Types Available for a Segment

Scope: Annotation Servers

Command: types

Format:

PREFIX/das/DSN/types?ref=REF
                                      [;start=X;stop=Y]
                                      [;type=TYPEPATTERN]
This query returns the annotation types available for a segment of sequence.

Arguments:

ref (required)
The ID of a sequence landmark (an entry point or subsequence).

start (optional)
The start position of the segment, where 1 is the first base pair in the sequence. If this argument is provided stop must also be provided. If not provided, then the entire length of the entry point or subsequence is returned. Zero and negative numbers are acceptable. If the coordinate is off the end of the reference sequence (but not off the end of the genome), the server performs the necessary join.

stop (optional)
The end position of the segment; mandatory if start is provided. If start > stop then the request applies to the reverse complement of the segment.

type (optional)
A GNU regular expression to be used for filtering annotations on the type field. Regular expressions follow the syntax of GNU regular expressions: see manual page regex(7).
Return Value: See here.

Retrieve the Annotations Across a Segment

Scope: Sequence and annotation Servers

Command: features

Format:

PREFIX/das/DSN/features?ref=REF
                                      [;start=X;stop=Y]
                                      [;type=TYPEPATTERN]
                                      [;category=CATEGORYPATTERN]
                                      [;categorize=yes|no]
This query returns the annotations across a segment of sequence.

Arguments:

ref (required)
The ID of a sequence landmark (an entry point or subsequence).

start (optional)
The start position of the segment, where 1 is the first base pair in the sequence. If this argument is provided stop must also be provided. If not provided, then the entire length of the entry point or subsequence is returned. Zero and negative numbers are acceptable. If the coordinate is off the end of the reference sequence (but not off the end of the genome), the server performs the necessary join.

stop (optional)
The end position of the segment; mandatory if start is provided. If start > stop then the request applies to the reverse complement of the segment.

type (optional)
A GNU regular expression to be used for filtering annotations on the type field. Regular expressions follow the syntax of GNU regular expressions: see manual page regex(7).

category (optional)
A GNU regular expression to be used for filtering annotations by category field. Regular expressions follow the syntax of GNU regular expressions: see manual page regex(7). If both type and category are provided, they are combined by a logical OR.

categorize (optional)
Either "yes" or "no" (default). If "yes", then each annotation will include its functional category.
Return Value: See here. The positions of all returned annotations are given relative to the indicated reference sequence.
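
The way the type and category filters combine can be sketched as a small predicate. This Python example uses the standard re module as a stand-in for GNU regular expressions, which differ in some syntax details. It also assumes that a pattern may match anywhere in the field unless it is anchored. The function name matches is hypothetical.

```python
import re


def matches(feature_type, feature_category, type_pattern=None, category_pattern=None):
    """Decide whether an annotation passes the type/category filters."""
    if type_pattern is None and category_pattern is None:
        return True  # no filter given: every annotation is returned
    type_hit = (type_pattern is not None
                and re.search(type_pattern, feature_type) is not None)
    category_hit = (category_pattern is not None
                    and re.search(category_pattern, feature_category) is not None)
    # When both filters are given they are combined by a logical OR.
    return type_hit or category_hit


print(matches("exon", "transcribed", type_pattern="^intron$"))
print(matches("exon", "transcribed", "^intron$", "^trans"))
```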

Linking to a Feature

Scope: Annotation Servers

Command: link

Format:

PREFIX/das/DSN/link?field=TAG;id=ID
This query can be issued in order to retrieve further human-readable information about an annotation. It is best to pass this URL directly to a browser, as the type of the returned data is not specified (it will typically be an HTML file, but any MIME format is allowed).

Arguments:

field (required)
The field to fetch further information on. Options are:

id (required)
The ID of the indicated annotation field.
Return Value: A web page.

Retrieving the Stylesheet

Scope: Annotation Servers

Command: stylesheet

Format:

PREFIX/das/DSN/stylesheet
This query can be issued to an annotation server in order to retrieve the server's recommendations on formatting annotations retrieved from it. These recommendations are not normative. A viewer is free to use any display format it chooses.

Arguments: None. Return Value: See Annotation Stylesheets.


Returned Documents

This section describes the format of the various documents that are returned in response to DAS queries.

The Data Sources Document

Scope: Sequence and annotation servers.

In Response to Command: dsn

Format:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDSN SYSTEM "dasdsn.dtd">
<DASDSN>
  <DSN>
    <SOURCE id="id1">source name 1</SOURCE>
    <MAPMASTER>URL</MAPMASTER>
    <DESCRIPTION>descriptive text 1</DESCRIPTION>
  </DSN>
  <DSN>
    <SOURCE id="id2">source name 2</SOURCE>
    <MAPMASTER>URL</MAPMASTER>
    <DESCRIPTION href="url">descriptive text 2</DESCRIPTION>
  </DSN>
  ...
</DASDSN>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the dsn query, the doctype DTD is "dasdsn.dtd".

<DASDSN> (required; one only)
The appropriate doctype and root tag is DASDSN.

<DSN> (required; one or more)
There are one or more <DSN> tags, one for each data source. Each <DSN> contains one <SOURCE> tag, one <MAPMASTER> tag, and optionally one <DESCRIPTION> tag.

<SOURCE> (required; one per DSN tag)
This tag indicates the symbolic name for a data source. The symbolic name to use for further requests can be found in the id (required) attribute. The tag body contains a human-readable label which may or may not be different from the ID.

<MAPMASTER> (required; one per DSN tag)
This tag contains the URL (site.specific.prefix/das/data_src) that is being annotated by this data source. For an annotation server, this is the reference server which is being annotated. For a reference server, this would echo its own URL.

<DESCRIPTION> (optional)
This tag contains additional descriptive information about the data source. If an href (optional) attribute is present, the attribute contains a link to further human-readable information about the data source, such as its home page.
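
A client might read this document as follows. This is a minimal Python sketch using the standard library's ElementTree parser, which does not fetch or validate against the DTD. The sample values, including the example.org URL, and the function name parse_dsn are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the format shown above.
SAMPLE_DSN = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDSN SYSTEM "dasdsn.dtd">
<DASDSN>
  <DSN>
    <SOURCE id="elegans">C. elegans</SOURCE>
    <MAPMASTER>http://stein.cshl.org/das/elegans</MAPMASTER>
    <DESCRIPTION href="http://example.org/about">Example description</DESCRIPTION>
  </DSN>
</DASDSN>"""


def parse_dsn(xml_text):
    """Return one dictionary per <DSN> entry."""
    root = ET.fromstring(xml_text)
    sources = []
    for dsn in root.findall("DSN"):
        source = dsn.find("SOURCE")
        description = dsn.find("DESCRIPTION")  # optional tag
        sources.append({
            "id": source.get("id"),
            "label": (source.text or "").strip(),
            "mapmaster": (dsn.findtext("MAPMASTER") or "").strip(),
            "description": None if description is None
                           else (description.text or "").strip(),
            "href": None if description is None else description.get("href"),
        })
    return sources


print(parse_dsn(SAMPLE_DSN))
```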

The Entry Points Document

Scope: Sequence and annotation servers.

In Response to Command: entry_points

Format:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASEP SYSTEM "dasep.dtd">
<DASEP>
  <ENTRY_POINTS href="url" version="X.XX" ref="refid">
    <SEGMENT id="id1" start="start1" stop="stop1">descriptive text</SEGMENT>
    <SEGMENT id="id2" start="start2" stop="stop2">descriptive text</SEGMENT>
    <SEGMENT id="id3" start="start3" stop="stop3">descriptive text</SEGMENT>
    ...
  </ENTRY_POINTS>
</DASEP>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the entry_points query, the doctype DTD is "dasep.dtd".

<DASEP> (required; one only)
The appropriate doctype and root tag is DASEP.

<ENTRY_POINTS> (required; one only)
There is a single <ENTRY_POINTS> tag. It has a version number (required) in the form "N.NN". Whenever the sequence map changes, the version number should change as well.

If the entry points are not "top level", that is, if they were generated by a request for the substructure of a sequence, then the ref attribute will be present and will indicate the ID of the reference sequence.

The href (required) attribute echoes the URL query that was used to fetch the current document.

<SEGMENT> (optional; zero or more)
Each segment contains the attributes id, start, and stop (all required). The id is a unique identifier, which can be used as the reference ID in further requests to DAS. The start and stop each indicate the position of the sequence within the reference sequence. If no reference sequence is provided (i.e., the sequence is a "top level" object), then start will be 1 and stop will be the full length of the segment object. If no segments are provided for a given query, it is assumed to have no entry_points (substructure).

The body of the <SEGMENT> sections contains human-readable text (optional) for the purposes of display and selection.
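
The entry points document can be read along the following lines. This Python sketch relies on the standard library's ElementTree parser. The segment names and sizes in the sample are made up, and parse_entry_points is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the format shown above (top-level request, so no ref attribute).
SAMPLE_EP = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASEP SYSTEM "dasep.dtd">
<DASEP>
  <ENTRY_POINTS href="http://example.org/das/elegans/entry_points" version="1.00">
    <SEGMENT id="CHROMOSOME_I" start="1" stop="1500">Chromosome I</SEGMENT>
    <SEGMENT id="CHROMOSOME_II" start="1" stop="2500">Chromosome II</SEGMENT>
  </ENTRY_POINTS>
</DASEP>"""


def parse_entry_points(xml_text):
    """Return (map version, reference ID or None, list of segment tuples)."""
    root = ET.fromstring(xml_text)
    entry_points = root.find("ENTRY_POINTS")
    segments = [
        (seg.get("id"), int(seg.get("start")), int(seg.get("stop")))
        for seg in entry_points.findall("SEGMENT")
    ]
    return entry_points.get("version"), entry_points.get("ref"), segments


version, ref, segments = parse_entry_points(SAMPLE_EP)
for seg_id, start, stop in segments:
    # For top-level entry points start is 1, so stop equals the length.
    print(seg_id, stop - start + 1)
```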


The DNA Document

Scope: Sequence servers

In Response to Command: dna

Format:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "dasdna.dtd">
<DASDNA>
<SEQUENCE id="id" start="start" stop="stop" version="X.XX">
<DNA length="NNNN">
atttcttggcgtaaataagagtctcaatgagactctcagaagaaaattgataaatattat
taatgatataataataatcttgttgatccgttctatctccagacgattttcctagtctcc
agtcgattttgcgctgaaaatgggatatttaatggaattgtttttgtttttattaataaa
taggaataaatttacgaaaatcacaaaattttcaataaaaaacaccaaaaaaaagagaaa
aaatgagaaaaatcgacgaaaatcggtataaaatcaaataaaaatagaaggaaaatattc
agctcgtaaacccacacgtgcggcacggtttcgtgggcggggcgtctctgccgggaaaat
tttgcgtttaaaaactcacatataggcatccaatggattttcggattttaaaaattaata
taaaatcagggaaatttttttaaattttttcacatcgatattcggtatcaggggcaaaat
tagagtcagaaacatatatttccccacaaactctactccccctttaaacaaagcaaagag
cgatactcattgcctgtagcctctatattatgccttatgggaatgcatttgattgtttcc
gcatattgtttacaaccatttatacaacatgtgacgtagacgcactgggcggttgtaaaa
cctgacagaaagaattggtcccgtcatctactttctgattttttggaaaatatgtacaat
gtcgtccagtattctattccttctcggcgatttggccaagttattcaaacacgtataaat
aaaaatcaataaagctaggaaaatattttcagccatcacaaagtttcgtcagccttgtta
tgtcaaccactttttatacaaattatataaccagaaatactattaaataagtatttgtat
gaaacaatgaacactattataacattttcagaaaatgtagtatttaagcgaaggtagtgc
acatcaaggccgtcaaacggaaaaatttttgcaagaatca
</DNA>
</SEQUENCE>
</DASDNA>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the dna query, the doctype DTD is "dasdna.dtd".

<DASDNA> (required; one only)
The appropriate doctype and root tag is DASDNA.

<SEQUENCE> (required; one only)
There is a single <SEQUENCE> tag. It has the attributes id, which indicates the reference ID for this sequence, start and stop, which indicate the position of this segment within the reference sequence, and version, which provides the sequence map version number. All four attributes are required.

<DNA> (required; one only)
This tag surrounds the DNA data. It has the attribute length (required), which indicates the length of the DNA. The DNA is found in the body of the tag and is required. DNA should be lower-case and adhere to the IUPAC code conventions.
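
Because the DNA is wrapped over several lines, a client has to strip the whitespace before using it. The Python sketch below does so and checks the result against the length attribute. The sample is a shortened, made-up document, and parse_dna is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the format shown above.
SAMPLE_DNA = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "dasdna.dtd">
<DASDNA>
<SEQUENCE id="CHROMOSOME_I" start="1" stop="12" version="1.00">
<DNA length="12">
atttcttg
gcgt
</DNA>
</SEQUENCE>
</DASDNA>"""


def parse_dna(xml_text):
    """Return (id, start, stop, bases) with line breaks removed from the DNA."""
    root = ET.fromstring(xml_text)
    sequence = root.find("SEQUENCE")
    dna = sequence.find("DNA")
    # split() with no argument drops newlines and any other whitespace.
    bases = "".join((dna.text or "").split())
    declared = int(dna.get("length"))
    if len(bases) != declared:
        raise ValueError(f"expected {declared} bases, found {len(bases)}")
    return (sequence.get("id"), int(sequence.get("start")),
            int(sequence.get("stop")), bases)


print(parse_dna(SAMPLE_DNA))
```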

The Resolve Document

Scope: Sequence and Annotation servers

In Response to Command: resolve

Format:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASRES SYSTEM "dasres.dtd">
<DASRES>
  <RESOLVE version="X.XX">
    <SEGMENT id="id" start="start" stop="stop">
      <RELCOORD ref="id" start="start" stop="stop"/>
    </SEGMENT>
  </RESOLVE>
</DASRES>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the resolve query, the doctype DTD is "dasres.dtd".

<DASRES> (required; one only)
The appropriate doctype and root tag is DASRES.

<RESOLVE> (required; one only)
There is a single <RESOLVE> tag. Its version (required) attribute indicates the current version of the sequence map.

<SEGMENT> (required; only one)
There is one <SEGMENT> tag, providing information on the reference segment. The id, start and stop (all required) attributes echo back the coordinates that were used in the resolve query, and can be thought of as the source coordinate system.

<RELCOORD> (required; one or more)
There are one or more <RELCOORD> tags, each providing the coordinates of the enclosing segment relative to a different reference sequence. The attributes, all required, are ref, the ID of the reference sequence, plus start and stop, the start and end positions of the segment relative to the reference sequence.
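
The resolve document can be unpacked as in the following Python sketch, which assumes the reference sequence ID is carried in the ref attribute of <RELCOORD>, as in the format listing above. The IDs and coordinates in the sample are invented, and parse_resolve is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the format shown above.
SAMPLE_RES = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASRES SYSTEM "dasres.dtd">
<DASRES>
  <RESOLVE version="1.00">
    <SEGMENT id="CLONE_A" start="10" stop="20">
      <RELCOORD ref="CHROMOSOME_I" start="125010" stop="125020"/>
    </SEGMENT>
  </RESOLVE>
</DASRES>"""


def parse_resolve(xml_text):
    """Return the source segment and its coordinates on each reference sequence."""
    root = ET.fromstring(xml_text)
    segment = root.find("RESOLVE/SEGMENT")
    source = (segment.get("id"), int(segment.get("start")), int(segment.get("stop")))
    resolved = {
        rel.get("ref"): (int(rel.get("start")), int(rel.get("stop")))
        for rel in segment.findall("RELCOORD")
    }
    return source, resolved


print(parse_resolve(SAMPLE_RES))
```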

The Feature Types Document

Scope: Annotation servers

In Response to Command: types

Format:

There are two document formats. The first is a shortened form of the full features format (see below) and is used to summarize the number of annotations of each type available.

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASTYPES SYSTEM "dastypes.dtd">
<DASTYPES>
  <GFF version="1.2" source="$url">
  <SEGMENT id="id" start="start" stop="stop" version="X.XX">
     <TYPE id="id1" category="category">Type Count 1</TYPE>
     <TYPE id="id2" category="category">Type Count 2</TYPE>
     ...
  </SEGMENT>
  </GFF>
</DASTYPES>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the types query, the doctype DTD is "dastypes.dtd".

<DASTYPES> (required; one only)
The appropriate doctype and root tag is DASTYPES.

<GFF> (required; one only)
There is a single <GFF> tag. Its version (required) attribute indicates the current version of the XML form of the General Feature Format. The current version is (arbitrarily) 0.95. The source (required) attribute echoes the URL query that was used to fetch the current document.

<SEGMENT> (required; only one)
There is one <SEGMENT> tag, providing information on the reference segment. The id, start and stop attributes indicate the coordinate system of the segment. The version attribute indicates the current version of the sequence map. All four attributes are required.

<TYPE> (required; one or more per SEGMENT)
Each segment has one or more <TYPE> tags, which summarize the types of annotation available. The attributes are id (optional), which is a unique id for the annotation type and can be used to retrieve further information from the annotation server (see Linking to a Feature), and the category (optional) attribute, which provides functional grouping to related types. The tag contents (optional) is a human readable label for display purposes.
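
A client could collect the type summary as shown in this Python sketch. The sample values are made up. The tag body is returned as plain text without interpretation, and parse_types is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the format shown above.
SAMPLE_TYPES = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASTYPES SYSTEM "dastypes.dtd">
<DASTYPES>
  <GFF version="1.2" source="http://example.org/das/elegans/types?ref=CHROMOSOME_I">
  <SEGMENT id="CHROMOSOME_I" start="1" stop="1500" version="1.00">
     <TYPE id="exon" category="transcribed">12</TYPE>
     <TYPE id="intron" category="transcribed">11</TYPE>
  </SEGMENT>
  </GFF>
</DASTYPES>"""


def parse_types(xml_text):
    """Return a list of (type id, category, tag body) tuples."""
    root = ET.fromstring(xml_text)
    segment = root.find("GFF/SEGMENT")
    return [
        (t.get("id"), t.get("category"), (t.text or "").strip())
        for t in segment.findall("TYPE")
    ]


print(parse_types(SAMPLE_TYPES))
```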

The Features Document

Scope: Sequence and annotation servers

In Response to Command: features

Format:

The "full" format is used to retrieve detailed information on annotations across a segment. This information is summarized through the types query (see above).

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASGFF SYSTEM "dasgff.dtd">
<DASGFF>
  <GFF version="1.2" source="$url">
  <SEGMENT id="id" start="start" stop="stop" version="X.XX">
      <FEATURE id="id" label="label">
         <TYPE id="id" category="category" reference="yes|no">type label</TYPE>
         <METHOD id="id">method label</METHOD>
         <START>start</START>
         <END>end</END>
         <SCORE>[X.XX|-]</SCORE>
         <ORIENTATION>[0|-|+]</ORIENTATION>
         <PHASE>[0|1|2|-]</PHASE>
         <GROUP id="hash">
            <NOTE>note text</NOTE>
            <LINK href="url">link text</LINK>
            <TARGET ref="id" start="x" stop="y">target name</TARGET>
         </GROUP>
      </FEATURE>
      ...
  </SEGMENT>
  </GFF>
</DASGFF>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the features query, the doctype DTD is "dasgff.dtd".

<DASGFF> (required; one only)
The appropriate doctype and root tag is DASGFF.

<GFF> (required; one only)
There is a single <GFF> tag. Its version (required) attribute indicates the current version of the XML form of the General Feature Format. The current version is (arbitrarily) 0.95. The source (required) attribute echoes the URL query that was used to fetch the current document.

<SEGMENT> (required; only one)
There is one <SEGMENT> tag, providing information on the reference segment coordinate system. The id, start and stop attributes indicate the position of the segment. The version attribute indicates the current version of the sequence map. All four attributes are required.

<FEATURE> (required; one or more per SEGMENT)
There are one or more <FEATURE> tags per <SEGMENT>, each providing information on one annotation. The id attribute (required) is a unique identifier for the feature. It can be used as a reference point for further navigation. The label attribute (optional) is a suggested label to display for the feature. If not present, the id attribute can be used instead.

<TYPE> (required; one per FEATURE)
Each feature has just one <TYPE> field, which indicates the type of the annotation. The id attribute (optional) is a unique id for the annotation type and can be used to retrieve further information from the annotation server (see Linking to a Feature). The category attribute (optional) provides a functional grouping for related types. The reference attribute (optional, defaults to "no") applies to the reference server, whose annotations can include additional overlapping landmarks (parents, children, and neighbors); these should be marked "yes" to indicate that the feature is a structural landmark within the map, that is, a feature that can itself be annotated. The tag contents (optional) are a human-readable label for display purposes.

<METHOD> (required; one per FEATURE)
Each feature has one <METHOD> field, which names the method used to identify the feature. The id attribute (optional) can be used to retrieve further information from the server. The tag contents (optional) are a human-readable label.

<START>, <END> (required; one apiece per FEATURE)
These tags indicate the start and end of the feature in the coordinate system of the reference sequence given in the <SEGMENT> tag. The relationship between the feature start and stop positions and the segment start and stop is that the two spans are guaranteed to overlap.

<SCORE> (required; one per FEATURE)
This is a floating point number indicating the "score" of the method used to find the current feature. The number can only be understood in the context of information retrieved from the server by linking to the method. If this field is inapplicable, the contents of the tag can be replaced with a - symbol.

<ORIENTATION> (required; one per FEATURE)
This tag indicates the orientation of the feature relative to the direction of transcription. It may be 0 for features that are unrelated to transcription, + for features on the sense strand, or - for features on the antisense strand.

<PHASE> (required; one per FEATURE)
This tag indicates the position of the feature relative to the open reading frame, if any. It may be one of the integers 0, 1 or 2, corresponding to each of the three reading frames, or - if the feature is unrelated to a reading frame.

<GROUP> (optional; if present, one per FEATURE)
The <GROUP> section is an oddity, as it is derived from an overloaded field in the GFF flat file format. It provides a unique "group" ID that indicates when certain features are related to each other. The canonical example is the CDS, exons and introns of a transcribed gene, which logically belong together. The id (required) attribute provides an identifier that should be used by the client to group features together visually. Unlike other IDs in this protocol, the group ID cannot be used as a database handle to retrieve further information about the group. Such information can, however, be provided within the <GROUP> section, which may contain up to three optional tags.

<NOTE> (optional; if present, one per GROUP)
A human-readable note in plain text format.

<LINK> (optional; if present, one per GROUP)
A link to a web page somewhere that provides more information about this group. The href (required) attribute provides the URL target for the link. The link text is an optional human readable label for display purposes.

<TARGET> (optional; if present, one per GROUP)
The target sequence in a sequence similarity match. The ref attribute provides the reference ID for the target sequence, and the start and stop attributes indicate the segment that matched across the target sequence. All three attributes are required. More information on the target can be retrieved by linking back to the annotation server. See Linking to a Feature.
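
As an illustration of how a client might read this format, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document, its ids and coordinates, the example.org URL, and the function name parse_features are all invented for illustration and are not part of the protocol. The XML prolog and doctype are omitted from the sample for brevity. The sketch maps the - placeholder in <SCORE> and <PHASE> to None and falls back to the feature id when no label is given.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; every id and value here is made up.
SAMPLE = """<DASGFF>
  <GFF version="1.2" source="http://example.org/das/features">
  <SEGMENT id="CHR1" start="1000" stop="2000" version="1.0">
      <FEATURE id="exon1" label="Exon 1">
         <TYPE id="exon" category="Transcribed" reference="no">exon</TYPE>
         <METHOD id="genefinder">GeneFinder</METHOD>
         <START>1200</START>
         <END>1300</END>
         <SCORE>-</SCORE>
         <ORIENTATION>+</ORIENTATION>
         <PHASE>0</PHASE>
         <GROUP id="gene1">
               <NOTE>first exon</NOTE>
         </GROUP>
      </FEATURE>
  </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_features(xml_text):
    """Return ((segment_id, start, stop), [feature dicts]) for one document."""
    root = ET.fromstring(xml_text)          # root element is <DASGFF>
    segment = root.find("GFF/SEGMENT")      # exactly one <SEGMENT> is expected
    seg_info = (segment.get("id"),
                int(segment.get("start")),
                int(segment.get("stop")))
    features = []
    for feat in segment.findall("FEATURE"):
        type_el = feat.find("TYPE")
        group = feat.find("GROUP")           # optional
        score = feat.findtext("SCORE").strip()
        phase = feat.findtext("PHASE").strip()
        features.append({
            "id": feat.get("id"),
            # label is optional; the id can be displayed instead
            "label": feat.get("label") or feat.get("id"),
            "type": type_el.get("id"),
            "category": type_el.get("category"),
            # reference defaults to "no" when absent
            "reference": type_el.get("reference", "no") == "yes",
            "start": int(feat.findtext("START")),
            "end": int(feat.findtext("END")),
            # "-" marks an inapplicable score or phase
            "score": None if score == "-" else float(score),
            "orientation": feat.findtext("ORIENTATION").strip(),
            "phase": None if phase == "-" else int(phase),
            "group": group.get("id") if group is not None else None,
        })
    return seg_info, features


segment, features = parse_features(SAMPLE)
for f in features:
    print(f["label"], f["start"], f["end"], f["orientation"], f["group"])
```

A client would typically collect features that share the same "group" value and draw them together, since the group ID is meant for visual grouping rather than for further lookups.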

Annotation Stylesheets

Scope: Annotation servers

In Response to Command: stylesheet

This document is intended to provide hints to the annotation display client. It maps feature categories and individual types to a series of glyphs known to the display client. The complete list of glyphs and their attributes is in preparation.

Format:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASSTYLE SYSTEM "dasstyle.dtd">
<DASSTYLE>
  <STYLESHEET version="X.XX">

     <CATEGORY id="default">
         <TYPE id="default">
	   <GLYPH>
             <ID>
	       <ATTR>value</ATTR>
	       <ATTR>value</ATTR>
	       ...
             </ID>
	   </GLYPH>
         </TYPE>
     </CATEGORY>

     <CATEGORY id="category1">
         <TYPE id="default">
	   <GLYPH>
             <ID>
	       <ATTR>value</ATTR>
	       ...
             </ID>
	   </GLYPH>
         </TYPE>
         <TYPE id="type1">
	   <GLYPH>
             <ID>
	       <ATTR>value</ATTR>
	       ...
             </ID>
	   </GLYPH>
         </TYPE>
         <TYPE id="type2">
	   <GLYPH>
             <ID>
	       <ATTR>value</ATTR>
	       ...
             </ID>
	   </GLYPH>
         </TYPE>
         ...
     </CATEGORY>

     <CATEGORY id="category2">
         <TYPE id="default">
	   <GLYPH>
             <ID>
	       <ATTR>value</ATTR>
	       ...
             </ID>
	   </GLYPH>
	</TYPE>
         ...
     </CATEGORY>
     ...

</STYLESHEET>
</DASSTYLE>
<!DOCTYPE> (required; one only)
The doctype indicates which formal DTD specification to use. For the stylesheet query, the doctype DTD is "dasstyle.dtd".

<DASSTYLE> (required; one only)
The appropriate doctype and root tag is DASSTYLE.

<STYLESHEET> (required; one only)
There is a single <STYLESHEET> tag. Its version (required) attribute indicates the current version of the stylesheet, and can be used for caching purposes.

<CATEGORY> (required; one or more)
There are one or more <CATEGORY> tags, each providing information on the display of a high-level feature category. The id (required) attribute uniquely names the category. A special name is "default", which tells the annotation viewer what format to use for categories that are not otherwise specified in the stylesheet.

<TYPE> (required; one or more per CATEGORY)
There are one or more <TYPE> tags per <CATEGORY>, each providing display suggestions for one type of annotation. The id (required) attribute uniquely identifies the type. A special id is "default", which, if present, identifies a default style for the enclosing category.

<GLYPH> (required; one per TYPE)
There is a single <GLYPH> tag per <TYPE>. It provides information on what glyph (graphical widget) to use to display the indicated annotation type.

<ID> (required; one per GLYPH)
The ID value refers to a recognized glyph from the glyph types list (see below).

<ATTR> (optional; one or more per ID)
The recognized ATTR (attributes) are determined by which glyph ID is specified. See the glyph types list below for more information.

For example:

      ...
     <CATEGORY id="Similarity">
	<TYPE id="default">
	     <GLYPH>
                  <LINE>
		       <COLOR>gray</COLOR>
                  </LINE>
	     </GLYPH>
	</TYPE>
	<TYPE id="NN">
	     <GLYPH>
                  <BOX>
		       <COLOR>red</COLOR>
		       <WIDTH>4</WIDTH>
		       <OUTLINECOLOR>black</OUTLINECOLOR>	
                  </BOX>
	     </GLYPH>
	</TYPE>
	<TYPE id="NP">
	     <GLYPH>
                  <TOOMANY>
		       <COLOR>blue</COLOR>
		       <WIDTH>4</WIDTH>
		       <OUTLINECOLOR>black</OUTLINECOLOR>
                  </TOOMANY>
	     </GLYPH>
	</TYPE>
	<TYPE id="PN">
	     <GLYPH>
                  <BOX>
		       <COLOR>green</COLOR>
		       <WIDTH>3</WIDTH>
		       <OUTLINECOLOR>blue</OUTLINECOLOR>
                  </BOX>
	     </GLYPH>
	</TYPE>
	<TYPE id="PP">
	     <GLYPH>
                  <SPAN>
		       <COLOR>gray</COLOR>
		       <WIDTH>4</WIDTH>
                  </SPAN>
	     </GLYPH>
	</TYPE>
     </CATEGORY>
      ...
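
One way a display client might resolve a style from such a stylesheet is sketched below in Python. The sample stylesheet, the function names load_styles and lookup, and the fallback order (exact type, then the category's "default" type, then the "default" category) are assumptions made for illustration; the text above defines the two "default" ids but does not prescribe a lookup order. The sketch follows the worked example in treating the element under <GLYPH> as the glyph ID and its children as attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet, modeled on the example above.
SAMPLE = """<DASSTYLE>
  <STYLESHEET version="1.0">
    <CATEGORY id="default">
      <TYPE id="default">
        <GLYPH><BOX><COLOR>black</COLOR></BOX></GLYPH>
      </TYPE>
    </CATEGORY>
    <CATEGORY id="Similarity">
      <TYPE id="default">
        <GLYPH><LINE><COLOR>gray</COLOR></LINE></GLYPH>
      </TYPE>
      <TYPE id="NN">
        <GLYPH><BOX><COLOR>red</COLOR><WIDTH>4</WIDTH></BOX></GLYPH>
      </TYPE>
    </CATEGORY>
  </STYLESHEET>
</DASSTYLE>
"""


def load_styles(xml_text):
    """Return {(category_id, type_id): (glyph_id, {attribute: value})}."""
    styles = {}
    root = ET.fromstring(xml_text)
    for cat in root.iter("CATEGORY"):
        for typ in cat.findall("TYPE"):
            glyph = typ.find("GLYPH")[0]     # the single glyph ID element
            attrs = {a.tag: (a.text or "").strip() for a in glyph}
            styles[(cat.get("id"), typ.get("id"))] = (glyph.tag, attrs)
    return styles


def lookup(styles, category, type_id):
    """Resolve a style, falling back to the defaults (assumed order)."""
    for key in ((category, type_id),
                (category, "default"),
                ("default", "default")):
        if key in styles:
            return styles[key]
    return None                               # stylesheet gives no hint


styles = load_styles(SAMPLE)
print(lookup(styles, "Similarity", "NN"))
print(lookup(styles, "Similarity", "PP"))
print(lookup(styles, "Repeat", "Alu"))
```

Because the stylesheet only provides hints, a client that finds no matching entry is free to apply its own built-in style.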
      

Feature Types and Categories

This is a list of generic feature categories and specific feature types within them. This list was derived from the features currently exported by ACeDB/GFF and is not (yet) comprehensive. Suggestions for modifications, additions and deletions are welcomed.

Translated

The Translated category is used for features that relate to regions of the sequence that are translated into proteins. Features that relate to transcription are separate (see below).

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the translated feature.

Transcribed

The Transcribed category is used for features that relate to regions of the sequence that are transcribed into RNA.

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the transcription feature.

Variation

The Variation category is used for features that relate to regions of the sequence that are polymorphic.

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the variation.

Structural

The Structural category is used for features that relate to mapping, sequencing and assembly, as well as for various landmarks that carry no intrinsic biological information.

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the structural feature.

Homology

The Homology category is used for areas that are homologous to other sequences. Homology features should have a <METHOD> tag that indicates the algorithm used for the sequence comparison, and a <TARGET> tag in the <GROUP> field that indicates the target of the match.

Features:

Repeat

The Repeat category is used for areas that contain repetitive DNA. This category is used both for low-complexity regions, such as microsatellites, and for more biologically interesting features, such as transposon insertion sites.

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the repetitive element.

Experimental

The Experimental category is a catchall used to flag areas where there is interesting experimental data of one sort or another. It is intended for use with high-throughput functional genomics work, such as knockouts or insertional mutagenesis screens.

Features:

It is recommended, but not required, that the <GROUP> section contain <LINK> and/or <NOTE> tags that provide further information on the nature of the experimental data.


Glyph Types

This section describes a set of generic "glyphs" that can be used by sequence display programs to display the position of features on a sequence map. The annotation server may use these glyphs to send display suggestions to the viewer via the stylesheet document.

The current set of glyph ID values are:

Each glyph has a set of generic attributes associated with it. Attribute values come in the following flavors:

INT
An integer
FLOAT
A floating point number (not currently used)
STRING
A text string
COLOR
A color. Colors can be specified using the "#RRGGBB" format commonly used in HTML, or as one of the 16 IBM VGA colors recognized by Netscape and Internet Explorer.
BOOL
A boolean value, either "yes" or "no".
FONT
A font. Any of the font identifiers recognized by Web browsers is acceptable, e.g. "helvetica".
FONT_STYLE
One of "bold", "italic", "underline".
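
The value flavors above could be checked by a client along the following lines. This is a sketch only: the function name parse_value is hypothetical, and the set of named colors is an assumption, since the text refers to 16 colors without listing them (the 16 HTML color names are used here).

```python
import re

# Assumed list: the 16 HTML color names. The text above does not enumerate them.
NAMED_COLORS = {
    "black", "silver", "gray", "white", "maroon", "red", "purple", "fuchsia",
    "green", "lime", "olive", "yellow", "navy", "blue", "teal", "aqua",
}
FONT_STYLES = {"bold", "italic", "underline"}


def parse_value(kind, text):
    """Convert the text of a glyph attribute to a Python value, or raise ValueError."""
    text = text.strip()
    if kind == "INT":
        return int(text)
    if kind == "FLOAT":
        return float(text)
    if kind in ("STRING", "FONT"):
        return text                      # font names are not validated here
    if kind == "COLOR":
        if re.fullmatch(r"#[0-9A-Fa-f]{6}", text) or text.lower() in NAMED_COLORS:
            return text.lower()
        raise ValueError("unrecognized color: " + text)
    if kind == "BOOL":
        if text in ("yes", "no"):
            return text == "yes"
        raise ValueError("BOOL must be 'yes' or 'no': " + text)
    if kind == "FONT_STYLE":
        if text in FONT_STYLES:
            return text
        raise ValueError("unrecognized font style: " + text)
    raise ValueError("unknown value flavor: " + kind)


print(parse_value("INT", "4"))
print(parse_value("COLOR", "#FF00AA"))
print(parse_value("BOOL", "no"))
```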

BOX

A rectangular box.

Attributes:

WIDTH
type: INT
The width of the box. The width is orthogonal to the axis that defines the extent of the feature on the sequence map.
COLOR
type: COLOR
The interior color of the box.
OUTLINECOLOR
type: COLOR
The color of the box outline.
LINEWIDTH
type: INT
Width of the box outline.

TOOMANY

More features than can be shown. Recommended for use in consolidating sequence homology hits. The recommended visual presentation is a set of overlapping boxes.

Attributes:

WIDTH
type: INT
The width of the glyph. The width is orthogonal to the axis that defines the extent of the feature on the sequence map.
COLOR
type: COLOR
The interior color of the glyph.
OUTLINECOLOR
type: COLOR
The color of the glyph outline.
LINEWIDTH
type: INT
Width of the glyph outline.

ARROW

An arrow with an axis either orthogonal or parallel to the sequence map.

Attributes:

WIDTH
type: INT
The width of the arrow. The width is orthogonal to the axis that defines the extent of the feature on the sequence map.
COLOR
type: COLOR
The color of the arrow.
PARALLEL
type: BOOL
Arrows run either parallel ("yes") or orthogonal ("no") to the sequence axis.
NORTHEAST
type: BOOL
Arrow head is to the right (east) if the arrow runs parallel to the axis and is up away from the sequence axis if the arrow runs orthogonal.
SOUTHWEST
type: BOOL
Arrow head is to the left (west) if the arrow runs parallel to the axis and is down towards the sequence axis if the arrow runs orthogonal.

LINE

A line. Lines are equivalent to arrows with both the northeast and southwest attributes set to "no".

Attributes:

WIDTH
type: INT
The width of the line. The width is orthogonal to the axis that defines the extent of the feature on the sequence map.
COLOR
type: COLOR
The color of the line.
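
The interaction of the PARALLEL, NORTHEAST and SOUTHWEST attributes described for ARROW, and the equivalence of LINE to an arrow with no heads, can be summarized in a small Python sketch. The function name arrow_heads and the direction labels it returns are invented for illustration.

```python
def arrow_heads(parallel, northeast, southwest):
    """Return the directions in which an ARROW glyph has heads.

    parallel  -- True if the arrow runs along the sequence axis
    northeast -- True for a head pointing right (parallel) or up (orthogonal)
    southwest -- True for a head pointing left (parallel) or down (orthogonal)
    """
    heads = []
    if northeast:
        heads.append("east" if parallel else "up")
    if southwest:
        heads.append("west" if parallel else "down")
    return heads


print(arrow_heads(True, True, False))    # single-headed arrow along the axis
print(arrow_heads(False, True, True))    # double-headed orthogonal arrow
print(arrow_heads(True, False, False))   # no heads: drawn like a LINE
```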

CONNECTOR

The preferred graphical representation is as a "V" shaped line (commonly used to denote connections between exons).

Attributes:

WIDTH
type: INT
The width of the connector. The width is orthogonal to the axis that defines the extent of the feature on the sequence map.
COLOR
type: COLOR
The color of the connector.

TEXT

A bit of text.

Attributes:

FONT
type: FONT
The font.
FONTSIZE
type: INT
The font size.
STRING
type: STRING
The text to render.
STYLE
type: FONT_STYLE
The style in which to render this glyph. Multiple FONT_STYLE attributes may be present.
COLOR
type: COLOR
The color of the text.

EX

"X" marks the spot. Commonly used for point mutations and other point-like features.

Attributes:

WIDTH
type: INT
The height/width of the glyph.
COLOR
type: COLOR
The color of the glyph.

CROSS

A cross "+". Commonly used for point mutations and other point-like features.

Attributes:

WIDTH
type: INT
The height/width of the glyph.
COLOR
type: COLOR
The color of the glyph.

DOT

A dot. Commonly used for point mutations and other point-like features.

Attributes:

COLOR
type: COLOR
The color of the glyph.
WIDTH
type: INT
The height/width of the glyph.

TRIANGLE

A triangle. Commonly used for point mutations and other point-like features.

Attributes:

COLOR
type: COLOR
The interior color of the glyph.
WIDTH
type: INT
The height/width of the glyph.
OUTLINECOLOR
type: COLOR
The color of the glyph outline.
LINEWIDTH
type: INT
Width of the glyph outline.

SPAN

A spanning region. The recommended representation is a horizontal line with vertical lines at each end.

Attributes:

COLOR
type: COLOR
The color of the glyph.
WIDTH
type: INT
The height/width of the glyph.


Other Issues

The distributed annotation system must have a mechanism for detecting and resolving version skew across reference and annotation servers. Although one such mechanism is currently incorporated into the ACeDB-based prototype, it is largely untested and hence not yet a part of the DAS standard.


Lincoln D. Stein, lstein@cshl.org
Cold Spring Harbor Laboratory
Last modified: Mon Aug 7 16:27:44 EDT 2000
