Boulder.pm - A semantic free data interchange format


NAME
Boulder.pm - A semantic free data interchange format


SYNOPSIS

   #!/usr/local/bin/perl
   # This is an example of pulling out a series of
   # sequence records, picking primer pairs, and
   # inserting the new STS into the stream.
   
   use Boulder::Stream;
   
   $stream = new Boulder::Stream;
   
   while ($stone = $stream->read_record('SEQUENCE')) {
   
      $primers = &pick_primers($stone->SEQUENCE);
      if ($primers) {
         $stone->insert(PRIMER_OUTCOME=>1,STS=>$primers);
      } else {
         $stone->insert(PRIMER_OUTCOME=>0);
      }
   
   } continue {
      $stream->write_record($stone);
   }
   
   
   sub pick_primers {
      my $sequence = shift;
      return undef unless ($lstart,$llen,$rstart,$rlen) =
        &call_primer_picking_algorithm($sequence);
      # Create and return a new Stone to drop into the
      # stream.
      return new Stone(LEFT_START=>$lstart,
                       LEFT_LENGTH=>$llen,
                       RIGHT_START=>$rstart,
                       RIGHT_LENGTH=>$rlen);
   }


DESCRIPTION


Boulder IO

Boulder IO is a simple TAG=VALUE data format designed for sharing data between programs connected via a pipe. It is also simple enough to use as a common data exchange format between databases, Web pages, and other data representations.

The basic data format is very simple. It consists of a series of TAG=VALUE pairs separated by newlines. It is record-oriented. The end of a record is indicated by an empty delimiter alone on a line. The delimiter is ``='' by default, but can be adjusted by the user.

An example boulder stream looks like this:

        NAME=AFM036YB2
        SEQUENCE=GATCCGAGATACAGAGGATCCCACACACACACACAC....
        LEFT_START=0
        LEFT_LENGTH=14
        RIGHT_START=140
        RIGHT_LENGTH=20
        =
        NAME=MR1239
        SEQUENCE=GGGATCTTTCTCCACCTCCAGCGCACAGAGCAGGGGG...
        LEFT_START=2
        LEFT_LENGTH=20
        RIGHT_START=130
        RIGHT_LENGTH=18
        ALIAS=WI-1000
        ALIAS=XB2023
        =

Notes:

(1)
There is no need for all tags to appear in all records, or indeed for the records to be homogeneous.

(2)
Multiple values are allowed, as with the ALIAS tag in the second record.

(3)
Lines can be any length, as in a potential 40 K sequence entry.

(4)
Tags may consist of any alphanumeric characters (upper or lower case) and may contain embedded spaces. By convention we use only the characters A-Z0-9_, because such tags can be used without quoting as keys in Perl associative arrays, but this is merely stylistic. Values may contain any character at all except the reserved characters {}=% and newline. You can incorporate binary data into the data stream by escaping these characters in the URL manner, using a % sign followed by the (capitalized) hexadecimal code for the character. The module does this escaping automatically.
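The escaping rule in note (4) can be sketched in a few lines of plain Perl. The helper names escape_value and unescape_value are hypothetical and are not the module's own routines (the module escapes values for you); the sketch assumes only the reserved set {}=% plus newline stated above.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the %XX escaping described above.
# They are not part of the Boulder distribution.
sub escape_value {
    my ($value) = @_;
    # Replace each reserved character with % and its uppercase hex code.
    $value =~ s/([{}=%\n])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

sub unescape_value {
    my ($value) = @_;
    # Turn every %XX sequence back into the character it encodes.
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $value;
}

my $raw     = "score=90%\n{done}";
my $escaped = escape_value($raw);
print "$escaped\n";
print unescape_value($escaped) eq $raw ? "round trip ok\n" : "mismatch\n";
```

Because % is itself escaped, unescaping an escaped value always returns the original string.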


Extended Boulder Format

The simple boulder format has been extended to accommodate nested relations and other interesting structures. The new format allows nested records to be created in this way:

   NAME=AFM036YB2
   SEQUENCE=GATCCGAGATACCCCACACACACACACAC....
   STS={
      LEFT_START=0
      LEFT_LENGTH=14
      RIGHT_START=140
      RIGHT_LENGTH=20
   }
   =
   NAME=MR1239
   SEQUENCE=GGGATCTTTCTCCACCTTGGAGAGCAGGGGG...
   STS={
      LEFT_START=2
      LEFT_LENGTH=20
      RIGHT_START=130
      RIGHT_LENGTH=18
   }
   ALIAS=WI-1000
   ALIAS=XB2023
   =

As in the original format, tags may be multivalued. For example, there might be several STS records assigned to a sequence. Each subrecord may contain further subrecords.
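To make the nesting rules concrete, here is a minimal sketch of a recursive reader written in plain Perl. The function parse_record is hypothetical and is not the parser used by Boulder::Stream; it assumes the default "=" delimiter and "{" / "}" subrecord markers, stores every tag as a list of values, and leaves out the %XX unescaping step for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one boulder record from a list of lines into
# a hash that maps each tag to an array ref of values. Nested records
# become nested hashes of the same shape.
sub parse_record {
    my ($lines) = @_;    # array ref; lines are consumed as they are read
    my %record;
    while (defined(my $line = shift @$lines)) {
        $line =~ s/^\s+|\s+$//g;                 # trim indentation
        next if $line eq '';
        last if $line eq '=' || $line eq '}';    # end of record or subrecord
        my ($tag, $value) = split /=/, $line, 2;
        next unless defined $value;              # skip lines without TAG=VALUE
        $value = parse_record($lines) if $value eq '{';
        push @{ $record{$tag} }, $value;         # tags may be multivalued
    }
    return \%record;
}

my @lines = split /\n/, <<'END';
NAME=MR1239
STS={
   LEFT_START=2
   LEFT_LENGTH=20
}
ALIAS=WI-1000
ALIAS=XB2023
=
END

my $record = parse_record(\@lines);
print $record->{NAME}[0], "\n";
print $record->{STS}[0]{LEFT_START}[0], "\n";
print scalar @{ $record->{ALIAS} }, " aliases\n";
```

Storing every tag as a list keeps single-valued and multivalued tags uniform, at the cost of an extra [0] when reading a single value.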


Pass through behavior

A major attribute of the boulderio protocol is the convention that programs reading boulderio streams should only remove from the stream those tags that they're interested in. Any unrecognized tags are passed through to other programs that might be interested. In fact, most programs will want to put the tags back into the boulder stream once they're finished, potentially adding their own. Of course some programs will want to behave differently. For example, a database query program will generate but not read a boulderio stream, while a report generator will read but not write the stream.

This convention allows the following type of pipe to be set up:

  query_database | find_vector | find_dups | \
    blast_sequence | pick_primer | mail_report

If all the programs in the pipe follow the conventions, then it will be possible to interpose other programs, such as a repetitive element finder, in the middle of the pipe without disturbing other components.
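The pass-through convention can be illustrated without the Boulder modules at all. The sketch below is a hypothetical filter for the simple (non-nested) format: it looks only at the SEQUENCE tag, adds an invented SEQ_LENGTH tag just before each record delimiter, and copies every other line through untouched. The function name and the SEQ_LENGTH tag are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical pass-through filter for the flat TAG=VALUE format.
# Unrecognized tags are written back out exactly as they were read.
sub add_length_tags {
    my ($in, $out) = @_;
    my $seq_len;
    while (my $line = <$in>) {
        chomp $line;
        $seq_len = length $1 if $line =~ /^SEQUENCE=(.*)$/;
        if ($line eq '=') {
            # End of record: append our own tag before the delimiter.
            print {$out} "SEQ_LENGTH=$seq_len\n" if defined $seq_len;
            undef $seq_len;
        }
        print {$out} "$line\n";
    }
    return;
}

# Demonstrate with in-memory filehandles; a real filter would pass
# \*STDIN and \*STDOUT so that it can sit in the middle of a pipe.
my $input  = "NAME=AFM036YB2\nSEQUENCE=GATCCGAG\nCOLOR=blue\n=\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";
add_length_tags($in, $out);
close $out;
print $output;
```

The COLOR tag means nothing to this filter, yet it reaches the next program in the pipe unchanged.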


SKELETON BOULDER PROGRAM

Here is a skeleton example.

   #!/usr/local/bin/perl
   # This is an example of pulling out a series of
   # sequence records, picking primer pairs, and
   # inserting the new STS into the stream.
   
   use Boulder::Stream;
   
   $stream = new Boulder::Stream;
   
   while ($stone = $stream->read_record('SEQUENCE')) {
   
      $primers = &pick_primers($stone->SEQUENCE);
      if ($primers) {
         $stone->insert(PRIMER_OUTCOME=>1,STS=>$primers);
      } else {
         $stone->insert(PRIMER_OUTCOME=>0);
      }
   
   } continue {
      $stream->write_record($stone);
   }
   
   
   sub pick_primers {
      my $sequence = shift;
      return undef unless ($lstart,$llen,$rstart,$rlen) =
        &call_primer_picking_algorithm($sequence);
      # Create and return a new Stone to drop into the
      # stream.
      return new Stone(LEFT_START=>$lstart,
                       LEFT_LENGTH=>$llen,
                       RIGHT_START=>$rstart,
                       RIGHT_LENGTH=>$rlen);
   }

As the example shows, the code starts by creating a Boulder::Stream object to handle the I/O. It next enters a read/write loop in which the tags the program is interested in are returned one record after another. In the body of the loop we create a subrecord (a so-called Stone object) containing a newly-chosen STS primer pair. This is then added to the record with an STS tag, along with a PRIMER_OUTCOME tag indicating whether primer picking was successful.

A full explanation of the objects and methods follows.


CLASSES

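There are four classes defined in Boulder: Stone, Boulder::Cursor, Boulder::Stream, and Boulder::Store. Their methods are described in the sections that follow.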


STONE METHODS


Stone::new()

This is the constructor for the Stone class. It can be called without any parameters, in which case it creates an empty Stone, or passed an associative array in order to initialize it with a set of keys. Examples:

        $myStone = new Stone;
        $myStone = new Stone(name=>'fred',age=>30);


Stone::insert(%hash)

This is the main method for adding tags to a Stone. This method expects an associative array as an argument. The contents of the associative array will be inserted into the Stone. If a particular tag is already present in the Stone, the new value will be appended to the list of values for that tag. Several types of values are legal; the examples in this document use scalar values, array references (for multivalued tags), hash references (for nested records), and Stone objects (inserted as subrecords).


Stone::replace(%hash)

The replace() method behaves exactly like insert() with the exception that if the indicated key already exists in the Stone, its value will be replaced. Use replace() when you want to enforce a single-valued tag/value relationship.
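The difference between appending and replacing can be sketched with an ordinary hash of arrays. The functions insert_values and replace_values below are hypothetical stand-ins written for this illustration; they only mimic the documented behavior of insert() and replace() and say nothing about how Stone stores its data internally.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the append and replace semantics described
# above, operating on a plain hash that maps tags to lists of values.
sub insert_values {
    my ($record, $tag, @values) = @_;
    push @{ $record->{$tag} }, @values;    # keep existing values, add new ones
    return $record;
}

sub replace_values {
    my ($record, $tag, @values) = @_;
    $record->{$tag} = [@values];           # discard whatever was there before
    return $record;
}

my %record;
insert_values(\%record, ALIAS => 'WI-1000');
insert_values(\%record, ALIAS => 'XB2023');
print "after insert:  @{ $record{ALIAS} }\n";

replace_values(\%record, ALIAS => 'D1S123');
print "after replace: @{ $record{ALIAS} }\n";
```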


Stone::insert_list($key,@list), Stone::insert_hash($key,%hash), Stone::replace_list($key,@list), Stone::replace_hash($key,%hash)

These are primitives used by the insert() and replace() methods. Override them if you need to modify the default behavior.


Stone::delete($key)

This removes the indicated key from the Stone.


Stone::get($tag,$index)

This returns the value at the indicated tag and optional index. What you get depends on whether it is called in a scalar or list context. In a list context, you will receive all the values for that tag. You may receive a list of scalar values or (for a nested record) a list of Stone objects. If called in a scalar context, you will receive either the first or the last member of the list of values assigned to the tag. Which one you receive depends on the value of the package variable $Stone::Fetchlast. If undefined, you will receive the first member of the list. If nonzero, you will receive the last member.

You may provide an optional index in order to force get() to return a particular member of the list. Provide a 0 to return the first member of the list, or '#' to obtain the last member.

If the tag contains a period (.), get() will call Stone::index() on your behalf (see below).
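The context-dependent behavior described above relies on Perl's wantarray function. The following sketch reproduces those rules on a plain hash of arrays; get_value and the package variable $Fetchlast are hypothetical names chosen to mirror get() and $Stone::Fetchlast, and tag paths containing a period are not handled.

```perl
use strict;
use warnings;

our $Fetchlast;    # hypothetical stand-in for $Stone::Fetchlast

# Hypothetical helper mimicking the documented scalar/list rules of get().
sub get_value {
    my ($record, $tag, $index) = @_;
    my @values = @{ $record->{$tag} || [] };
    if (defined $index) {
        # An explicit index wins: '#' means the last member of the list.
        return $index eq '#' ? $values[-1] : $values[$index];
    }
    return @values if wantarray;                        # list context: everything
    return $Fetchlast ? $values[-1] : $values[0];       # scalar context: one value
}

my %record = (ALIAS => ['WI-1000', 'XB2023']);

my @all   = get_value(\%record, 'ALIAS');
my $first = get_value(\%record, 'ALIAS');
my $last  = get_value(\%record, 'ALIAS', '#');
print scalar(@all), " values, first=$first, last=$last\n";
```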


Stone::search($tag)

Searches for the first occurrence of the tag, traversing the tree in a breadth-first manner, and returns it. This allows you to retrieve the value of a tag in a deeply nested structure without worrying about all the intermediate nodes. For example:

   NAME=AFM036YB2
   SEQUENCE=GATCCGAGATACCCCACACACACACACACACACACAC
   STS={
      LEFT_START=0
      LEFT_LENGTH=14
      RIGHT_START=140
      RIGHT_LENGTH=20
   }

   $left_start = $stone->search('LEFT_START');

The disadvantage of this is that if there is a tag named LEFT_START higher in the hierarchy, this tag will be retrieved rather than the lower one. In an array context this method returns the complete list of values from the matching tag. In a scalar context, it returns either the first or the last value of multivalued tags depending as usual on the value of $Stone::Fetchlast.

$Stone::Fetchlast is also consulted during the tree traversal. If Fetchlast is set to a true value, multivalued intermediate tags will be searched from the last to the first rather than the first to the last.

The Stone object has an AUTOLOAD method that invokes search() when you call a method that is not predefined. This allows a very convenient type of shortcut:

  $name        = $stone->NAME;
  $left_start  = $stone->LEFT_START;
  $right_start = $stone->STS->RIGHT_START;

In the first example, we retrieve the value of the top-level tag NAME. In the second example, we retrieve the value of the LEFT_START tag, using search() to find it even though it is a nested tag. In the third example, we retrieve the STS stone first, then the RIGHT_START value. This is nominally faster than the second example, because no tree traversal is involved.
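For readers unfamiliar with AUTOLOAD, the mechanism behind this shortcut can be sketched with a tiny class. TinyRecord is a hypothetical class invented for this example, not part of Boulder; unlike Stone, its AUTOLOAD performs only a direct lookup of a top-level tag and does no searching of nested records.

```perl
use strict;
use warnings;

package TinyRecord;    # hypothetical class, for illustration only

our $AUTOLOAD;

sub new {
    my ($class, %tags) = @_;
    return bless {%tags}, $class;
}

# Called by Perl whenever an undefined method is invoked on the object.
sub AUTOLOAD {
    my $self = shift;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;          # strip the package name, leaving the tag
    return $self->{$name};
}

sub DESTROY { }                 # keep object destruction out of AUTOLOAD

package main;

my $record = TinyRecord->new(
    NAME => 'AFM036YB2',
    STS  => TinyRecord->new(LEFT_START => 0, RIGHT_START => 140),
);

print $record->NAME, "\n";
print $record->STS->RIGHT_START, "\n";
```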


Stone::index($indexstr)

You can access the contents of even deeply-nested Stone objects with the index method. You provide a tag path, and receive a value or list of values back.

Tag paths look like this:

        tag1[index1].tag2[index2].tag3[index3]

Numbers in square brackets indicate which member of a multivalued tag you're interested in getting. You can leave the square brackets out in order to return just the first or the last tag of that name, in a scalar context (depending on the setting of $Stone::Fetchlast). In an array context, leaving the square brackets out will return all multivalued members for each tag along the path.

You will get a scalar value in a scalar context and an array value in an array context, following the same rules as get(). You can provide an index of [#] in order to get the last member of a list, or [?] to obtain a randomly chosen member of the list (this uses the rand() call, so be sure to call srand() at the beginning of your program in order to get different sequences of pseudorandom numbers). If there is no tag by that name, you will receive undef or an empty list. If the tag points to a subrecord, you will receive a Stone object.

Examples:

        # Here's what the data structure looks like.
        $s->insert(person=>{name=>'Fred',
                            age=>30,
                            pets=>['Fido','Rex','Lassie'],
                            children=>['Tom','Mary']},
                   person=>{name=>'Harry',
                            age=>23,
                            pets=>['Rover','Spot']});

        # Return all of Fred's children
        @children = $s->index('person[0].children');

        # Return Harry's last pet
        $pet = $s->index('person[1].pets[#]');

        # Return first person's first child
        $child = $s->index('person.children');

        # Return children of all persons
        @children = $s->index('person.children');

        # Return last person's last pet
        $Stone::Fetchlast++;
        $pet = $s->index('person.pets');

        # Return any pet from any person
        $pet = $s->index('person[?].pets[?]');

Note that index() may return a Stone object if the tag path points to a subrecord.
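As a rough illustration of how a tag path is resolved, the sketch below walks a path through a plain hash of arrays. The function lookup is hypothetical and covers only part of what index() is documented to do: scalar results, numeric indexes, and [#]. It ignores [?], list context, and $Stone::Fetchlast.

```perl
use strict;
use warnings;

# Hypothetical resolver for a subset of the tag path syntax. Each tag in
# the data maps to an array ref of values; subrecords are nested hashes.
sub lookup {
    my ($node, $path) = @_;
    for my $part (split /\./, $path) {
        my ($tag, $index) = $part =~ /^([^\[\]]+)(?:\[(\d+|#)\])?$/
            or die "malformed path component: $part\n";
        return undef unless ref($node) eq 'HASH';
        my $values = $node->{$tag} or return undef;
        $index = 0         unless defined $index;    # no brackets: first member
        $index = $#$values if $index eq '#';         # [#]: last member
        $node  = $values->[$index];
    }
    return $node;
}

my $data = {
    person => [
        { name => ['Fred'],  pets => ['Fido', 'Rex', 'Lassie'],
          children => ['Tom', 'Mary'] },
        { name => ['Harry'], pets => ['Rover', 'Spot'] },
    ],
};

print lookup($data, 'person[1].pets[#]'), "\n";    # Harry's last pet
print lookup($data, 'person.children'), "\n";      # first person's first child
```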


Stone::at($tag)

This returns an ARRAY REFERENCE for the tag. It is useful to prevent automatic dereferencing. Use with care. It is equivalent to:

        $stone->{'tag'}


Stone::tags()

Return all the tags in the Stone.


Stone::dump()

This is a debugging tool. Iterates through the Stone object and prints out all the tags and values.

Example:

        $s->dump;
        
        person[0].children[0]=Tom
        person[0].children[1]=Mary
        person[0].name[0]=Fred
        person[0].pets[0]=Fido
        person[0].pets[1]=Rex
        person[0].pets[2]=Lassie
        person[0].age[0]=30
        person[1].name[0]=Harry
        person[1].pets[0]=Rover
        person[1].pets[1]=Spot
        person[1].age[0]=23


Stone::cursor()

Return an iterator over the Stone object. You can call this several times in order to return independent iterators. See below.


Boulder::Cursor METHODS


Boulder::Cursor::new($stone)

Return a new Boulder::Cursor over the specified Stone object. This will return an error unless the object is a Stone or a descendant.


Boulder::Cursor::each()

Iterate over the attached Stone. Each iteration will return a two-valued list consisting of a tag path and a value. The tag path is of a form that can be used with Stone::index() (in fact, a cursor is used internally to implement the Stone::dump() method). When the end of the Stone is reached, each() will return an empty list, after which it will start over again from the beginning. If you attempt to insert or delete from the stone while iterating over it, all attached cursors will reset to the beginning.

For example:

        $cursor = $s->cursor;
        while (($key,$value) = $cursor->each) {
           print "$value: BOW WOW!\n" if $key=~/pet/;           
        }
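The tag path and value pairs that a cursor yields can be pictured as a flattening of the nested record. The sketch below produces such pairs from a plain hash of arrays; flatten is a hypothetical function, it returns all pairs at once rather than one per call, and it sorts tags alphabetically, which the real cursor is not documented to do.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a nested hash of arrays and return a list of
# [tag_path, value] pairs, with paths in the tag[index] form used above.
sub flatten {
    my ($node, $prefix) = @_;
    my @pairs;
    for my $tag (sort keys %$node) {
        my $values = $node->{$tag};
        for my $i (0 .. $#$values) {
            my $path  = (defined $prefix ? "$prefix." : '') . $tag . "[$i]";
            my $value = $values->[$i];
            # Recurse into subrecords; record leaf values as they are.
            push @pairs, ref($value) eq 'HASH'
                ? flatten($value, $path)
                : [$path, $value];
        }
    }
    return @pairs;
}

my $data = {
    person => [
        { name => ['Fred'], pets => ['Fido', 'Rex'] },
    ],
};

for my $pair (flatten($data)) {
    my ($key, $value) = @$pair;
    print "$key=$value\n";
    print "$value: BOW WOW!\n" if $key =~ /pet/;
}
```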


Boulder::Cursor::reset()

This resets the cursor back to the beginning of the associated Stone.


Boulder::Stream METHODS


Boulder::Stream::new(IN,OUT)

The new() method creates a new Boulder::Stream object. You can provide input and output filehandles. If you leave one or both undefined, new() will default to standard input or standard output. You are free to use files, pipes, sockets, and other types of file handles.

Version 1.05 (and higher) handles typeglobs and typeglob references. Unfortunately earlier versions of this library did not.


Boulder::Stream::read_record(@taglist)

Every time read_record() is called, it will return a new Stone object. The Stone will be created from the input stream, using just the tags provided in the argument list. Pass no tags to receive whatever tags are present in the input stream.

If none of the tags that you specify are in the current boulder record, you will receive an empty Stone. At the end of the input stream, you will receive undef.

If called in an array context, read_record() returns a list of all stones from the input stream that contain one or more of the specified tags.


Boulder::Stream::get(@taglist)

Identical to read_record(), but the name is shorter.


Boulder::Stream::write_record($stone)

Write a Stone to the output filehandle.


Boulder::Stream::put($stone)

Identical to write_record(), but the name is shorter.


Useful State Variables in a Boulder::Stream

Every Boulder::Stream has several state variables that you can adjust. Set them in this fashion:

        $a = new Boulder::Stream;
        $a->{delim}=':';
        $a->{record_start}='[';
        $a->{record_end}=']';
        $a->{passthru}=undef;


Boulder::Store METHODS


Boulder::Store::new("database/path",writable)

The new() method creates a new Boulder::Store object and associates it with the database file provided in the first parameter (undef is a valid pathname, in which case all methods work but the data isn't stored). The second parameter should be a true value if you want to open the database for writing. Otherwise it's opened read only.

Because the underlying storage implementation is not multi-user, only one process can have the database open for writing at a time. A fcntl()-based locking mechanism is used to give a process that has the database opened for writing exclusive access to the database. This also prevents the database from being opened for reading while another process is writing to it (this is a good thing). Multiple simultaneous processes can open the database read only.

Physically the data is stored in a human-readable file with the extension ``.data''.


Boulder::Store::read_record(@taglist)

The semantics of this call are exactly the same as in Boulder::Stream. Stones are returned in sequential order, starting with the first record. In addition to their built-in tags, each stone returned from this call has an additional tag called ``record_no''. This is the zero-based record number of the stone in the database. Use the reset() method to begin iterating from the beginning of the database.

If called in an array context, read_record() returns a list of all stones in the database that contain one or more of the provided tags.


Boulder::Store::write_record($stone [,$index])

This has the same semantics as Boulder::Stream. A stone is appended to the end of the database. If successful, this call returns the record number of the new entry. By providing an optional second parameter, you can control where the stone is entered. A positive numeric index will write the stone into the database at that position. A value of -1 will use the Stone's internal record number (if present) to determine where to place it.


Boulder::Store::get($record_no)

This is random access to the database. Provide a record number and this call will return the stone stored at that position.


Boulder::Store::put($stone,$record_no)

This is a random write to the database. Provide a record number and this call stores the stone at the indicated position, replacing whatever was there before.

If no record number is provided, this call will look for the presence of a 'record_no' tag in the stone itself and put it back in that position. This allows you to pull a stone out of the database, modify it, and then put it back in without worrying about its record number.

The record number of the inserted stone is returned from this call.


Boulder::Store::delete($stone),Boulder::Store::delete($record_no)

These method calls delete a stone from the database. You can provide either the record number or a stone containing the 'record_no' tag. Warning: if the database is heavily indexed, deletes can be time-consuming, because the index must be brought back into sync.


Boulder::Store::length()

This returns the length of the database, in records.


Boulder::Store::reset()

This resets the database, nullifying any queries in effect, and causing read_record() to begin fetching stones from the first record.


Boulder::Store::query(%query_array)

This creates a query on the database used for selecting stones in read_record(). The query is an associative array. Three types of key/value pairs are allowed:

  1. $index=>$value

    This instructs Boulder::Store to look for stones containing the specified tags in which the tag's value (determined by the Stone index() method) exactly matches the provided value. Example:

            $db->query('STS.left_primer.length'=>30);

    Only the non-bracketed forms of the index string are allowed (this is probably a bug...)

    If the tag path was declared to be an index, then this search will be fast. Otherwise Boulder::Store must iterate over every record in the database.

  2. EVAL=>'expression'

    This instructs Boulder::Store to look for stones in which the provided expression evaluates to true. When the expression is evaluated, the variable $s will be set to the current record's stone. As a shortcut, you can use ``<index.string>'' as shorthand for ``$s->index('index.string')''.

  3. EVAL=>['expression1','expression2','expression3'...]

    This lets you provide a whole bunch of expressions, and is exactly equivalent to EVAL=>'(expression1) && (expression2) && (expression3)'.

You can mix query types in the parameter provided to query(). For example, here's how to look up all stones in which the sex is male and the age is greater than 30:

        $db->query('sex'=>'M',EVAL=>'<age> > 30');

When a query is in effect, read_record() returns only Stones that satisfy the query. In an array context, read_record() returns a list of all Stones that satisfy the query. When no more satisfactory Stones are found, read_record() returns undef until a new query is entered or reset() is called.
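The way exact-match keys and EVAL expressions combine can be sketched against flat records held in ordinary hashes. The function matches_query is hypothetical and is not how Boulder::Store evaluates queries internally; in particular it expands ``<tag>'' into a plain hash lookup rather than a call to index(), and it uses string eval, which should never be fed untrusted input.

```perl
use strict;
use warnings;

# Hypothetical matcher: every exact-match key must agree, and every EVAL
# expression must evaluate to true, for the record $s to be selected.
sub matches_query {
    my ($s, %query) = @_;
    my $expressions = delete $query{EVAL};

    for my $tag (keys %query) {
        return 0 unless defined($s->{$tag}) && $s->{$tag} eq $query{$tag};
    }
    return 1 unless defined $expressions;

    for my $expr (ref($expressions) ? @$expressions : ($expressions)) {
        # Expand the <tag> shorthand into a lookup on the current record.
        (my $code = $expr) =~ s/<([\w.]+)>/\$s->{'$1'}/g;
        return 0 unless eval $code;
    }
    return 1;
}

my @records = (
    { name => 'Fred',  sex => 'M', age => 45 },
    { name => 'Harry', sex => 'M', age => 23 },
    { name => 'Mary',  sex => 'F', age => 50 },
);

my @hits = grep { matches_query($_, sex => 'M', EVAL => '<age> > 30') } @records;
print "$_->{name}\n" for @hits;
```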


Boulder::Store::add_index(@indices)

Declare one or more tag paths to be a part of a fast index. read_record() will take advantage of this index when processing queries. For example:

        $db->add_index('age','sex','person.pets');

You can add indexes any time you like, when the database is first created or later. There is a trade off: write_record(), put(), and other data-modifying calls will become slower as more indexes are added.

The index is stored in an external file with the extension ``.index''. An index file is created even if you haven't indexed any tags.
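The trade-off mentioned above is the usual one for an inverted index, which can be sketched in memory as a map from tag values to record numbers. The function build_index is hypothetical, and the on-disk layout of the ``.index'' file is not described in this document, so nothing here should be read as its actual format.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: for one tag, map each value to the list of
# record numbers that carry it. Lookups avoid scanning every record, but
# the map has to be updated whenever a record is written or deleted.
sub build_index {
    my ($records, $tag) = @_;
    my %index;
    for my $record_no (0 .. $#$records) {
        my $value = $records->[$record_no]{$tag};
        push @{ $index{$value} }, $record_no if defined $value;
    }
    return \%index;
}

my @records = (
    { name => 'Fred',  sex => 'M' },
    { name => 'Mary',  sex => 'F' },
    { name => 'Harry', sex => 'M' },
);

my $by_sex = build_index(\@records, 'sex');
print "record numbers with sex=M: @{ $by_sex->{M} }\n";
```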


Boulder::Store::reindex_all()

Call this if the index gets screwed up (or lost). It rebuilds it from scratch.


BUGS

Because the delim, record_start and record_end characters in the Boulder::Stream object are used in optimized (once-compiled) pattern matching, you cannot change these values once read_record() has been called. To change the defaults, you must create the Boulder::Stream, set the characters, and only then begin reading from the input stream. For the same reason, different Boulder::Stream objects cannot use different delimiters.
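The effect of a once-compiled pattern can be reproduced with Perl's /o modifier, which interpolates variables into a pattern only the first time the match is executed. The sketch below assumes that this is the kind of optimization meant above; is_record_end is a hypothetical function and is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper. Because of /o, $delim is interpolated into the
# pattern on the first call only; later calls reuse that compiled pattern
# no matter what delimiter they are given.
sub is_record_end {
    my ($line, $delim) = @_;
    return $line =~ /^\Q$delim\E$/o ? 1 : 0;
}

print is_record_end('=', '='), "\n";    # first call fixes the pattern to "="
print is_record_end(':', ':'), "\n";    # ":" is ignored; still matching "="
```

The second call fails to match even though the line and the delimiter agree, which is why the delimiters must be set before the first read.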


AUTHOR

Lincoln D. Stein, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY. This module can be used and distributed on the same terms as Perl itself.


SEE ALSO

perl,perlbot,perltoot