Databases 1 & 2

Jason Stajich and Aaron Mackey

Genome Informatics

Suggested Reading

  1. The MySQL on-line tutorial, http://www.mysql.com/documentation/mysql/bychapter/manual_Tutorial.html#Tutorial
  2. SAMS Teach Yourself SQL in 10 Minutes, 2ed, Ben Forta, SAMS Publishing, Indianapolis, 2001. To-the-point and accurate. Please let me know what you think. You can get scripts and data for the examples from http://www.forta.com/books/ .
  3. Perl Cookbook, Tom Christiansen and Nathan Torkington, O'Reilly, 1998, section 14.10, "Executing an SQL Command Using DBI and DBD". (Talking to a relational database from perl.)

Lecture Notes

Goal: 
  1. Provide a high level overview of databases and database management systems and introduce the main concepts and specialized vocabulary used in the field.
  2. Provide practical experience querying and updating relational databases.

"Database" versus "Database Management System"

Databases are collections of data; computerized databases are collections of data on computers. Computerized databases can be stored in one or more files, or they can be managed by a software system called a database management system (DBMS).

To use a database effectively one needs ways to:

"Flat" File Databases

Key terminology: All programs (and programmers) that use the database files need to understand the organization and conventions of the database. (What would happen if a record without the "\n//" got into Genbank?)

For example, records might be of the form

last name\tfirst name\tphone number
(terminated by a newline, and with fields delimited by "\t"s). All programs have to understand that the last name field comes before the first name field. If the file were maintained sorted by last name, lookup programs could to a "binary search" to find a particular last name, but all update programs need to cooperate to keep the file sorted by last name (e.g. when a new record is added or when a last name is changed).

Indexed File Databases

Main difference from "Flat" File Databases: indexed look up by key (just like a perl hash, but on disk). Main advantage: speed. For example we might want to look up one entry in Genbank out of 7,077,491 entries (release 118).

Suppose that retrieving each entry as part of sequential scan requires 0.1 millisecond per entry on average. Then looking up one entry by accession would take 708 seconds!

Indexed files use special data structures (roughly similar to perl hashes) to allow you to look up one entry in a millisecond or so.

It is easy to work with indexed file databases in perl using the "DBM" family of perl modules. See Perl Cookbook, 14.1 to 14.6. If you decide use this technology, Section 14.5, "Locking DBM Files" is very important.

DBMS Based Databases

All programs that deal with the database do so through intermediary software (usually running as a server or set of server processes), called the database management system (DBMS).

The DBMS offers many services including:

All this support comes at a price: organizations must manage an additional, complex software system, the DBMS, and the performance of DBMS based databases is often worse than file databases. Individuals that manage and administrate DBMSs are called database administrators (DBAs), though in reality the job of database administration is often distributed among several people.

DBMS "Flavors"

  1. Relational
  2. Object Relational
  3. Extensible Relational or Extensible Object Relational
  4. "Data Warehouses"/On Line Analytical Processing
  5. Object Oriented

Relational

This DBMS flavor accounts the vast majority of "serious" DBMSs. Examples include:
  1. Oracle
  2. Sybase
  3. Microsoft SQL Server
  4. DB2 (IBM)
  5. Informix
  6. MySQL (still missing some "standard" features).
  7. Postgres
The basic world view of relational databases is that all data is stored in tables (a technical term in this context). A table store multiple rows of data, each row has multiple, fixed columns. ("Rows" and "columns" are also technical terms in the relational world.) Rows are are roughly analogous to file records, and columns to fields, and in fact these terms are often used instead. A table is roughly analogous to a file (but do not call it a file when talking about a relational database.)

Relational databases offer a query language SQL (sometimes pronounced "sequel").

They usually also offer stored procedures, allowing database designers the ability to move some common code out of application programs and into the database.

Object Relational

In relational databases, the value of each column must be what in perl we call a "scalar": lists, arrays, other complex objects are forbidden. This restriction is the first normal form (1NF) restriction. Object relational database relax this restriction, and allow arbitrary values in columns, including lists or tables. They also allow definition of objects (basically rows with methods), and allow some forms of extension (see next).

The "object-relational" query language, SQL3, supports these additional constructs and is backward compatible with SQL.

Oracle version 8 is the most visible example of an object relational DBMS. Adoption of the object-oriented features of Oracle 8 has been slow, possibly because some user interface tools have not been able work with the object oriented features.

PostgreSQL is an open source object-relational DBMS (http://www.us.postgresql.org)

Extensible Relational or Extensible Object Relational

Another form adding objects to relational DBMS is to create new object types in C++ or Java, and have the DBMS manage them (including invoking methods while processing queries). Object relational database usually support this kind of of extensibility, as do Sybase and DB2.

Object Oriented

Object oriented database management systems are rather different from object relational databases. The unifying design theme is that the database stores the objects in the "same" form as they are used within an object oriented programming language (currently C++ or Java). (This idea is a bit like a complete generalization of the perl 'tie' function and the DBM family, which we cover in a separate lecture.) For example, one can store references to hashes mapping strings to array references merely by assigning them to an already persistent variable (one that is in the database). These databases occupy a niche market. One of their best use is as "embedded" databases that allow programmers to easily save program data to disk (and later retrieve it) without the effort of mapping complex C++ or Java data structures to a form that can be stored in files or a relational DBMS.

"OLAP" and "OLTP"

"Data Warehouses"/On Line Analytical Processing (OLAP)

Many DBMS contain data collected primarily for analysis, for example all the data for all sales at all stores for a drug store chain. These are called data warehouses and the activity they support is called on line analytical processing (OLAP). These databases are often updated by nightly batch updates that extract critical information from the transactional systems.

On Line Transaction Processing (OLTP)

These are DBMS that support the transactional data mentioned above. The quintessential examples are airline reservation systems or trading systems. At the high end these systems can manage many thousands of updates a minutes, and often involve special terminal equipment.

(A dichotomy between 'transactional' and 'warehouse' databases exists in high-throughput biology, where a LIMS (laboratory information management system) keeps track of individual experiments and their results, and is used only by a relatively small production team, but finished data is deposited in a data warehouse which many scientist can use for their analysis.)

Transactions

(The use of the word "transaction" here is only loosely connected to its use in "on line transaction system".) DBMS transactions provide two of the important capabilities of DBMS discussed above:

plus a few more.

Often transactions are bracketed by begin transaction and commit query language statements. Usually all the work done since the last "begin transaction" statement can be undone with an abort or rollback statement.

Here are examples that show the importance of (i) independent concurrent access by multiple readers and writers. and (ii) guarantees of no corruption if the client program or DBMS server abort.

Interfering Concurrent Writers

Concurrent deposit and withdrawal from joint banking account.
Account Holder 1 (will deposit $20) Account Holder2 (will withdraw $30) Balance in Database
$100
read balance from DB ($100) $100
read balance from DB ($100) $100
balance = balance + 20 ($120) $100
balance = balance - 30 ($70) $100
save balance ($120) $120
save balance ($70) $70

Corruption on Crash

Transfer $40 from checking to savings.
Action Checking Balance in Database Savings Balance in Database
get checking balance ($100) $100 $0
get savings balance ($0) $100 $0
checking balance = checking balance - $40 $100 $0
savings balance = savings balance + $40 $100 $0
save checking balance ($60) $60 $0
C R A S H !

Other Topics

MS Access
A database "front end" that gives an spreadsheet like interface to a DBMS (or, to a lightweight built-in DBMS if no other DBMS is available). Probably quite useful if used judiciously. For Oracle there is TOAD (tool for Oracle developers), which provides a fairly generic database browser, include spreadsheet views of each table.

PC and Mac database management systems.
Probably not worth using for most bioinformatics tasks. If it is worth using a DBMS, it is probably worth doing it on Unix and setting up a full featured DBMS, or at least MySQL (for light tasks).

Locking
One way to make sure that concurrent writers and readers do not corrupt data is to use locks. A process that wants to write to a file or table (or record, or database) asks for a write lock. If another process is reading or writing then the process that wants to write has to wait. A process that wants to read ask for a read lock The locking mechanism never grants a write lock when a read lock or another write lock is held, and never grants a read lock when a write lock is held. (This is one of the mechanisms that DBMS use, in incredibly sophisticated and elaborate form, to implement transactions.) However, Unix itself also offers facilities for locking files (more in a subsequent lecture).

The ACEDB data manager.
A custom database management system (basically a kind of object oriented system, but not like the object oriented systems discussed above). Views objects as trees, often of tag/value pairs. The name stands for "A C. elegans Data Base", and there is often confusion between the ACeDB data manager and the ACeDB database content (i.e.C. elegansdata) hosted in the ACeDB data manager. Suitable for read-mostly databases of biological data, with many almost ready-to-use schemas.

Offers a query language (not SQL) and locking to prevent problems with concurrent access. Advantages are a built-in data browser and full-featured graphical user interfaces for molecular biology data.

Further Reading

To get more background you can look at conceptual books for academic computer science courses. One reasonable choice is C. J. Date, An Introduction to Database Systems (Introduction to Database Systems, Ed 7) Amazon.com: $64.95. 975 pages (October 1999) Addison-Wesley Pub Co; ISBN: 0201385902, which at least has been around a long time. See the amazon.com reviews; some hate it, some like it. This book does not provide details of any specific DBMS (i.e. it is not a how-to book), and starts out with some theoretical (but probably useful) underpinnings.

For MySQL Gabor Marth recommends Paul DuBois, MySQL, New Riders, 2000.

There is also an unending supply of "DBMS X Unleashed" type books, often hastily thrown together to cash in the on the latest release. These books can be good if you are working with DBMS X. Browse your local technical bookstore and find one or two you like for the DBMS you are using. Oracle has a whole series from Oracle Press, but Oracle's on-line documentation is more useful.

Examples, Example Data Sets, etc.

Example DBI program (similar to Perl Cookbook recipe 14.10).

Menagerie create table and load data statements.

Menagerie pet table data (pet.txt).

Menagerie event table data (event.txt).


Genome Informatics
Steve Rozen, rozen@wi.mit.edu

Whitehead Institute for Biomedical Research
Last modified: Mon Oct 25 17:22:56 EDT 1999