Databases 1 & 2
Jason Stajich and Aaron Mackey
Suggested Reading
- The MySQL on-line tutorial,
http://www.mysql.com/documentation/mysql/bychapter/manual_Tutorial.html#Tutorial
- SAMS Teach Yourself SQL in 10 Minutes, 2ed,
Ben Forta, SAMS Publishing, Indianapolis, 2001. To-the-point and accurate. Please let me know what you think.
You can get scripts and data for the examples from
http://www.forta.com/books/
.
- Perl Cookbook, Tom Christiansen and Nathan Torkington, O'Reilly,
1998, section 14.10, "Executing an SQL Command Using DBI and DBD".
(Talking to a relational database from perl.)
Lecture Notes
Goal:
-
Provide a high level
overview of databases and database management
systems and introduce the main concepts and
specialized vocabulary used in the field.
-
Provide practical experience querying and updating
relational databases.
"Database" versus "Database Management System"
Databases are collections of data; computerized databases
are collections of data on computers.
Computerized databases can be stored in one or more files, or
they can be managed by a software system called a database
management system (DBMS).
To use a database effectively one needs ways to:
- Add data
- Delete data
- Change data
- Look up or search for particular data
- Organize the data
- Organize interactions with the data
- Maintain conventions for understanding the
meaning of the data.
"Flat" File Databases
- One or more files.
- One or more
programs that people use to add, delete, change, and look up data.
Key terminology:
- Field
- Record
- Field delimiter (or terminator), often "\t" (e.g. in "tab separated files"
produced by Excel).
- Record delimiter (or terminator). Often a newline
(perl "\n"), but sometimes another character or string.
Genbank records are terminated with "\n//".
All programs (and programmers) that use the database files
need to understand the organization and conventions of the
database. (What would happen if a record without
the "\n//" got into Genbank?)
For example, records might be of the form
last name\tfirst name\tphone number
(terminated by a newline, and with fields delimited
by "\t"s).
All programs have to understand that
the last name field comes before the
first name field. If the file were maintained
sorted by last name, lookup programs
could to a "binary search" to find a particular
last name, but all update programs need
to cooperate to keep the file sorted by
last name (e.g. when a new record is added or
when a last name is changed).
Indexed File Databases
Main difference from "Flat" File Databases: indexed look
up by key (just like a perl hash, but on disk). Main
advantage: speed.
For example we might want to look up
one entry in Genbank out of 7,077,491 entries (release 118).
Suppose that retrieving
each entry as part of sequential scan requires 0.1 millisecond per entry
on average.
Then looking up one entry by accession would take
708 seconds!
Indexed files use special data structures (roughly
similar to perl hashes) to allow you to look up
one entry in a millisecond or so.
It is easy to work with indexed file databases in perl
using the "DBM" family of perl modules.
See Perl Cookbook, 14.1 to 14.6. If you decide
use this technology, Section 14.5, "Locking DBM Files"
is very important.
All programs that deal with the database do so through
intermediary software (usually running as a server or set of server
processes), called the database management system (DBMS).
The DBMS offers many services including:
- Schemas, which describe and enforce
the "shape" of the data.
- Consistency rules and checking
(integrity constraints)
for what data make sense. These can
simple, involving individual data items, such as
"age must be greater than 0 and less than 150", or
complex, such as "the patient id associated with
each DNA sample must refer to a patient already in
the database".
- Guarantee of no corruption if the client program or
DBMS server abort.
- Non-interfering concurrent access by multiple readers and writers.
- Support for backup and recovery (including
sometimes replication, which automates
near real-time replication
of all updates at a remote database).
- Support for searching, including a query language,
that allows succinct specification of data lookups and updates.
Usually the query language is "SQL", which dates to the early 1980's.
This is the language that Oracle, Sybase, Microsoft SQL Server
and MySQL use.
- Support for various kinds of indexes to make searching
more efficient.
- User authentication and access control.
- DBMS vendors often offer
user interface (UI) design tools (such as
Powerbuilder and Developer 2000) that use the schema to make
the job of specifying user interfaces easier. DBMS vendors are
also now providing server side web support, for example Oracle's
WebDB. These facilities really are not part of the DBMS, however,
but are efforts by the DBMS vendors to make their products
easier to use.
All this support comes at a price: organizations must manage an
additional, complex software system, the DBMS, and the performance
of DBMS based databases is often worse than file databases.
Individuals that manage and administrate DBMSs are called
database administrators (DBAs), though in reality
the job of database administration is often distributed among
several people.
- Relational
- Object Relational
- Extensible Relational or Extensible Object Relational
- "Data Warehouses"/On Line Analytical Processing
- Object Oriented
Relational
This DBMS flavor accounts the vast majority of "serious"
DBMSs. Examples include:
- Oracle
- Sybase
- Microsoft SQL Server
- DB2 (IBM)
- Informix
- MySQL (still missing some "standard" features).
- Postgres
The basic world view of relational databases is that all data is
stored in tables (a technical term in this context).
A table store multiple rows of data, each row has
multiple, fixed columns. ("Rows" and "columns" are
also technical terms in the relational world.) Rows are
are roughly analogous to file records, and columns to
fields, and in fact these terms are often used instead.
A table is roughly analogous to a file (but
do not call it a file when talking about a relational database.)
Relational databases offer a query language
SQL (sometimes pronounced "sequel").
They usually also offer stored procedures,
allowing database designers the ability to
move some common code out of application programs and
into the database.
Object Relational
In relational databases, the value of each column must be
what in perl we call a "scalar": lists, arrays, other
complex objects are forbidden. This restriction is
the first normal form (1NF) restriction.
Object relational database relax this restriction,
and allow arbitrary values in columns, including
lists or tables.
They also allow definition of objects (basically
rows with methods), and allow some forms
of extension (see next).
The "object-relational" query language, SQL3,
supports these additional constructs and is backward
compatible with SQL.
Oracle version 8 is the most visible example of an object relational
DBMS. Adoption of the object-oriented features of Oracle 8 has been
slow, possibly because some user interface tools have not been
able work with the object oriented features.
PostgreSQL is an open source object-relational DBMS
(http://www.us.postgresql.org)
Extensible Relational or Extensible Object Relational
Another form adding objects to relational DBMS is to create new object
types in C++ or Java, and have the DBMS manage them (including
invoking methods while processing queries). Object relational
database usually support this kind of of extensibility,
as do Sybase and DB2.
Object Oriented
Object oriented database management systems are rather different from
object relational databases. The unifying design theme is that the
database stores the objects in the "same" form as they are used
within an object oriented programming language (currently C++ or Java).
(This idea is a bit like a complete generalization of the perl 'tie'
function and the DBM family, which we cover in a separate lecture.)
For example, one can store references to hashes
mapping strings to array references merely by assigning them
to an already persistent variable (one that is in the
database). These databases occupy a niche market. One of their
best use is as "embedded" databases that allow programmers to easily
save program data to disk (and later retrieve it) without the effort
of mapping complex C++ or Java data structures to a form that can be
stored in files or a relational DBMS.
"Data Warehouses"/On Line Analytical Processing (OLAP)
Many DBMS contain data collected primarily for analysis, for
example all the data for all sales at all stores for a drug store chain.
These are called data warehouses and the activity they
support is called on line analytical processing (OLAP).
These databases are often updated by nightly batch
updates that extract critical information from the transactional
systems.
On Line Transaction Processing (OLTP)
These are DBMS that support the transactional data mentioned above.
The quintessential examples are airline reservation systems or
trading systems. At the high end these systems can manage
many thousands of updates a minutes, and often involve special
terminal equipment.
(A dichotomy between 'transactional' and
'warehouse' databases exists in high-throughput biology, where a
LIMS (laboratory information management system) keeps track
of individual experiments and their results, and is used
only by a relatively small production team, but finished data is
deposited in a data warehouse which many scientist can use for
their analysis.)
(The use of the word "transaction" here is only loosely connected
to its use in "on line transaction system".) DBMS
transactions provide two of the important
capabilities of DBMS discussed above:
- Independent concurrent access by multiple readers and writers.
- Guarantee of no corruption if the client program or
DBMS server abort.
plus a few more.
Often transactions are bracketed by begin transaction
and commit query language statements.
Usually all the work done since the last "begin transaction"
statement can be undone with an abort or
rollback statement.
Here are examples that show the importance
of (i) independent concurrent access by multiple readers and writers.
and (ii) guarantees of no corruption if the client program or
DBMS server abort.
Interfering Concurrent Writers
Concurrent deposit and withdrawal from joint banking account.
| Account Holder 1 (will deposit $20) | Account Holder2 (will withdraw $30) | Balance in Database
|
| | | $100
|
| read balance from DB ($100) | | $100
|
| | read balance from DB ($100) | $100
|
| balance = balance + 20 ($120) | | $100
|
| | balance = balance - 30 ($70) | $100
|
| save balance ($120) | | $120
|
| | save balance ($70) | $70
|
Corruption on Crash
Transfer $40 from checking to savings.
| Action | Checking Balance in Database | Savings Balance in Database
|
| get checking balance ($100) | $100 | $0
|
| get savings balance ($0) | $100 | $0
|
| checking balance = checking balance - $40 | $100 | $0
|
| savings balance = savings balance + $40 | $100 | $0
|
| save checking balance ($60) | $60 | $0
|
| C R A S H !
|
- MS Access
- A database "front end" that gives an spreadsheet like interface to
a DBMS (or, to a lightweight built-in DBMS if no other DBMS is available). Probably
quite useful if used judiciously. For Oracle there is TOAD (tool for
Oracle developers), which provides a fairly generic database
browser, include spreadsheet views of each table.
- PC and Mac database management systems.
- Probably not worth using for most bioinformatics tasks. If it is worth
using a DBMS, it is probably worth doing it on Unix and setting up a full featured DBMS,
or at least MySQL (for light tasks).
- Locking
- One way to make sure that concurrent writers and readers do not
corrupt data is to use locks. A process that wants
to write to a file or table (or record, or database) asks
for a write lock. If another process is reading or writing
then
the process that wants to write has to wait. A process that wants to
read ask for a read lock The locking mechanism
never grants a write lock when a read lock or another write lock
is held, and never grants a read lock when a write lock is held.
(This is one of the mechanisms
that DBMS use, in incredibly sophisticated and elaborate form,
to implement transactions.) However, Unix itself also offers
facilities for locking files (more in a subsequent lecture).
- The ACEDB data manager.
- A custom database management system (basically a kind of object oriented
system, but not like the object oriented systems discussed above). Views
objects as trees, often of tag/value pairs. The name stands for
"A C. elegans Data Base", and there is often confusion between the
ACeDB data manager and the ACeDB database content (i.e.C. elegansdata)
hosted in the ACeDB data manager. Suitable for read-mostly databases of biological data,
with many almost ready-to-use schemas.
Offers a query language (not SQL) and locking to prevent
problems with concurrent access.
Advantages are a built-in data browser and full-featured
graphical user interfaces for molecular biology data.
Further Reading
To get more background you can look at
conceptual books for academic computer science courses.
One reasonable choice is C. J. Date,
An Introduction to Database Systems
(Introduction to Database Systems, Ed 7) Amazon.com:
$64.95. 975 pages (October 1999) Addison-Wesley Pub Co; ISBN:
0201385902, which at least has been around a long time. See the
amazon.com reviews; some hate it, some like it.
This book does not provide details of any specific DBMS (i.e.
it is not a how-to book), and starts out with some
theoretical (but probably useful) underpinnings.
For MySQL Gabor Marth recommends Paul DuBois, MySQL,
New Riders, 2000.
There is also an unending supply of
"DBMS X Unleashed" type books, often hastily thrown together to
cash in the on the latest release. These books can be good if you are
working with DBMS X. Browse your local technical bookstore and
find one or two you like for the DBMS you are using. Oracle has a
whole series from Oracle Press, but Oracle's on-line
documentation is more useful.
Examples, Example Data Sets, etc.
Example DBI program (similar
to Perl Cookbook recipe 14.10).
Menagerie create table and load data statements.
Menagerie pet table data (pet.txt).
Menagerie event table data (event.txt).
Genome Informatics
Steve Rozen,
rozen@wi.mit.edu
Whitehead Institute for Biomedical
Research
Last modified: Mon Oct 25 17:22:56 EDT 1999