See Appendix B of SQL in 10 Minutes for the main differences between the most popular DBMS.
Luckily, thanks to some expert Perl programming, there is a way to (mostly) get around the proliferation of
different DBMS with their special ways of doing things and their special
extensions of SQL.
In this chapter we'll use the Perl
First, I enter the interactive
[tisdall@coltrane tisdall]$ mysql -u tisdall -p Enter password: Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 2 to server version: 3.23.41 Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
Next I ask to be shown a list of all the databases that are defined in my MySQL DBMS:
mysql> show databases; +----------+ | Database | +----------+ | caudyfly | | dicty | | gadfly | | master | | mysql | | poetry | | yeast | +----------+ 7 rows in set (0.15 sec)
I want to create a database called "homologs".
First I create it, then check that it's there,
and then I make it the active database with
mysql> create database homologs; Query OK, 1 row affected (0.00 sec) mysql> show databases; +----------+ | Database | +----------+ | caudyfly | | dicty | | gadfly | | homologs | | master | | mysql | | poetry | | yeast | +----------+ 8 rows in set (0.01 sec) mysql> use homologs; Database changed
The next commands show me creating the two tables for the homologs database.
Initially they are empty.
I ask to see the fields that have been defined with
"
mysql> create table genename ( name char(20), id int, date date ); Query OK, 0 rows affected (0.00 sec) mysql> create table organism ( organism char(20), gene char(20) ); Query OK, 0 rows affected (0.00 sec) mysql> show tables; +--------------------+ | Tables_in_homologs | +--------------------+ | genename | | organism | +--------------------+ 2 rows in set (0.00 sec) mysql> show fields from genename; +-------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+----------+------+-----+---------+-------+ | name | char(20) | YES | | NULL | | | id | int(11) | YES | | NULL | | | date | date | YES | | NULL | | +-------+----------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> show fields from organism; +----------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------+----------+------+-----+---------+-------+ | organism | char(20) | YES | | NULL | | | gene | char(20) | YES | | NULL | | +----------+----------+------+-----+---------+-------+ 2 rows in set (0.00 sec) mysql>
Now I've got a new database with two tables defined, and I'm ready to populate the
tables.
First I verify that the
mysql> select * from genename;
Empty set (0.00 sec)
mysql> insert into genename (name,id,date) values ('aging',118,'1984-07-13');
Query OK, 1 row affected (0.00 sec)
mysql> insert into genename (name,id,date) values ('wrinkle',9223,'1987-08-15');
Query OK, 1 row affected (0.00 sec)
mysql> insert into genename (name,id,date) values ('hairy',273,'1990-09-30');
Query OK, 1 row affected (0.01 sec)
mysql> insert into genename (name,id,date) values ('hairy',273,'1990-09-30');
Query OK, 1 row affected (0.01 sec)
mysql> select * from genename;
+---------+------+------------+
| name | id | date |
+---------+------+------------+
| aging | 118 | 1984-07-13 |
| wrinkle | 9223 | 1987-08-15 |
| hairy | 273 | 1990-09-30 |
+---------+------+------------+
3 rows in set (0.00 sec)
Now I repeat the same process to populate the other
mysql> show fields from organism; +----------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------+----------+------+-----+---------+-------+ | organism | char(20) | YES | | NULL | | | gene | char(20) | YES | | NULL | | +----------+----------+------+-----+---------+-------+ 2 rows in set (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'human', 118 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'human', 9223 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'mouse', 9223 ); Query OK, 1 row affected (0.01 sec) mysql> insert into organism ( organism, gene ) values ( 'mouse', 273 ); Query OK, 1 row affected (0.00 sec) mysql> insert into organism ( organism, gene ) values ( 'worm', 118 ); Query OK, 1 row affected (0.00 sec) mysql> select * from organism; +----------+------+ | organism | gene | +----------+------+ | human | 118 | | human | 9223 | | mouse | 9223 | | mouse | 273 | | worm | 118 | +----------+------+ 5 rows in set (0.00 sec)
Let's find out what organisms have a homolog of the "wrinkle" gene. First we do it in two stages, getting the id of the gene and then searching in the ORGANISM table. Then we do it as a single SQL statement.
mysql> select id from genename where name = 'wrinkle';
+------+
| id |
+------+
| 9223 |
+------+
1 row in set (0.00 sec)
mysql> select organism from organism where gene = 9223;
+----------+
| organism |
+----------+
| human |
| mouse |
+----------+
2 rows in set (0.00 sec)
mysql> select organism from organism, genename
-> where genename.name = 'wrinkle' and genename.id = organism.gene;
+----------+
| organism |
+----------+
| human |
| mouse |
+----------+
2 rows in set (0.00 sec)
mysql>
mysql>
Notice how the last statement asks the same question as the two preceeding statements combined.
So this small example has shown how we can, just by typing at the interactive MySQL program, create, populate, and query a database using SQL statements. Later in the chapter we'll learn how to perform the SQL populate and query commands by means of a Perl program.
If you have to install and maintain your own database, whether
MySQL or some other, you will be performing the task called
Database administration encompasses such tasks as installing
and configuring the DBMS, backing up the data,
adding users and setting their various permissions,
applying updates or new capabilities to the system,
and similar tasks.
If you just have yourself and a fairly small lab to deal
with, it's not too bad.
But organizations often hire
one or more
If you are a beginning programmer, and you need to install and maintain a database management system, then you have a little work ahead to read the manuals and learn the tools. Even if you have some computer administration experience, there is a bit of learning involved. The very best thing you can do is to get the help of an experienced database administrator.
Failing such expert help, it's necessary to get good
documentation and to follow it.
This depends on the system you're using, of course.
If MySQL, some recommendations are provided in
Appendix A,
One function that a database administrator of more than a single-user system needs to know is how to add users and set their permissions, that is, what operations they're allowed to perform, and what resources they're allowed to view or change. In MySQL, for example, each user needs an account name and a password for access (and these are not tied into the rest of the account names and passwords on the computer system.) Security can be important as well. You may use the database to manage your new data and results, which you don't want to release to the public just yet; and at the same time you may be providing the public, through a web site, access to your more established data and results. A system such as MySQL provides several tools to set up and manage accounts and security permissions.
One essential task for any computer systems effort, including working with databases, is to backup your work. All computers will break; every disk drive will "crash" and become inoperable. If you don't have timely backups of your data, you will lose it, certainly.
There are a variety of ways to backup data. Even MySQL has more than one method. But even if you backup your data from the database into a backup file, it's still necessary to make a copy of the backup file in some other location than on the same hard disk. For the small to medium project, it's possible to run a program that simply makes a text file containing MySQL commands that can be used to repopulate your database. This is often a convenient and workable method. (See the MySQL documentation for alternatives, however.)
Here, then, is how you can make a backup or "dump" of a database, in this case to the disk file
[tisdall@coltrane development]$ mysqldump homologs -u tisdall -p > homologs.dump Enter password: [tisdall@coltrane development]$
After that command a dump file called
[tisdall@coltrane development]$ cat homologs.dump
# MySQL dump 8.14
#
# Host: localhost Database: homologs
#--------------------------------------------------------
# Server version 3.23.41
#
# Table structure for table 'genename'
#
CREATE TABLE genename (
name char(20) default NULL,
id int(11) default NULL,
date date default NULL
) TYPE=MyISAM;
#
# Dumping data for table 'genename'
#
INSERT INTO genename VALUES ('aging',118,'1984-07-13');
INSERT INTO genename VALUES ('wrinkle',9223,'1987-08-15');
INSERT INTO genename VALUES ('hairy',273,'1990-09-30');
#
# Table structure for table 'organism'
#
CREATE TABLE organism (
organism char(20) default NULL,
gene char(20) default NULL
) TYPE=MyISAM;
#
# Dumping data for table 'organism'
#
INSERT INTO organism VALUES ('human','118');
INSERT INTO organism VALUES ('human','9223');
INSERT INTO organism VALUES ('mouse','9223');
INSERT INTO organism VALUES ('mouse','273');
INSERT INTO organism VALUES ('worm','118');
[tisdall@coltrane development]$
Once you've backed up your data, you can drop the database (wiping out the data) and then create the (empty) database again. Then you can use your dump file to redefine the tables and repopulate them with your saved rows of data.
Here's how you might delete a database and reload it from your dump backup file
mysql> drop database homologs; Query OK, 0 rows affected (0.00 sec) mysql> show databases; +----------+ | Database | +----------+ | caudyfly | | dicty | | gadfly | | master | | mysql | | poetry | | yeast | +----------+ 7 rows in set (0.00 sec) mysql> create database homologs; Query OK, 1 row affected (0.00 sec) mysql> use homologs; Database changed mysql> source homologs.dump; Query OK, 0 rows affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 0 rows affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) Query OK, 1 row affected (0.00 sec) mysql> show tables; +--------------------+ | Tables_in_homologs | +--------------------+ | genename | | organism | +--------------------+ 2 rows in set (0.00 sec) mysql> select * from genename; +---------+------+------------+ | name | id | date | +---------+------+------------+ | aging | 118 | 1984-07-13 | | wrinkle | 9223 | 1987-08-15 | | hairy | 273 | 1990-09-30 | +---------+------+------------+ 3 rows in set (0.00 sec) mysql> select * from organism; +----------+------+ | organism | gene | +----------+------+ | human | 118 | | human | 9223 | | mouse | 9223 | | mouse | 273 | | worm | 118 | +----------+------+ 5 rows in set (0.00 sec) mysql>
SQL is a fairly simple and easy-to-learn language, considered by most to be well-tailored to its task.
However, there are many things that are available in most programming languages, such as control flow ("while", "for", "foreach") and conditional branches ("if-else") that are not provided in most implementations of the language. The lack of these abilities severely restricts the use of SQL as a standalone language.
Most programs that make use of a relational database are written in some other language like Perl. From Perl the programmer does the interaction with the user, with the screen, with files, with the web server, and so on. Perl also provides the program logic. And the interaction with the database is typically to execute some database commands, such as fetching data from the database and processing it using Perl's capabilities. The logic of the program may depend on the data found in the database, but it is Perl, not SQL, that provides this logic (for the most part.)
In Perl a set of modules have been written that allow interaction with relational databases. DataBase Independent (DBI) is a module that handles most of the interaction from the program code; DataBase Dependent (or DataBase Driver) (DBD) is a set of modules, different for each particular DBMS, that handles actually communicating with the DBMS.
To use a MySQL database from Perl you need first to have MySQL installed and properly configured. This is not a Perl job, but a database administration job: you have to get MySQL and install it on your system and set up the appropriate user accounts and permissions.
Then you have to install the Perl DBD driver for MySQL. DETAILS.
Finally, you have to have the Perl DBI module installed.
Its home page is
This combination of MySQL (the DBMS), DBD (the particular driver for your DBMS), and DBI (the Perl interface to the DBI and DBMS), is what gives you the actual connection from Perl to the database, and enables you to send SQL statements to the database and retrieve results.
Let's say you have the components installed (MySQL, Perl, DBD, DBI), and you want to write a program that "talks to" the database. We'll assume that we've implemented the last, normalized version of the "HOMOLOGS" database as shown previously, and walk through a small example that shows how to read data in from a file, populate a database, send queries, and retrieve results.
First, here is the data as we might find it in a file. All the whitespace between the words is the tab character in the file, not space characters.
TABLE ORGANISM OrgId Organism 1 human 2 worm 3 mouse TABLE GENES GeneId Gene Date 118 aging 1984-07-13 9223 wrinkle 1987-08-15 273 hairy 1990-09-30 TABLE VARIANTS VarId OrgId GeneId 1 1 118 2 2 118 3 1 9223 4 3 9223 5 3 273
There are several ways you might find the data that you want in a database. It's very common to have it in a plain file that has tables represented by lines, one table row on each line, with the field values separated by tabs or some other character that doesn't appear in any of the values of any field.
We'll write a short Perl program that reads such a file in and uses the data to send "insert" statements to a MySQL database, thereby populating the database.
You should bear in mind that this is just one of several possibilities for the source and the format
of your input data.
See the interesting book
SQL itself provides a utility for this purpose, called
To start developing such a utility, here is a short program that reads the file and knows the table it's reading, the field names, and the data for each row:
#!/usr/bin/perl
use strict;
use warnings;
my $flag = 0;
my $table;
my @table;
my @fieldnames;
my @fields;
while(<>) {
if(/^\s*$/) {
# skip blank lines
;
}elsif(/^TABLE\t(\w+)/) {
# output previous table
print(@table) if $flag;
$flag = 1;
# begin new table
@table = ();
$table = $1;
push(@table, "\nTable is $table\n");
} elsif($flag == 1) {
@fieldnames = split;
$flag = 2;
push(@table, "Fields are ", join("|", @fieldnames), "\n");
} elsif($flag == 2) {
@fields = split;
push(@table, join("|", @fields) . "\n");
}
}
# output last table
print @table;
This program understands the file format we gave previously, reads it in, and then reformats it and prints it out. It's just an example of how you might read in data. In the following, we'll modify this program to read in the file, but instead of printing out the (reformatted) tables, it will send SQL commands to the MySQL database to insert the data into the appropriate tables.
As you see, this first version of the program keeps track of what it's reading by means of the
Otherwise, if the
Finally, if the
And when all the input is done, and the
If we call this program "homologs.getdata" and give it our data file "homologs.tabs" like so:
% perl homologs.getdata homologs.tabs
we get the following output:
Table is ORGANISM Fields are OrgId|Organism 1|human 2|worm 3|mouse Table is GENES Fields are GeneId|Gene|Date 118|aging|1984-07-13 9223|wrinkle|1987-08-15 273|hairy|1990-09-30 Table is VARIANTS Fields are VarId|OrgId|GeneId 1|1|118 2|2|118 3|1|9223 4|3|9223 5|3|273
Notice that all we've really done here is read in the data and printed it out in a slightly different format; among other things we've changed the delimiter between fields from a tab to a vertical bar, a common type of task with these database dumps. But now let's see how to interact with an actual database.
Now let's take the "homologs.getdata" program from the previous section and add the DBI calls to the MySQL database that will serve to populate the MySQL database with the data that we are reading in.
For starters, let's just see a Perl program that connects to the database, asks a simple question ("What tables are in this database?"), displays the results, and disconnects.
#!/usr/bin/perl
use strict;
use warnings;
# Make connection with MySQL database
use DBI;
my $database = 'homologs';
my $server = 'localhost';
my $user = 'tisdall';
my $passwd = 'NOTmyPASSWORD';
my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
# prepare an SQL statement
my $query = "show tables";
my $sql = $homologs->prepare($query);
# execute an SQL statement
$sql->execute();
# retrieve and print results
while (my $row = $sql->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
# Break connection with MySQL database
$homologs->disconnect;
exit;
Here's the result of running this program (showing the tables on my computer at some moment):
GENES ORGANISM VARIANTS
This program does the basic tasks that all DBI programs have to do, so let's examine them in useful detail:
After the obligatory
The actual connection happens here:
my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
The
The output of the
Next the DBI object prepares an SQL statement, and the result is saved as a new
statement object, which I call
The actual SQL here is a very simple one.
You've already seen that
After
The last DBI call is to
The program you've just seen is the kind of sample program you should run when you
first try to install and use the DBI module.
If it works, you're in business.
If not, you have to closely examine the error messages you get out to identify where
the problem is.
One thing that can help is to add an additional argument to the
my $homologs = DBI->connect(
"dbi:mysql:$database:$server", $user, $passwd, {RaiseError=>1}
);
This will terminate the program with extra error messages if the
You've now seen a program that simply connects to the database and executes a simple SQL statement. Next, here is a program that does a little more: it reads in the tab-delimited file "homologs.tabs" that we've seen previously, extracts the data, and uses it to load the tables in a MySQL database. Read it over: most of it will be familiar from previous programs in this chapter.
#!/usr/bin/perl
use strict;
use warnings;
# Make connection with MySQL database
use DBI;
my $database = 'homologs';
my $server = 'localhost';
my $user = 'tisdall';
my $passwd = 'NOTmyPASSWORD';
my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
my $sqlinit = $homologs->prepare("show tables");
$sqlinit->execute();
while (my $row = $sqlinit->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
my $flag = 0;
my $table;
my @tables;
my $sql;
while(<>) {
# skip blank lines
if(/^\s*$/) {
next;
# begin new table
}elsif(/^TABLE\t(\w+)/) {
$flag = 1;
$table = $1;
push(@tables, $table);
# Delete all rows in database table
my $droprows = $homologs->prepare("delete from $table");
$droprows->execute();
# get fieldnames, prepare SQL statement
} elsif($flag == 1) {
$flag = 2;
my @fieldnames = split;
my $query = "insert into $table ("
. join(",", @fieldnames)
. ") values ("
. "?, " x (@fieldnames-1)
. "?)";
$sql = $homologs->prepare($query);
# get row, execute SQL statement
} elsif($flag == 2) {
my @fields = split;
$sql->execute( @fields);
}
}
# Check if tables were updated
foreach my $table (@tables) {
my $query = "select * from $table";
my $sql = $homologs->prepare($query);
$sql->execute();
while (my $row = $sql->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
}
# Break connection with MySQL database
$homologs->disconnect;
exit;
This program is called by giving it the name of the tab-delimited file on the command line (the same file used previously with the "homologs.getdata" program):
% perl homologs.load homologs.tabs
and this is the output of the program:
GENES ORGANISM VARIANTS Table: ORGANISM 1 human 2 worm 3 mouse Table: GENES 118 aging 1984-07-13 9223 wrinkle 1987-08-15 273 hairy 1990-09-30 Table: VARIANTS 1 1 118 2 2 118 3 1 9223 4 3 9223 5 3 273
This "homologs.load" program is very much like the previous "homologs.getdata" program.
Now instead of building an output array
Notice that each time "homologs.load" finds a new table, it first empties all rows of that table in the database; then it proceeds to read in the lines of data and insert new rows into the table.
Notice also that when the program reads the line of the input file that names the
fields (when $flag equals 1) it just does the
The SQL statement that is the argument to the
my $query = "insert into $table ("
. join(",", @fieldnames)
. ") values ("
. "?, " x (@fieldnames-1)
. "?)";
This is a bit hard to read at first sight. However, it is typical of what happens when you use one language (Perl) to make a statement in another language (SQL). So let's explain this one carefully as a good example of the breed.
The SQL query is being formed by five strings that are joined by the dot (.) string operator:
recall that
"insert into $table ("
join(",", @fieldnames)
") values ("
"?, " x (@fieldnames-1)
"?)"
The question marks in the statement are
If that query is called with
"insert into EXONS (Exon,Position) values (?, ?)"
How does this statement get constructed? Let's look at it in detail.
The first string
The second string
The third string
The fourth string
The fifth string
And so the final result is
As the Perl program reads in the data rows from the input file (when $flag equals 2), the values are
placed in the array
To check what actually happened to the database after the reading in and processing of the file is
complete, the program sends an SQL query for each table to see what's in it.
This is done with the SQL
Here, we're asking to see all fields (*) from the database table, and since no restrictions are added,
SQL will show us all the fields of all the rows.
As before, the result of this SQL query is read using the
This last program
Exercise 1
What's the difference between a
Exercise 2
Compare MySQL with some other relational database management system; include such management issues as cost of purchase, cost of maintenance, availability of skilled personnel, stability of vendor in the marketplace, customer support.
Exercise 3
Name two bioinformatics problems that are ill-served by the data structures of a relational database.
Exercise 4
Connect to the mysql DBMS (use "mysql"), create a new database and a table in that database, insert some data, and perform some queries.
Exercise 5
Implement a relational database that supports your project.
Bioinformatics: Writing Software for Genome Research
Jim Tisdall, James.Tisdall@dupont.com
Last modified: Sun Oct 20 15:36:27 EDT 2002