Lecture

Calling Relational Databases from Perl

James Tisdall

Bioinformatics: Writing Software for Genome Research

Suggested Reading

  1. SQL in 10 Minutes
  2. The Perl Cookbook 2nd edition Section 14.9 "Executing an SQL Command Using DBI
  3. Chapter 6 "Perl and Relational Databases" of Mastering Perl for Bioinformatics
  4. Programming the Perl DBI by Alligator Descartes and Tim Bunce
  5. MySQL manual
  6. MySQL homepage

Lecture Notes

One Perl, many databases

See Appendix B of SQL in 10 Minutes for the main differences between the most popular DBMS.

Luckily, thanks to some expert Perl programming, there is a way to (mostly) get around the proliferation of different DBMS with their special ways of doing things and their special extensions of SQL. In this chapter we'll use the Perl DBI (DataBase Independent) module that provides a common interface to different relational database systems; it makes it possible to write SQL that will run on many different relational database systems with little or no change.

Structured query language

First, I enter the interactive mysql program, providing my MySQL username ("tisdall") and interactively entering my MySQL account password:

[tisdall@coltrane tisdall]$ mysql -u tisdall -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 3.23.41

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

Next I ask to be shown a list of all the databases that are defined in my MySQL DBMS:

mysql> show databases;
+----------+
| Database |
+----------+
| caudyfly |
| dicty    |
| gadfly   |
| master   |
| mysql    |
| poetry   |
| yeast    |
+----------+
7 rows in set (0.15 sec)

I want to create a database called "homologs". First I create it, then check that it's there, and then I make it the active database with use homologs;:


mysql> create database homologs;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+----------+
| Database |
+----------+
| caudyfly |
| dicty    |
| gadfly   |
| homologs |
| master   |
| mysql    |
| poetry   |
| yeast    |
+----------+
8 rows in set (0.01 sec)

mysql> use homologs;
Database changed

The next commands show me creating the two tables for the homologs database. Initially they are empty. I ask to see the fields that have been defined with "show fields" ("show full columns" would also work):

mysql> create table genename ( name char(20), id int, date date );
Query OK, 0 rows affected (0.00 sec)

mysql> create table organism ( organism char(20), gene char(20) );
Query OK, 0 rows affected (0.00 sec)

mysql> show tables;
+--------------------+
| Tables_in_homologs |
+--------------------+
| genename           |
| organism           |
+--------------------+
2 rows in set (0.00 sec)

mysql> show fields from genename;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| name  | char(20) | YES  |     | NULL    |       |
| id    | int(11)  | YES  |     | NULL    |       |
| date  | date     | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> show fields from organism;
+----------+----------+------+-----+---------+-------+
| Field    | Type     | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| organism | char(20) | YES  |     | NULL    |       |
| gene     | char(20) | YES  |     | NULL    |       |
+----------+----------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql>

Now I've got a new database with two tables defined, and I'm ready to populate the tables. First I verify that the genename table is empty by making a "select" command; then I issue three "insert" commands, one for each row that I want to insert in the table. After inserting, I verify that the genename table now has the desired three rows by means of a "select" command:

mysql> select * from genename;
Empty set (0.00 sec)

mysql> insert into genename (name,id,date) values ('aging',118,'1984-07-13');
Query OK, 1 row affected (0.00 sec)

mysql> insert into genename (name,id,date) values ('wrinkle',9223,'1987-08-15');
Query OK, 1 row affected (0.00 sec)

mysql> insert into genename (name,id,date) values ('hairy',273,'1990-09-30');
Query OK, 1 row affected (0.01 sec)
mysql> insert into genename (name,id,date) values ('hairy',273,'1990-09-30');
Query OK, 1 row affected (0.01 sec)

mysql> select * from genename;
+---------+------+------------+
| name    | id   | date       |
+---------+------+------------+
| aging   |  118 | 1984-07-13 |
| wrinkle | 9223 | 1987-08-15 |
| hairy   |  273 | 1990-09-30 |
+---------+------+------------+
3 rows in set (0.00 sec)

Now I repeat the same process to populate the other organism table:

mysql> show fields from organism;
+----------+----------+------+-----+---------+-------+
| Field    | Type     | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| organism | char(20) | YES  |     | NULL    |       |
| gene     | char(20) | YES  |     | NULL    |       |
+----------+----------+------+-----+---------+-------+
2 rows in set (0.00 sec)
mysql> insert into organism ( organism, gene ) values ( 'human', 118 );
Query OK, 1 row affected (0.00 sec)

mysql> insert into organism ( organism, gene ) values ( 'human', 9223 );
Query OK, 1 row affected (0.00 sec)

mysql> insert into organism ( organism, gene ) values ( 'mouse', 9223 );
Query OK, 1 row affected (0.01 sec)

mysql> insert into organism ( organism, gene ) values ( 'mouse', 273 );
Query OK, 1 row affected (0.00 sec)

mysql> insert into organism ( organism, gene ) values ( 'worm', 118 );
Query OK, 1 row affected (0.00 sec)

mysql> select * from organism;
+----------+------+
| organism | gene |
+----------+------+
| human    | 118  |
| human    | 9223 |
| mouse    | 9223 |
| mouse    | 273  |
| worm     | 118  |
+----------+------+
5 rows in set (0.00 sec)

Let's find out what organisms have a homolog of the "wrinkle" gene. First we do it in two stages, getting the id of the gene and then searching in the ORGANISM table. Then we do it as a single SQL statement.

mysql> select id from genename where name = 'wrinkle';
+------+
| id   |
+------+
| 9223 |
+------+
1 row in set (0.00 sec)

mysql> select organism from organism where gene = 9223;
+----------+
| organism |
+----------+
| human    |
| mouse    |
+----------+
2 rows in set (0.00 sec)

mysql> select organism from organism, genename
    -> where genename.name = 'wrinkle' and genename.id = organism.gene;
+----------+
| organism |
+----------+
| human    |
| mouse    |
+----------+
2 rows in set (0.00 sec)

mysql> 

mysql> 

Notice how the last statement asks the same question as the two preceeding statements combined.

So this small example has shown how we can, just by typing at the interactive MySQL program, create, populate, and query a database using SQL statements. Later in the chapter we'll learn how to perform the SQL populate and query commands by means of a Perl program.

A short course in database administration

If you have to install and maintain your own database, whether MySQL or some other, you will be performing the task called database administration.

Database administration encompasses such tasks as installing and configuring the DBMS, backing up the data, adding users and setting their various permissions, applying updates or new capabilities to the system, and similar tasks. If you just have yourself and a fairly small lab to deal with, it's not too bad. But organizations often hire one or more database administrators to do this work full-time; even a smallish project, if it's critical to have it done very well, and the budget exists, can benefit from the attention of a professional database administrator.

If you are a beginning programmer, and you need to install and maintain a database management system, then you have a little work ahead to read the manuals and learn the tools. Even if you have some computer administration experience, there is a bit of learning involved. The very best thing you can do is to get the help of an experienced database administrator.

Failing such expert help, it's necessary to get good documentation and to follow it. This depends on the system you're using, of course. If MySQL, some recommendations are provided in Appendix A, Resources.

One function that a database administrator of more than a single-user system needs to know is how to add users and set their permissions, that is, what operations they're allowed to perform, and what resources they're allowed to view or change. In MySQL, for example, each user needs an account name and a password for access (and these are not tied into the rest of the account names and passwords on the computer system.) Security can be important as well. You may use the database to manage your new data and results, which you don't want to release to the public just yet; and at the same time you may be providing the public, through a web site, access to your more established data and results. A system such as MySQL provides several tools to set up and manage accounts and security permissions.

One essential task for any computer systems effort, including working with databases, is to backup your work. All computers will break; every disk drive will "crash" and become inoperable. If you don't have timely backups of your data, you will lose it, certainly.

There are a variety of ways to backup data. Even MySQL has more than one method. But even if you backup your data from the database into a backup file, it's still necessary to make a copy of the backup file in some other location than on the same hard disk. For the small to medium project, it's possible to run a program that simply makes a text file containing MySQL commands that can be used to repopulate your database. This is often a convenient and workable method. (See the MySQL documentation for alternatives, however.)

Here, then, is how you can make a backup or "dump" of a database, in this case to the disk file homologs.dump:

[tisdall@coltrane development]$ mysqldump homologs -u tisdall -p > homologs.dump
Enter password: 
[tisdall@coltrane development]$ 

After that command a dump file called homologs.dump has been created. Here's what it looks like for our little two table database:

[tisdall@coltrane development]$ cat homologs.dump
# MySQL dump 8.14
#
# Host: localhost    Database: homologs
#--------------------------------------------------------
# Server version	3.23.41

#
# Table structure for table 'genename'
#

CREATE TABLE genename (
  name char(20) default NULL,
  id int(11) default NULL,
  date date default NULL
) TYPE=MyISAM;

#
# Dumping data for table 'genename'
#

INSERT INTO genename VALUES ('aging',118,'1984-07-13');
INSERT INTO genename VALUES ('wrinkle',9223,'1987-08-15');
INSERT INTO genename VALUES ('hairy',273,'1990-09-30');

#
# Table structure for table 'organism'
#

CREATE TABLE organism (
  organism char(20) default NULL,
  gene char(20) default NULL
) TYPE=MyISAM;

#
# Dumping data for table 'organism'
#

INSERT INTO organism VALUES ('human','118');
INSERT INTO organism VALUES ('human','9223');
INSERT INTO organism VALUES ('mouse','9223');
INSERT INTO organism VALUES ('mouse','273');
INSERT INTO organism VALUES ('worm','118');

[tisdall@coltrane development]$ 

Once you've backed up your data, you can drop the database (wiping out the data) and then create the (empty) database again. Then you can use your dump file to redefine the tables and repopulate them with your saved rows of data.

Here's how you might delete a database and reload it from your dump backup file homologs.dump:

mysql> drop database homologs;
Query OK, 0 rows affected (0.00 sec)

mysql> show databases;
+----------+
| Database |
+----------+
| caudyfly |
| dicty    |
| gadfly   |
| master   |
| mysql    |
| poetry   |
| yeast    |
+----------+
7 rows in set (0.00 sec)

mysql> create database homologs;
Query OK, 1 row affected (0.00 sec)

mysql> use homologs;
Database changed
mysql> source homologs.dump;
Query OK, 0 rows affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)

mysql> show tables;
+--------------------+
| Tables_in_homologs |
+--------------------+
| genename           |
| organism           |
+--------------------+
2 rows in set (0.00 sec)

mysql> select * from genename;
+---------+------+------------+
| name    | id   | date       |
+---------+------+------------+
| aging   |  118 | 1984-07-13 |
| wrinkle | 9223 | 1987-08-15 |
| hairy   |  273 | 1990-09-30 |
+---------+------+------------+
3 rows in set (0.00 sec)

mysql> select * from organism;
+----------+------+
| organism | gene |
+----------+------+
| human    | 118  |
| human    | 9223 |
| mouse    | 9223 |
| mouse    | 273  |
| worm     | 118  |
+----------+------+
5 rows in set (0.00 sec)

mysql> 

DBI and DBD Perl modules

Limitations of SQL

SQL is a fairly simple and easy-to-learn language, considered by most to be well-tailored to its task.

However, there are many things that are available in most programming languages, such as control flow ("while", "for", "foreach") and conditional branches ("if-else") that are not provided in most implementations of the language. The lack of these abilities severely restricts the use of SQL as a standalone language.

Most programs that make use of a relational database are written in some other language like Perl. From Perl the programmer does the interaction with the user, with the screen, with files, with the web server, and so on. Perl also provides the program logic. And the interaction with the database is typically to execute some database commands, such as fetching data from the database and processing it using Perl's capabilities. The logic of the program may depend on the data found in the database, but it is Perl, not SQL, that provides this logic (for the most part.)

DBI and DBD interface to Perl

In Perl a set of modules have been written that allow interaction with relational databases. DataBase Independent (DBI) is a module that handles most of the interaction from the program code; DataBase Dependent (or DataBase Driver) (DBD) is a set of modules, different for each particular DBMS, that handles actually communicating with the DBMS.

To use a MySQL database from Perl you need first to have MySQL installed and properly configured. This is not a Perl job, but a database administration job: you have to get MySQL and install it on your system and set up the appropriate user accounts and permissions.

Then you have to install the Perl DBD driver for MySQL. DETAILS.

Finally, you have to have the Perl DBI module installed. Its home page is http://www.symbolstone.org/technology/perl/DBI. DETAILS.

This combination of MySQL (the DBMS), DBD (the particular driver for your DBMS), and DBI (the Perl interface to the DBI and DBMS), is what gives you the actual connection from Perl to the database, and enables you to send SQL statements to the database and retrieve results.

Let's say you have the components installed (MySQL, Perl, DBD, DBI), and you want to write a program that "talks to" the database. We'll assume that we've implemented the last, normalized version of the "HOMOLOGS" database as shown previously, and walk through a small example that shows how to read data in from a file, populate a database, send queries, and retrieve results.

Tab-delimited input files

First, here is the data as we might find it in a file. All the whitespace between the words is the tab character in the file, not space characters.

TABLE	ORGANISM
OrgId	Organism
1	human
2	worm
3	mouse

TABLE	GENES
GeneId	Gene	Date
118	aging	1984-07-13
9223	wrinkle	1987-08-15
273	hairy	1990-09-30

TABLE	VARIANTS
VarId	OrgId	GeneId
1	1	118
2	2	118
3	1	9223
4	3	9223
5	3	273

There are several ways you might find the data that you want in a database. It's very common to have it in a plain file that has tables represented by lines, one table row on each line, with the field values separated by tabs or some other character that doesn't appear in any of the values of any field.

We'll write a short Perl program that reads such a file in and uses the data to send "insert" statements to a MySQL database, thereby populating the database.

You should bear in mind that this is just one of several possibilities for the source and the format of your input data. See the interesting book Data Munging With Perl in Appendix A for lots of useful lore about getting data in and out of various sources.

SQL itself provides a utility for this purpose, called load, which assumes that you have a file consisting of only rows of data. You can specify what columns to load, what delimiter the file uses (tab by default), and a few other options. Its performance is optimized, and it is much faster than executing several SQL "insert" statements. But there is still a need for a more general purpose ability to read data in from files in different formats: what better than your own program that you can alter to suit any occasion?

To start developing such a utility, here is a short program that reads the file and knows the table it's reading, the field names, and the data for each row:

#!/usr/bin/perl

use strict;
use warnings;

my $flag = 0;
my $table;
my @table;
my @fieldnames;
my @fields;

while(<>) {
    if(/^\s*$/) {
        # skip blank lines
        ;
    }elsif(/^TABLE\t(\w+)/) {
        # output previous table
        print(@table) if $flag;
        $flag = 1;
        # begin new table
        @table = ();
        $table = $1;
	push(@table, "\nTable is $table\n");
    } elsif($flag == 1) {
        @fieldnames = split;
        $flag = 2;
	push(@table, "Fields are ", join("|", @fieldnames), "\n");
    } elsif($flag == 2) {
        @fields = split;
	push(@table, join("|", @fields) . "\n");
    }
}

# output last table
print @table;

This program understands the file format we gave previously, reads it in, and then reformats it and prints it out. It's just an example of how you might read in data. In the following, we'll modify this program to read in the file, but instead of printing out the (reformatted) tables, it will send SQL commands to the MySQL database to insert the data into the appropriate tables.

As you see, this first version of the program keeps track of what it's reading by means of the $flag variable. Every time the input line begins with TABLE\t (that \t is a tab that actually shows up as whitespace) the program will output the previously read table (if $flag indicates there was one.) It then saves the next word as the table's name, sets the $flag to 1, and prepares some output in the array @table.

Otherwise, if the $flag variable is set to 1, the program knows it's on the second line of a table (remember, this program is specially written for the input file format we gave previously). In this case it saves the names of the fields in an array, and then reformats them and adds them to the @table output array.

Finally, if the $flag variable is set to 2, the program knows it's reading rows of the table; it reformats them and adds them to the @table output array.

And when all the input is done, and the while loop finishes, there will be the last table's reformatted output ready to be printed from the @table output array.

If we call this program "homologs.getdata" and give it our data file "homologs.tabs" like so:

% perl homologs.getdata homologs.tabs

we get the following output:


Table is ORGANISM
Fields are OrgId|Organism
1|human
2|worm
3|mouse

Table is GENES
Fields are GeneId|Gene|Date
118|aging|1984-07-13
9223|wrinkle|1987-08-15
273|hairy|1990-09-30

Table is VARIANTS
Fields are VarId|OrgId|GeneId
1|1|118
2|2|118
3|1|9223
4|3|9223
5|3|273

Notice that all we've really done here is read in the data and printed it out in a slightly different format; among other things we've changed the delimiter between fields from a tab to a vertical bar, a common type of task with these database dumps. But now let's see how to interact with an actual database.

A DBI example

Now let's take the "homologs.getdata" program from the previous section and add the DBI calls to the MySQL database that will serve to populate the MySQL database with the data that we are reading in.

For starters, let's just see a Perl program that connects to the database, asks a simple question ("What tables are in this database?"), displays the results, and disconnects.

#!/usr/bin/perl

use strict;
use warnings;

# Make connection with MySQL database

use DBI;

my $database = 'homologs';
my $server = 'localhost';
my $user = 'tisdall';
my $passwd = 'NOTmyPASSWORD';

my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);

# prepare an SQL statement

my $query = "show tables";
my $sql = $homologs->prepare($query);

# execute an SQL statement

$sql->execute();

# retrieve and print results

while (my $row = $sql->fetchrow_arrayref) {
    print join("\t", @$row), "\n";
}

# Break connection with MySQL database

$homologs->disconnect;

exit;

Here's the result of running this program (showing the tables on my computer at some moment):

GENES
ORGANISM
VARIANTS

This program does the basic tasks that all DBI programs have to do, so let's examine them in useful detail:

After the obligatory use DBI; that loads the DBI module, I declare some variables to hold the string that specifies to DBI who is connecting to what, and where.

The actual connection happens here:

my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);

The connect method is asked to connect to the particular database (in this case the "homologs" database) on the local computer ("localhost": this could be replaced by a url of another computer) as the user with the given password. mysql is specified: if you are using another DBMS, you will want to change this to name your type of database system. Other optional arguments, not used here, are possible (see the documentation.) The connect method is the DBI "new" method that creates (and initializes) a DBI object.

The output of the connect call is saved as a new DBI object in the reference variable $homologs.

Next the DBI object prepares an SQL statement, and the result is saved as a new statement object, which I call $sql here. This statement object $sql then calls its own execute method which actually does the job of sending the SQL to the database.

The actual SQL here is a very simple one. You've already seen that DBI->connect specifies the particular database on your MySQL that you want to use ("homologs" in this case). So this SQL statement show tables is simply asking for a list of the names of the tables that are defined in that database.

After execute, the program retrieves the results of executing the SQL statement. There are several ways of retrieving results. Here, the statement object method fetchrow_arrayref is called in a loop to fetch all the rows of the result; at each pass through the loop, the return value in $row points to the array of fields in that row. Here I simply separate the fields with tab characters with the help of the Perl join function on the dereferenced array @$row, and print the row with a newline.

The last DBI call is to disconnect from the database. This is actually an important call to make; depending on your implementation and how you're running your program, it is sometimes possible to open a number of connections and eventually tax the MySQL DBMS to the point where it has to refuse any more connect requests. Especially if you have an active database with regular queries coming in, you want to disconnect as soon as possible from each connect.

The program you've just seen is the kind of sample program you should run when you first try to install and use the DBI module. If it works, you're in business. If not, you have to closely examine the error messages you get out to identify where the problem is. One thing that can help is to add an additional argument to the connect call that asks for more error reporting, like so:

my $homologs = DBI->connect(
        "dbi:mysql:$database:$server", $user, $passwd, {RaiseError=>1}
);

This will terminate the program with extra error messages if the connect call fails; and this is usually where things go wrong when you first try to use this software. (See the documentation for other such options to connect.) Remember that you need a username and password for MySQL itself, and that these are not related in any way to the username and password on your computer outside of MySQL. You also need to have the database defined (the SQL statement create database life will create a database called "life", for instance) and you have to have your MySQL permissions set properly to allow you to do the things you want to do, such as create or modify databases, or to see other databases on the system. If there's a problem, ask your database administrator, or consult your MySQL documentation or visit the MySQL web site.

You've now seen a program that simply connects to the database and executes a simple SQL statement. Next, here is a program that does a little more: it reads in the tab-delimited file "homologs.tabs" that we've seen previously, extracts the data, and uses it to load the tables in a MySQL database. Read it over: most of it will be familiar from previous programs in this chapter.

#!/usr/bin/perl

use strict;
use warnings;

# Make connection with MySQL database

use DBI;

my $database = 'homologs';
my $server = 'localhost';
my $user = 'tisdall';
my $passwd = 'NOTmyPASSWORD';

my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
my $sqlinit = $homologs->prepare("show tables");
$sqlinit->execute();
while (my $row = $sqlinit->fetchrow_arrayref) {
	print join("\t", @$row), "\n";
}

my $flag = 0;
my $table;
my @tables;
my $sql;

while(<>) {
    # skip blank lines
    if(/^\s*$/) {
        next;

    # begin new table
    }elsif(/^TABLE\t(\w+)/) {
        $flag = 1;
        $table = $1;
	push(@tables, $table);
	# Delete all rows in database table
	my $droprows = $homologs->prepare("delete from $table");
	$droprows->execute();

    # get fieldnames, prepare SQL statement
    } elsif($flag == 1) {
        $flag = 2;
        my @fieldnames = split;
	my $query = "insert into  $table ("
                     . join(",", @fieldnames)
		     . ") values ("
		     . "?, " x (@fieldnames-1)
		     . "?)";
	$sql = $homologs->prepare($query);

    # get row, execute SQL statement
    } elsif($flag == 2) {
        my @fields = split;
	$sql->execute( @fields);
    }
}

# Check if tables were updated

foreach my $table (@tables) {
	my $query = "select * from $table";
	my $sql = $homologs->prepare($query);
	$sql->execute();

	while (my $row = $sql->fetchrow_arrayref) {
	    print join("\t", @$row), "\n";
	}
}

# Break connection with MySQL database

$homologs->disconnect;

exit;

This program is called by giving it the name of the tab-delimited file on the command line (the same file used previously with the "homologs.getdata" program):

% perl homologs.load homologs.tabs

and this is the output of the program:

GENES
ORGANISM
VARIANTS

Table: ORGANISM

1	human
2	worm
3	mouse

Table: GENES

118	aging	1984-07-13
9223	wrinkle	1987-08-15
273	hairy	1990-09-30

Table: VARIANTS

1	1	118
2	2	118
3	1	9223
4	3	9223
5	3	273

This "homologs.load" program is very much like the previous "homologs.getdata" program. Now instead of building an output array @tables and printing the text to the screen, "homologs.load" puts the data into the MySQL database using SQL statements; then when the database is loaded it retrieves the data from the tables and prints it to the screen.

Notice that each time "homologs.load" finds a new table, it first empties all rows of that table in the database; then it proceeds to read in the lines of data and insert new rows into the table.

Notice also that when the program reads the line of the input file that names the fields (when $flag equals 1) it just does the prepare statement. Then, when the actual lines of data are being read (when $flag equals 2) the values of the fields are passed to the execute command which sends the SQL command to the database system.

The SQL statement that is the argument to the prepare method is built up from information that the program knows at that point. Here it is again:

my $query = "insert into  $table ("
             . join(",", @fieldnames)
	     . ") values ("
	     . "?, " x (@fieldnames-1)
	     . "?)";

This is a bit hard to read at first sight. However, it is typical of what happens when you use one language (Perl) to make a statement in another language (SQL). So let's explain this one carefully as a good example of the breed.

The SQL query is being formed by five strings that are joined by the dot (.) string operator: recall that "r" . "DNA" has the value "rDNA". Here are the five strings being joined:

"insert into  $table ("
join(",", @fieldnames)
") values ("
"?, " x (@fieldnames-1)
"?)"

The question marks in the statement are bind variables that will be passed the values given when the execute statement is called.

If that query is called with $table = EXONS and @fieldnames = (Exon, Position) the resulting SQL statement will be

"insert into  EXONS (Exon,Position) values (?, ?)"

How does this statement get constructed? Let's look at it in detail.

The first string just interpolates the table name EXONS into the string.

The second string joins the field names with commas.

The third string appears as is.

The fourth string uses the Perl "x" string operator to make a new string that has a certain number of copies of the original string "?, ". The desired number of copies is specified on the right-hand side of the x operator. The string itself is on the left-hand side of the x operator, namely "?, ".

@fieldnames in a scalar context returns the number of elements of the @fieldnames array; we need one less than that plus an additional question mark (because no commas are allowed after the last item in the list in SQL). So @fieldnames-1 is the desired number of copies of the string.

The fifth string "?)" is just the last question mark and the closing parenthesis.

And so the final result is insert into EXONS (Exon,Position) values (?, ?)

As the Perl program reads in the data rows from the input file (when $flag equals 2), the values are placed in the array @fields and then passed as variables (to take the place of the question marks in the SQL statement that's been prepared) to the execute method, which sends the SQL statement to the database system. These question marks are the bind variables; here we need one for each field, as we'll be passing in the field values when execute is eventually called on this statement.

To check what actually happened to the database after the reading in and processing of the file is complete, the program sends an SQL query for each table to see what's in it. This is done with the SQL select command, which is a general-purpose command to get information out of a database, and has a great many options.

Here, we're asking to see all fields (*) from the database table, and since no restrictions are added, SQL will show us all the fields of all the rows. As before, the result of this SQL query is read using the fetchrow_arrayref DBI method in a while loop, and each resulting row is printed with tab-separated fields.

This last program homologs.load is a typical DBI program, interacting with both the world through Perl (e.g. reading in files, displaying the results to the user) and also interacting with the database through the Perl DBI module and SQL statements.

Problems

Exercise 1

What's the difference between a database and a database management system? What's the difference between MySQL and Oracle?

Exercise 2

Compare MySQL with some other relational database management system; include such management issues as cost of purchase, cost of maintenance, availability of skilled personnel, stability of vendor in the marketplace, customer support.

Exercise 3

Name two bioinformatics problems that are ill-served by the data structures of a relational database.

Exercise 4

Connect to the mysql DBMS (use "mysql"), create a new database and a table in that database, insert some data, and perform some queries.

Exercise 5

Implement a relational database that supports your project.

Bioinformatics: Writing Software for Genome Research


Jim Tisdall, James.Tisdall@dupont.com
Last modified: Sun Oct 20 15:36:27 EDT 2002