Providing Web Information Services I

Lincoln Stein (from notes written by Steve Rozen)

Genome Informatics

Reference Materials

Lincoln Stein, How to Set Up and Maintain a Web Site, 2nd Ed., HTML Quick Reference (next to last page).

Followup Reading

Lincoln Stein, How to Set Up and Maintain a Web Site, 2nd Ed., Chapter 5 (Creating Hypertext Documents), and Chapter 2 (Unraveling the Web: How it All Works) through the subsection "MIME Types and File Extensions".

Lecture Notes

This lecture assumes that you can use a working web server that is already available on your system.

A Simple Web Site

From the Unix side, a simple web site is a set of Unix directories containing documents that the web server can deliver to browsers such as Netscape and Internet Explorer. The browser specifies which document it wants by using a URL (web address).

[Freehand Sketch of Web Architecture]

A more sophisticated web site can allow browsers to send information back to the server. (This is what happens when you fill out a web form and then press the "submit" button.) The web server passes this information to programs that the web site designers wrote, and then returns any results produced by the program to the browser. (We will learn to write such server-side or CGI programs in a subsequent lecture.) A sophisticated web site can also send whole programs to the browser to be executed there. Often these are Java, JavaScript, or ActiveX programs.

Web Servers

The most commonly used Unix web server is Apache (www.apache.org). Windows and Mac systems have their own web servers.

Where to Create Your Site

To create a simple web site you ask your system administrator to tell you where you should place your web documents (and to make sure you have Unix permissions to put documents there). These documents are likely to be visible to the whole Internet (at least to people who know where to look for them) unless they are behind a firewall. You also have to ask your system administrator what the URL for these documents will be.

For our course, create the directory ~/public_html/. Make sure that both your home directory and ~/public_html are world readable and executable (so that the web server, which runs as a separate Unix user in a separate Unix group, can read them):

chmod a+rx ~
chmod a+rx ~/public_html

A document X created in your "public_html" directory will have the URL http://bush1/~your_user_name/X (e.g. http://bush1/~srozen/example.html). It will be behind a firewall, so only your fellow classmates and others at CSHL will be able to see it.
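
As a rough way to confirm that the chmod commands took effect, the following Python sketch checks the "other" read and execute bits on a directory. The function name is made up for illustration, and the check only inspects permission bits; it does not show that the web server is actually configured to serve the directory.

```python
import os
import stat

def world_readable_and_executable(path):
    """Return True if 'other' users have both read and execute permission on path."""
    mode = os.stat(path).st_mode
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed

# Check the two directories named in the notes, if they exist on this machine.
for directory in ("~", "~/public_html"):
    full_path = os.path.expanduser(directory)
    if os.path.isdir(full_path):
        print(directory, world_readable_and_executable(full_path))
    else:
        print(directory, "does not exist")
```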

A URL (e.g. http://bush1/~srozen/example.html) is composed of three parts:

  1. A protocol, in this case http. (Netscape assumes your protocol is http unless you specify one.)

  2. A host, in this case bush1.

  3. A path, in this case ~srozen/example.html
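
The three parts can also be pulled apart programmatically. This is a small sketch using Python's standard urllib.parse module; the helper name url_parts is an assumption made for this example.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the three parts named above: protocol, host, and path."""
    parts = urlsplit(url)
    # urlsplit keeps the leading "/" on the path; strip it to match the notes.
    return parts.scheme, parts.hostname, parts.path.lstrip("/")

protocol, host, path = url_parts("http://bush1/~srozen/example.html")
print("protocol:", protocol)
print("host:", host)
print("path:", path)
```

Unlike the browser, urlsplit does not assume http when the protocol is left out, so the full URL has to be supplied.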

The path does not correspond exactly to a Unix directory path, but the two are usually related by reasonable rules. For example, if you create a subdirectory problem1 of ~/public_html and put the file example2.html there, you can access it at the URL http://bush1/~your_user_name/problem1/example2.html. (Make sure to set the permissions so that the web server can read and execute the directory and read the file, as shown above.)
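
One plausible version of such a rule can be sketched in Python. The location of home directories (/home) and the function name are assumptions for illustration only; the real mapping is decided by how the web server on your system is configured.

```python
import posixpath

def user_url_path_to_file(url_path, home_root="/home", web_dir="public_html"):
    """Map a URL path such as '~srozen/problem1/example2.html' to a file path.

    home_root and web_dir are assumed values, not facts about any
    particular server.
    """
    url_path = url_path.lstrip("/")
    if not url_path.startswith("~"):
        raise ValueError("this sketch only handles ~user paths")
    # Everything up to the first "/" is the user name; the rest is the
    # path below that user's web directory.
    user, _, rest = url_path[1:].partition("/")
    return posixpath.join(home_root, user, web_dir, rest)

print(user_url_path_to_file("~srozen/problem1/example2.html"))
```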

MIME Types

Each web document has a MIME type that tells the browser (Netscape) how the document should be displayed. The MIME type that is simplest to create is called text/plain. To create a text/plain file, simply create a file in your web directory with the extension .txt and put some text in it. The browser presents the text from such a file without any modification or formatting.

The web server or the browser often determines the MIME type of a document from its extension (called a "suffix" in Netscape preferences). For example, JPEG images have MIME type image/jpeg and have the extensions jpeg, jpg, etc.

Some of the other MIME types that you are likely to encounter as a web user (and as a web site designer) are text/html, image/gif, and application/pdf (the format that Acrobat Reader uses, and which is popular among electronic journal publishers). Some MIME types can be handled by the browser itself. Others must be sent to a plug-in or helper application (plug-ins are somewhat more integrated than helper applications). For example, you need Acrobat Reader to view application/pdf files. To keep your web site as hassle-free as possible, stick to MIME types that are built in to Netscape. (For example, many biology lab Macs do not even have Netscape configured to read pdf files.)
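
The extension-to-type lookup described above can be tried out with Python's standard mimetypes module. This is only a sketch of the idea: the helper name and the fallback type are assumptions, and a real web server uses its own configured table rather than this module.

```python
import mimetypes

def mime_type_for(filename, default="application/octet-stream"):
    """Guess a MIME type from the file extension alone."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # The fallback for unknown extensions is a choice made for this sketch.
    return guessed or default

for name in ("EUREKA.txt", "example.html", "webarch-vsmall.jpg"):
    print(name, "->", mime_type_for(name))
```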

Introduction to HTML

HTML (HyperText Markup Language) (MIME type text/html, file extensions html, htm) is still the central MIME type for web documents. Core HTML provides formatting and hyperlinks (and "forms", covered in another lecture). You can create very complex sites (and many commercial web sites are very complex), but core HTML allows you to provide a huge amount of functionality for very little effort.

This is (slightly simplified) a fragment of the HTML from the beginning of this document:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html><head>
<title>Providing Web Information Services I</title>
</head>
<body>
<center>
<h1> Providing Web Information Services I</h1>
<h3> Steve Rozen</h3>
<h3><a href="../index.html">Genome Informatics</a></h3>
</center>
<a name="reference_materials"><h2>Reference Materials</h2></a><p>
Lincoln Stein, <i>How to Set Up and Maintain a Web Site, 2nd Ed.</i>,
HTML Quick Reference (next to last ...

</body>
</html>

Note the following:

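Declarations and comments <!...> (the <!DOCTYPE ...> line is a document type declaration; ordinary comments are written <!-- ... -->)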
Document "outline" <html><head><title>...</title></head><body>...</body></html>
Pairs of tags

<whatever> matches </whatever> (but not every <whatever> requires a </whatever>)

Paired tags must nest

E.g. <head><title>...</title></head> NOT <head><title>...</head></title>

Center format instruction <center>...</center>

Heading format instructions

E.g. <h1>Providing Web Information Services</h1>, <h2>...</h2>, etc.--distinct from <title>...</title>
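
The nesting rule can be checked mechanically. The sketch below uses Python's standard html.parser module and a simple stack; the class and function names are invented for this example, and the set of tags treated as not needing a closing tag is a simplification rather than a complete list.

```python
from html.parser import HTMLParser

# Tags that this sketch treats as not needing a closing tag (simplified).
NO_CLOSE_NEEDED = {"p", "hr", "img", "br", "link", "meta", "area"}

class NestingChecker(HTMLParser):
    """Track open tags on a stack and record tags that close out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in NO_CLOSE_NEEDED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in NO_CLOSE_NEEDED:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else "nothing"
            self.errors.append(f"</{tag}> found while <{expected}> is open")

def nesting_errors(html_text):
    """Return a list of nesting problems; an empty list means none were found."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    # Anything still on the stack was opened but never closed.
    for tag in reversed(checker.open_tags):
        checker.errors.append(f"<{tag}> is never closed")
    return checker.errors

print(nesting_errors("<head><title>A title</title></head>"))
print(nesting_errors("<head><title>A title</head></title>"))
```

The first call reports no problems and the second reports at least one; the exact messages may vary with the Python version.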

Note: talk about the ~www/htdocs directory

Hyperlinks

The example above contains a hyperlink:

<a href="../index.html">Genome Informatics</a>

In this example the URL refers to the file index.html in the parent directory of the current directory. (Files named index.html often have a special role; web servers are often configured so that e.g. the URL http://bush1/ refers to http://bush1/index.html.) The text Genome Informatics between <a href="../index.html"> and </a> is called an anchor, and gives the reader something to click on.

Hyperlinks can also refer to an entirely different web site. For example, this is a link to the Apache web site: <a href="http://www.apache.org">Apache</a> (displayed as Apache). The href can be either relative (i.e. a path relative to the protocol, host, and path of the current document), or can specify a full protocol, host, and (optional) path.
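
To see how a browser turns a relative href into a full URL, here is a short sketch using urljoin from Python's standard urllib.parse module. The base URL is a hypothetical location for the document containing the links, chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the document that contains the links.
BASE = "http://bush1/~srozen/problem1/example2.html"

# Two relative hrefs and one that gives a full protocol and host.
for href in ("../index.html", "webarch.jpg", "http://www.apache.org"):
    print(href, "->", urljoin(BASE, href))
```

The relative hrefs are resolved against the directory of the base document, while the full URL is used unchanged.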

Other Essential HTML Tags

Some additional important basic HTML tags:

<p>
Start a new paragraph. (No need for </p>.)

<hr>
Draw a "horizontal rule" across the page. (No need for </hr>.)

<strong>...</strong>
Use a "strong" (e.g. bold) font, like this.

<pre>...</pre>
Leave the line breaks and whitespace the way they are. For example,
<pre>
    These
line       breaks
are not wrapped.
</pre>
gets presented like this:
    These
line       breaks
are not wrapped.

Images

To include an image such as the freehand web architecture diagram above, use, for example

<img src="webarch-vsmall.jpg" alt="[Freehand Sketch of Web Architecture]">

The image is just a file, in this case a "JPEG" image. You can even make a hyperlink from an image, for example, the HTML

<a href="webarch-med.jpg">
<img src="webarch-vsmall.jpg" align="middle"
     height=75 alt="[Freehand Sketch of Web Architecture]">
</a>

produces [Freehand Sketch of Web Architecture]. If you click on the image you get a bigger version of the image; the small image is the anchor. The target of the hyperlink can be anything, of course. For example

<a href="webarch.html">
<img src="webarch-vsmall.jpg" align="middle"
     height=75 alt="[Freehand Sketch of Web Architecture]">
</a>

produces [Freehand Sketch of Web Architecture], and if you click on it you get a new HTML document with some text on web architecture.

Note that an image to be displayed in-line in the web page and a link to an image are different things in HTML. Of course you can also create a text anchor for a hyperlink to an image. For example, "Click here for a really big picture" is coded by

<a href="webarch.jpg">Click here for a really big picture</a>

It is even possible to associate different hyperlinks with different parts of an image. See the discussion of clickable image maps in How to Set Up and Maintain a Web Site. (The HTML tags used for clickable image maps are <img>, <map>...</map> and <area>; <img> and <area> take no closing tag.)

As a first rule of thumb, do not get carried away with images. Some pages are basically a mosaic of images (and other byte-rich do-dads). (An example of what you might want to avoid is Fox Kids: http://www.foxkids.com/index.asp.) Images can be big and take a long time to download. Lots of people still need to use slow Internet connections. Developers, sitting at machines on the same LAN as the web server, often seem to forget this. (Actually the Fox Kids page is clever. It starts with an eye-catching animated graphic at http://www.foxkids.com that loads pretty quickly, then automatically goes to the slower-loading page, http://www.foxkids.com/index.asp.)
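
A back-of-the-envelope calculation shows why image size matters. The image size and connection speeds below are hypothetical figures picked for illustration, and the estimate ignores latency and protocol overhead.

```python
def download_seconds(size_bytes, bits_per_second):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_bytes * 8 / bits_per_second

image_bytes = 200 * 1024   # hypothetical 200 KB image
modem_bps = 56_000         # hypothetical 56 kbit/s dial-up modem
lan_bps = 10_000_000       # hypothetical 10 Mbit/s LAN

print(round(download_seconds(image_bytes, modem_bps), 1), "seconds over the modem")
print(round(download_seconds(image_bytes, lan_bps), 2), "seconds over the LAN")
```

With these assumed numbers the same image takes about 29 seconds over the modem but well under a second on the LAN, which is why the problem is easy to overlook from a developer's desk.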

A subsequent lecture will cover other HTML formatting capabilities, including lists, tables, and forms.

Controlling Access To Your Web Site

We include a brief overview here because controlling access to your web data is sometimes essential. There are several approaches (which can be combined). The approaches that will be available to you will depend on how your site is administered, and all require you to get help from your site administrator.

  1. Rely on an organizational firewall. (Drawbacks: Your pages might well be visible to many people within your organization. There is no way to let external collaborators see your pages.)

  2. Allow access only to browsers from particular IP addresses (for example only 18.157.0.217 or addresses that begin with 143.48.10). Drawbacks: IP addresses are often assigned dynamically to dialup clients or to Ethernet clients that use the DHCP protocol. PIs, vice presidents, senior collaborators, and their ilk are rarely able to determine their own IP address, so you cannot allow them access to the page that shows all the great work you did. Even worse, they will keep getting errors when they try to view the pages, leading them to the conclusion that you do not know what you are doing.

  3. Web servers can be configured so that the browser user must supply a user name and password before the server will return any pages. The user name/password facility can be built into the web server, or the web server can rely on a "security server".

  4. Unless special steps are taken, information transmitted over the Internet can be "sniffed" (read by programs that listen in on the transmissions). One approach to foiling eavesdropping is to use a web server that uses SSL (Secure Sockets Layer). SSL encrypts Internet transmissions to and from the browser. For an introduction to SSL, see How to Set Up and Maintain a Web Site, 2nd ed., pages 228-240. More elaborate approaches, such as Virtual Private Networks (VPNs), are now becoming popular.
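
The logic behind the IP address restriction in approach 2 can be sketched with Python's standard ipaddress module. In practice this check is done by the web server's configuration, not by your own code. The function name is made up, and reading "addresses that begin with 143.48.10" as the network 143.48.10.0/24 is an assumption.

```python
import ipaddress

# Example rules from the text. "Begins with 143.48.10" is read here as the
# 143.48.10.0/24 network, which is an assumption of this sketch.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("18.157.0.217/32"),
    ipaddress.ip_network("143.48.10.0/24"),
]

def is_allowed(client_ip):
    """Return True if client_ip falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)

for ip in ("18.157.0.217", "143.48.10.77", "143.48.100.5"):
    print(ip, is_allowed(ip))
```

Comparing networks rather than text prefixes avoids accidentally admitting an address such as 143.48.100.5, which also "begins with" the characters 143.48.10.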

In addition, allowing web browsers to launch programs on your home machine via the web server can introduce the risk of outsiders viewing data they are not supposed to, or even of having them execute programs on the server that can do damage. We will discuss these risks in a subsequent lecture when we discuss "CGI" programming.

Workshop Problem Set

  1. You can use Netscape to view the HTML pages you are writing without going through the web server. In fact we recommend this while you are creating web pages, because Netscape does not cache a file that it reads from the file system, whereas if you view a file via a web server Netscape will likely cache it, so you might not see your changes until you clear your cache (or turn off caching).

    Create a text/plain document called EUREKA.txt in your public_html directory and view it in your browser as a file. (In Netscape, go to "File", "Open Page...", "Choose File...", and then choose EUREKA.txt in the file chooser.) When prompted, choose "Open in Navigator".

  2. Now use your browser to view your text/plain document via the web server (http://bush1/~your_user_name/EUREKA.txt).

  3. You can use your browser to get the source HTML from other people's web documents. Use your browser to "View", "Page Source" for this document (the document you are viewing right now). Then, save a copy of this document as a file to your public_html directory. (Call it MY_info_svcs.html.) Edit it to remove the line
    <link rel="stylesheet" href="./standard.css">
      

    (near the top). Also delete this problem. Then view the new file in your browser (as a file) to make sure it looks OK. Once it looks OK put the modified page in your web site and view it via the web server (http://bush1/~your_user_name/MY_info_svcs.html). What happens to the images? Why?

  4. Use Netscape to download the picture of Cold Spring Harbor Laboratory at the top of the page http://www.cshl.org. Insert it into the top of http://bush1/~your_user_name/MY_info_svcs.html.

  5. Create your own file http://bush1/~your_user_name/index.html. If you use xemacs it will likely prompt you for the document title and then create an HTML document template including <html><head><title>...</title></head><body>...</body></html> as well as other stuff for you. Otherwise type that yourself. Try the various HTML constructs discussed in this lecture. Create a link to the main course page and to ~/public_html/MY_info_svcs.html. Create one link that uses a full URL (with protocol and host portions), and one "relative" URL that refers to a file in your public_html directory or a subdirectory.


Genome Informatics
Steve Rozen, rozen@gaiberg.wi.mit.edu
Whitehead Institute for Biomedical Research
Last modified: Mon Oct 21 18:47:36 EDT 2002