Mosaic And WAIS Tutorial

Marc Andreessen
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
marca@ncsa.uiuc.edu

Introduction

This tutorial surveys the current methods for using WAIS as a server and Mosaic as a client in a powerful, flexible, integrated, and open information system.

Use of the following software is assumed:

CNIDR's freeWAIS version 0.202 or later.
NCSA Mosaic for X version 2.0 (with direct WAIS support via linking to client libraries from freeWAIS 0.202 or later).

Please note that this tutorial is a free-form exposition of my experience with Mosaic and WAIS -- although that experience has spanned many months and although I have tested everything I outline below, there may be factual errors or incorrect assumptions at any point. Please drop me a note if you notice any mistakes.

Pieces of the Puzzle

Mosaic provides sophisticated client-side network information retrieval, display, and query capabilities via a user-friendly graphical interface.

WAIS provides advanced server-side search and retrieval capabilities, including support for binary datatypes and very fast searches of the entire contents of large textual databases.

Use Mosaic as a front-end client and WAIS as a back-end server and you can provide your users with a friendly yet powerful window into your information universe and sophisticated query, retrieval, and indexing capabilities.

Creating a WAIS Database

The following example can be done entirely without root privileges. If at all possible you should try to follow the exact sequence of steps on your own system; after you've been through it once, you'll have a good grasp of how to make WAIS work for you.

Download and install the freeWAIS 0.202 (or later) distribution from the UNC SunSITE FTP server. Installation instructions are in the file INSTALLATION in the freeWAIS distribution.

You can place data files of any type in a WAIS database; possibilities include HTML documents, plaintext documents, GIF images, audio files, and so on. In the following example, we will assume there will at least be HTML documents in the WAIS database, and possibly other types of files as well.

Create a directory (e.g. ~/fluff) and put copies of all the files you wish to place in the database in that directory. Make sure they all have relevant extensions (e.g. ".html" for HTML documents, ".gif" for GIF images) to make life easy for you in the short term.

Create a directory (e.g. ~/localwais/sources) to hold the WAIS index file for your database. This index file will be created automatically by the WAIS indexing program, waisindex, and will be consulted by the WAIS server program, waisserver, when clients ask the WAIS database for query information or specific documents.
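
The two directories described above can be created in one step. The following is a minimal sketch, written for plain sh rather than csh so that it can be wrapped in a function; the function name make_wais_dirs and its optional base-directory argument are illustrative assumptions, not part of freeWAIS.

```shell
#!/bin/sh
# Create the example directory layout used in this tutorial.
# The layout goes under the directory given as $1, or under $HOME
# if no argument is supplied (make_wais_dirs is a hypothetical helper).
make_wais_dirs() {
    base="${1:-$HOME}"
    # fluff holds the documents; localwais/sources holds the index files.
    mkdir -p "$base/fluff" "$base/localwais/sources"
}
```

Calling make_wais_dirs with no argument creates ~/fluff and ~/localwais/sources; passing a scratch directory lets you try the layout without touching your home directory.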

Create and run a shell script (call it doindex) that will index all of the files in ~/fluff and place the resulting index file in ~/localwais/sources. The following is such a shell script:

#!/bin/csh

# Go to the directory with the documents to be indexed.
cd ~/fluff

# Create index, initially with HTML documents.
waisindex -export -d ~/localwais/sources/marc -T HTML *.html

# Add plaintext documents to index.
waisindex -a -d ~/localwais/sources/marc -T TEXT *.txt
# Add PostScript documents to index -- index contents, why not?
waisindex -a -d ~/localwais/sources/marc -T PS *.ps

# The following types are all indexed without contents 
# (thus use of the -nocontents flag).  So all you can
# do is search on filenames...

# Add GIF images to index.
waisindex -a -d ~/localwais/sources/marc -T GIF -nocontents *.gif
# Add RGB images to index.
waisindex -a -d ~/localwais/sources/marc -T RGB -nocontents *.rgb
# Add HDF data files to index.
waisindex -a -d ~/localwais/sources/marc -T HDF -nocontents *.hdf
# Add audio files to index.
waisindex -a -d ~/localwais/sources/marc -T AU -nocontents *.au

Some things to note about the above shell script:

We make repeated calls to waisindex, the program that looks at files you are adding to a database and adds information about them to the database's index file. The information built up in that index file is used to allow very fast searches to be made across the entire contents of the files in the database.
The first call to waisindex uses the -export flag, which specifies that the database we're creating is to be made available over the network (the actual effect is to make sure that the database has a reasonable name).
Subsequent calls to waisindex use the -a flag to tell the indexer to add to an existing index rather than creating a new index. (The first call to waisindex created a new index.)
The -d ~/localwais/sources/marc arguments to waisindex tell the indexer what the name of the index should be. Since a single WAIS server can serve multiple WAIS indexes (databases), all the indexes are commonly kept in a single directory (in this case, ~/localwais/sources) and each index is given a distinct name (in this case, marc).
Each call to waisindex uses the -T flag to specify the type of the files being indexed at that time.
WAIS types have historically been ad hoc but straightforward -- TEXT for text files, GIF for GIF images, etc. Mosaic recognizes these ad hoc types using a method that the author thinks is actually pretty damn slick -- a WAIS type retrieved as the result of a query is matched to a MIME type as though it were a file extension.
In other words, since a file with extension ".text" is normally considered plaintext (MIME type text/plain) by Mosaic, a WAIS query result of WAIS type TEXT is also considered text/plain.
Similarly, if Mosaic were configured to recognize file extension ".foo" as MIME type application/x-foo, a WAIS query result of WAIS type FOO would also be considered of type application/x-foo.
(Note: At some point in the future, WAIS will start using MIME types directly. Mosaic supports this already: if a WAIS type corresponds to a MIME type that Mosaic understands, then Mosaic will recognize that and act appropriately.)
The -nocontents flag is used while indexing binary filetypes for which it would make no sense to actually index the contents. (E.g., indexing a GIF file's binary contents would do nothing useful.) Use of the -nocontents flag means that only the filename for each file being indexed is added to the index.
waisindex can be made recursive -- files in subdirectories will be indexed also -- via the -r flag (which we don't use in this example).
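
The type matching described in the notes above can be sketched roughly as follows. This is an illustration of the idea only, not Mosaic's actual code: the function names are hypothetical, the small extension table stands in for whatever the client is really configured with, and the lowercasing step and the fallback type are assumptions.

```shell
#!/bin/sh
# Illustrative stand-in for the client's extension-to-MIME table.
# The entries and the fallback are assumptions made for this sketch.
mime_for_extension() {
    case "$1" in
        text|txt) echo "text/plain" ;;
        html)     echo "text/html" ;;
        gif)      echo "image/gif" ;;
        ps)       echo "application/postscript" ;;
        au)       echo "audio/basic" ;;
        *)        echo "application/octet-stream" ;;
    esac
}

# Treat a WAIS type as though it were a file extension:
# fold it to lowercase, then look it up in the extension table.
wais_type_to_mime() {
    ext=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    mime_for_extension "$ext"
}

wais_type_to_mime TEXT   # prints text/plain
wais_type_to_mime GIF    # prints image/gif
```

Adding a line for "foo" that echoes application/x-foo would make WAIS type FOO resolve the same way, which mirrors the ".foo" case described above.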

To run waisserver -- the WAIS server program -- and therefore make your new index available to Mosaic clients over the network, construct and run a shell script (call it doserve) that looks like this:

#!/bin/csh

# Go to the directory containing the WAIS sources.
cd ~/localwais/sources

# Start the WAIS server in standalone mode; 
# have it use port 2010.
waisserver -p 2010 &

You now have a running WAIS server.

The URL for connecting to the server from Mosaic is:

    wais://machine:2010/marc

In this URL, machine is the name of the system on which you are running the WAIS server. 2010 is the port you chose to run the WAIS server on, and marc is the name you gave the WAIS database.

When you do a query on your new database, the resulting URL will look like this:

    wais://machine:2010/marc?query

... where query is the search string you enter.
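
If you want to generate these URLs from a script, something like the following should do. It is a small sketch in plain sh; wais_url is a hypothetical helper name, and since no escaping is performed it assumes a simple single-word query.

```shell
#!/bin/sh
# Build a wais URL from host, port, database name, and optional query.
# No escaping is done, so this sketch assumes a single-word query.
wais_url() {
    host=$1
    port=$2
    db=$3
    query=${4:-}
    if [ -n "$query" ]; then
        printf 'wais://%s:%s/%s?%s\n' "$host" "$port" "$db" "$query"
    else
        printf 'wais://%s:%s/%s\n' "$host" "$port" "$db"
    fi
}

wais_url machine 2010 marc          # prints wais://machine:2010/marc
wais_url machine 2010 marc crufty   # prints wais://machine:2010/marc?crufty
```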

Mosaic, WAIS, and Gateways

Historically, World Wide Web clients like Mosaic have accessed WAIS servers through a gateway.

A WAIS gateway, in this context, is a server that accepts a query from a Web client via HTTP, issues a query to a WAIS database on behalf of the client, post-processes the results of querying the WAIS database, and returns the information to the Web client (again via HTTP). The purpose of this is to provide access to WAIS databases by clients that do not speak the WAIS protocol natively.

With Mosaic 2.0 and some of the other more advanced Web clients coming along now, the rules are changing, since it is now possible for a single client to access both the normal range of Web servers (HTTP, Gopher, FTP, NNTP) and WAIS servers, without requiring a gateway at any stage of the information retrieval process.

But many Web clients still don't have native WAIS support -- two good examples are NCSA Mosaic for the Mac version 1.0 and NCSA Mosaic for Microsoft Windows version 1.0. Those clients still must go through a WAIS gateway, as must any instance of Mosaic for X version 2.0 that isn't compiled with native WAIS support.

The big catch here is that, at the present time, the WAIS gateways available on the network don't do a good job of providing full access to WAIS databases. In particular, access to anything other than plain text files is likely not to work, and multiformat query responses (see below) will not work.

The solution is to write a better WAIS gateway, probably based on the native WAIS support in Mosaic 2.0. We'll probably do that at some point, but it isn't done yet (that I know of).

So what do you do if you want to provide WAIS databases to people using various Web clients, some of which don't support native WAIS?

Web clients without native WAIS access should be set up to automatically use one of the public WAIS gateways (probably either NCSA's or CERN's) to handle wais URLs.

Mosaic for X version 1.2 and earlier did not do this properly, for which we are ashamed, but Mosaic for X 2.0 will do this properly if it's not compiled with direct WAIS support.

What this means is that a wais URL that looks like the following:

    wais://cnidr.org:210/directory-of-servers

... should be automatically converted to a URL that looks something like the following:

    http://www.ncsa.uiuc.edu:8001/cnidr.org:210/directory-of-servers

Note that www.ncsa.uiuc.edu:8001 is the address of the public NCSA WAIS gateway; everything after the first single slash in this URL is exactly the same as in the original wais URL. This should give the gateway all the information it needs to access the specified WAIS database and provide the non-native-WAIS client with the equivalent of direct access, with a minor performance hit.
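
The rewriting rule just described is mechanical enough to sketch in a few lines of sh. The function name wais_to_gateway_url is hypothetical; the gateway address is the NCSA one given above, and passing other URL schemes through unchanged is an assumption of this sketch.

```shell
#!/bin/sh
# Public NCSA WAIS gateway address, as given in the text above.
GATEWAY="http://www.ncsa.uiuc.edu:8001"

# Rewrite wais://host:port/db... as http://gateway/host:port/db...
# URLs with any other scheme are passed through unchanged.
wais_to_gateway_url() {
    case "$1" in
        wais://*) printf '%s/%s\n' "$GATEWAY" "${1#wais://}" ;;
        *)        printf '%s\n' "$1" ;;
    esac
}

wais_to_gateway_url "wais://cnidr.org:210/directory-of-servers"
# prints http://www.ncsa.uiuc.edu:8001/cnidr.org:210/directory-of-servers
```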

So, that's a stopgap solution that will provide transparent access to at least text files in WAIS databases by a wide range of Web clients.

One final note: If you happen to be using a Web client that is lacking both native WAIS support and the ability to automatically feed wais URLs through a gateway, your remaining option is to explicitly use the http form of wais URLs as shown above. This is not a good solution, and hopefully it will no longer be necessary in the very near future.

Indexing Existing Hierarchies of HTML Documents

More and more frequently, people are using WAIS to index existing hierarchies of HTML documents and associated text documents, images, audio clips, animations, etc.

A big problem here is that when WAIS as it currently exists is used as the search and retrieval engine for existing sets of HTML documents, any and all relative links and relative pointers to inlined images in the indexed HTML documents will break.

Why is this? Well, when you retrieve an HTML document from a WAIS server, the URL corresponding to that document will be an encoded WAIS "docid", or document identifier. This docid is not the same thing as the path and filename of the file that you're retrieving. (In fact, it looks like a horribly mangled stream of random and spurious bytes -- its structure and meaning are definitely not transparent at the user level.)

So, when an HTML document containing a relative link or inlined image pointer is pulled over via WAIS, and Mosaic tries to resolve the relative link into an absolute URL by combining it with the URL for the current document ... well, it just doesn't work, because that URL is an opaque docid rather than a path the link can be resolved against.
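
To make the failure concrete, here is a deliberately naive sketch of relative-link resolution in sh. The function resolve_relative is a hypothetical simplification of what a client does, and OPAQUE-DOCID is a made-up stand-in for a real encoded docid, which looks nothing like this.

```shell
#!/bin/sh
# Naive relative-link resolution: drop everything after the last slash
# of the base URL and append the relative reference.
resolve_relative() {
    printf '%s/%s\n' "${1%/*}" "$2"
}

# With an http URL that mirrors the file layout, the result is a real file.
resolve_relative "http://machine/docs/guide/intro.html" "figure.gif"
# prints http://machine/docs/guide/figure.gif

# With a WAIS document URL the last component is an opaque docid
# (OPAQUE-DOCID is a hypothetical stand-in), so the result names nothing.
resolve_relative "wais://machine:2010/marc/OPAQUE-DOCID" "figure.gif"
# prints wais://machine:2010/marc/figure.gif
```

The second result is a syntactically plausible URL, but the WAIS server has no document by that name, since documents are addressed by docid and not by path.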

One near-term but generally undesirable solution is to always use absolute URLs for hyperlinks and inlined images in all HTML documents on your server.

The real solution is for HTTP servers (which, of course, commonly use URLs that correspond exactly to directory and file names and therefore allow relative links to freely work) to use WAIS as a search engine only -- and to make sure that URLs given to browsers as the results of searches are ordinary http URLs.

This is completely technically possible and will be more and more common in the very near future. An experimental WAIS back-end interface that provides this functionality is known to exist for Plexus, and either that interface or something similar will eventually be made available for NCSA httpd (and presumably other HTTP servers). I'll attempt to stay up to date on the progress of these efforts and roll the results of ongoing work into this tutorial.

One more thing: WAIS is evolving towards greater separation of indexing and retrieval. It should eventually be possible to have WAIS itself return arbitrary URLs (matching, say, the actual directory and file names of files it indexes), which would allow relative links to work. This is an intriguing idea because it would mean that you could potentially run a complete standard Web service with nothing but a single WAIS server.

(See experimental information on integrating WAIS and HTTP servers.)

Multiformat Query Responses

CNIDR's freeWAIS 0.202 (or later) has the ability to index multiple files of varying types under the same umbrella in such a way that a user query may, for example, be made by searching a set of text files, but a response will consist of a matching text file plus a GIF image plus an audio file.

This is a useful capability if, for example, you have a set of images, each of which has a corresponding text description. You can set up your WAIS database in such a way that the text descriptions are searched, but appropriate images are given to the user as a result of successful search hits in the text descriptions.

The following describes how to set up a WAIS server to return multiformat responses. We'll assume you're using the doindex script and directory structures as given in the examples above.

Create a directory called ~/multifluff. This is where you'll put all files to be indexed with WAIS's multiple format support.

A condition of freeWAIS's multiformat support is that the various files follow certain file name and extension conventions very closely.
We'll assume, for this example, that you have a set of text files; each text file has either an associated GIF image, an associated PostScript document, or both a GIF image and a PostScript document.
You will give all the text files the extension ".TEXT", all the GIF files the extension ".GIF", and all the PostScript files ".PS". Note use of uppercase.
It is assumed that related files have the same name, with the exception of the extension -- in other words, "foobar.TEXT" and "foobar.GIF" will be considered to be related. "blargh.TEXT" and "blorf.GIF", however, will not.

Place the various text files and associated GIF and PostScript files in ~/multifluff. Be sure they have appropriate filenames and extensions, as described above: filenames match for related files; extensions are ".TEXT", ".GIF", and ".PS".

Add the following lines to the end of your doindex script:

# Go to the directory containing the files in multiple formats.
cd ~/multifluff

# Index *.TEXT and associate *.GIF and *.PS.
waisindex -a -d ~/localwais/sources/marc -T TEXT -M TEXT,GIF,PS *.TEXT

Note the use of the -M argument to waisindex: the types in the comma-separated list following -M are used by the indexer to determine how to tie different files in ~/multifluff together. A given query will be able to return a matching TEXT file as well as an associated GIF image (if one exists with the same filename and extension ".GIF") and an associated PS document (if one exists with the same filename and extension ".PS").

Example: Here's an example set of files that you might place in ~/multifluff:

  crufty.GIF
  crufty.TEXT
  maybe-marc.GIF
  maybe-marc.PS
  maybe-marc.TEXT
  tarot.PS
  tarot.TEXT

After you index these files as described above, a query on "crufty" should return a hit corresponding to "crufty.TEXT". When you access that hit, Mosaic should tell you that you are at a "Multiple Format Opportunity" and present you with a menu from which you can choose TEXT, GIF, or PS.
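
Since the multiformat support depends so heavily on file naming, it can be worth checking a directory before indexing it. The following sh sketch reports, for each ".TEXT" file, which related ".GIF" and ".PS" files share its name. It only inspects file names and is not part of freeWAIS; list_multiformat is a hypothetical helper.

```shell
#!/bin/sh
# For each *.TEXT file in a directory, report which related formats
# exist (same base name, with extension .GIF or .PS).
list_multiformat() {
    dir=$1
    for text in "$dir"/*.TEXT; do
        [ -e "$text" ] || continue      # directory has no .TEXT files
        base=${text%.TEXT}
        formats="TEXT"
        [ -e "$base.GIF" ] && formats="$formats GIF"
        [ -e "$base.PS" ] && formats="$formats PS"
        echo "$(basename "$base"): $formats"
    done
}
```

Run as list_multiformat ~/multifluff against the example files above, this should report TEXT and GIF for crufty, all three formats for maybe-marc, and TEXT and PS for tarot.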

Works for me! :-)

Important Note: Mosaic for X version 2.0 compiled with direct WAIS support is the only Web client known to actually handle multiformat responses. The modifications we made to the common Web library WAIS code to make this happen should be easy to roll into other clients, but to our knowledge no one has yet done so, and certainly no gateways will be able to handle multiformat responses.

However, this is quite a powerful capability, and if you are able to assume use of Mosaic for X, we certainly suggest you give it a shot and see if it works for you.

Assorted Closing Notes

You can re-index a WAIS database at any time, without having to restart waisserver.
If you have questions about using Mosaic with WAIS and you have read this tutorial in detail, please send email to mosaic-x@ncsa.uiuc.edu and we'll try to help you. You can also post questions to the Usenet newsgroup comp.infosystems.www, which the Mosaic authors read.
If you have questions about WAIS in general, or questions concerning the creation and maintenance of WAIS databases, try posting to the Usenet newsgroup comp.infosystems.wais.