        SWISH 1.2.1

     Configuration

     ------------------------------------------------------------------
                 [Index] [Previous Chapter] [Next Chapter]

        * The SWISH configuration file
             o Basic index variables
             o Using ReplaceRules
             o Using FileRules
             o Advanced indexed variables
             o Building configuration files
        * Command line options

     SWISH configuration is done at three levels - compile time, the
     configuration file, and with command line options. This chapter
     focuses on the configuration file, and touches on the command line
     options. For information on compile time configuration, see the
     Installation Guide.

     Future releases of SWISH may provide configuration capability to
     override all compile-time options.

     The SWISH configuration file

     You can specify variables and values in the configuration file by
     typing the variable name (it's not case sensitive), a space (tabs
     are OK), and the value you want for the variable. If the value has
     spaces, you can enclose it in quotes to keep the space. If you
     want to specify multiple values, separate the values with a single
     space. In the configuration file, lines beginning with a hash mark
     (#) and blank lines are ignored.
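
     As a minimal sketch of these rules, a configuration file might
     begin as follows. The paths are the ones used in the IndexDir
     examples in this chapter; the index name is only a placeholder.

     ```text
     # Lines beginning with a hash mark, and blank lines, are ignored.

     # Variable names are not case sensitive, so "indexdir" would
     # work equally well here. Multiple values are separated by spaces.
     IndexDir /usr/local/www /src/code.html

     # A value that contains spaces is enclosed in quotes.
     IndexName "Example Site Index"
     ```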

     The configuration variables are:

        * AsciiEntities
        * DefaultRule
        * DocUrl
        * EmphasizeComments
        * EmphasizeMetaTags
        * FileRules
        * FollowSymLinks
        * IgnoreAllC
        * IgnoreAllN
        * IgnoreAllV
        * IgnoreLimit
        * IgnoreRowC
        * IgnoreRowN
        * IgnoreRowV
        * IgnoreSame
        * IgnoreWords
        * IndexAdmin
        * IndexDescription
        * IndexDir
        * IndexFile
        * IndexName
        * IndexOnly
        * IndexPointer
        * IndexReport
        * IndexTags
        * MaxHits
        * MaxWordLimit
        * MinWordLimit
        * NoContents
        * ReplaceRules
        * TitleTopLines

     ------------------------------------------------------------------

     Basic index variables

        * IndexDir directory

          The IndexDir variable tells swish what directories and files
          to index. Each specified directory will be indexed
          recursively. You can use more than one of these directives -
          here are some examples:

              IndexDir /usr/local/www /src/code.html
              IndexDir /users/tony/public_html/home.html /web

          There is no default.

        * IndexFile indexfile

          The IndexFile variable tells swish what to save the indexed
          results as. Indexes generated by swish should have a suffix
          of .swish.

          The default is (in some cases) index.swish but you should
          generally specify a file either via this directive or on the
          command line (see below).

        * IndexOnly .suffix1 .suffix2 .suffix3 ...

          Only files with these suffixes will be indexed. If you omit
          this variable, swish will index every file it comes across.
          Suffix checking is not case sensitive.

        * IndexReport value

          This variable specifies the verbosity level while SWISH is
          indexing. It can take a numerical value from 0 to 3. Specify
          0 for completely silent operation and 3 for detailed reports.
          If no value is given then 0 is assumed.

          verbosity 0 - silent running

          Level 0 is silent for normal operation. Only errors are
          reported.

          verbosity 1 - normal details

          Level 1 lets you know the bare statistics, as in the
          following (real) example from http://www.rru.com/ :

          Removing very common words... no words removed.
          Writing main index... 10611 unique words indexed.
          Writing file index... 408 files indexed.
          Running time: 15 seconds.
          Indexing done!

          verbosity 2 - directory-level commentary

          Level 2 is the same as level 1, except that as swish
          traverses the directories to be indexed, it reports each
          directory it enters. Here's a partial, real-life example from
          the same website:

          Checking dir "/www/pages/rru"...
          Checking dir "/www/pages/rru/Images"...
          Checking dir "/www/pages/rru/RNN"...
          Checking dir "/www/pages/rru/RNN/Morgue"...
          Checking dir "/www/pages/rru/RNN/NS"...
          Checking dir "/www/pages/rru/RNN/Propoganda"...

          verbosity 3 - detailed commentary

          Level 3 gives all the information of level 2, plus commentary
          as it indexes each file, telling how many unique words it
          found in that file. Here's another partial, real-life example
          from the same website:

          In dir "/www/pages/rru/RNN":
            headlines.html (168 words)
            index.html (202 words)
            morgue.html (270 words)

          In dir "/www/pages/rru/RNN/Spumoni":
            current.html (624 words)
            current.txt (740 words)

          In dir "/www/pages/rru/RNN/Spumoni/Extra":
            staff.html (51 words)
            subs.html (59 words)

          In dir "/www/pages/rru/RNN/Spumoni/Morgue":
            001.html (755 words)
            002.html (288 words)
            003.html (705 words)
            004.html (309 words)
            005.html (1124 words)
            005s.html (397 words)
            007a.html (425 words)
            008.html (559 words)
            009.html (811 words)

        * MaxHits value

          SWISH can return any number of hits, but MaxHits defines the
          default cutoff limit for a given index.

          The built-in default is 50.

        * FollowSymLinks yes/no

          Normally swish ignores symbolic links to files when indexing.
          If you want it to follow such links, define this value as
          yes, else define it as no.

          The default is no.

        * NoContents .suffix1 .suffix2 .suffix3 ...

          This variable lets you control what files will have their
          contents indexed. If a file with a suffix in this list is
          indexed, only its file name (and not any words in the file)
          will be indexed. This is useful because normally SWISH will
          try to index the contents of every file, even files without
          words (such as images or movies). Suffix checking is
          case-insensitive.

        * IgnoreWords word1 word2 ...

          Here you can specify words to ignore when searching. Usually
          these words (called stopwords) are words that occur so often
          in your data that indexing them is not worthwhile. If you
          specify a word as SwishDefault, it will be replaced with
          swish's default list - a few hundred very common English
          words.

        * IgnoreLimit number1 number2

          After indexing, swish can automatically tell which words are
          the most common and omit them from the index according to
          these parameters. Here are some examples:

              1. IgnoreLimit 80 256
              2. IgnoreLimit 50 50

            1. Swish will ignore all words that occur in over 80% of
               the files and that also occur in over 256 different
               files.
            2. Swish will ignore all words that occur in over 50% of
               the files and that also occur in over 50 different
               files.

          Using IgnoreLimit and IgnoreWords can help trim the size of
          your index files considerably - experiment with parameters to
          see what works best at your site. You can also use
          IgnoreLimit to limit the CPU resources that searches take.

        * IndexName "value"
        * IndexDescription "value"
        * IndexPointer "value"
        * IndexAdmin "value"
        * DocUrl "value"

          These variables specify information that goes into index
          files to help users and administrators. IndexName should be
          the name of your index, like a book title. IndexDescription
          is a short description of the index or a URL pointing to a
          more full description. IndexPointer should be a pointer to
          the original information, most likely a URL. IndexAdmin
          should be the name of the index maintainer and can include
          name and email information. These values should not be more
          than 70 or so characters and should be contained in quotes.
          Note that the automatically generated date in index files is
          in D/M/Y and 24-hour format. DocUrl overrides the default URL
          of the docs. If you install them locally, you can set the URL
          here.
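
     Pulling the basic variables together, a small configuration
     might look like the sketch below. The paths, suffix lists and
     quoted strings are hypothetical placeholders rather than
     recommended values; the numeric settings simply restate values
     mentioned above.

     ```text
     # What to index and where to save the result (hypothetical paths).
     IndexDir /usr/local/www
     IndexFile /usr/local/www/index.swish

     # Index only these suffixes; for the image suffixes, index the
     # file name but not the contents.
     IndexOnly .html .txt .gif .jpg
     NoContents .gif .jpg

     # Report bare statistics while indexing; do not follow symlinks.
     IndexReport 1
     FollowSymLinks no

     # Use swish's default stopword list, and also drop words found
     # in over 80% of the files and in over 256 different files.
     IgnoreWords SwishDefault
     IgnoreLimit 80 256

     # Descriptive information stored in the index (placeholders).
     IndexName "Example Site Index"
     IndexDescription "Searchable index of the example web site"
     IndexPointer "http://www.example.com/"
     IndexAdmin "Webmaster (webmaster@example.com)"
     ```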

     ------------------------------------------------------------------

     Using ReplaceRules

     When results are returned from swish searches, you may get
     pathnames to files that users can't access directly. Using
     ReplaceRules, you can specify a series of operations to perform on
     the pathname result to change it into a URL, or into another form
     if you desire.

     There are three operations you can specify: replace, append, and
     prepend. They are applied to the pathname in the order in which
     you've typed them. More than one command and its arguments can
     appear on the same line, but it's easier to read when commands are
     broken up over a few lines. You can't put a command and its
     argument(s) on different lines, however.

     Here's the syntax:

         ReplaceRules replace "the string you want replaced" "what to change it to"
             This replaces all occurrences of the old string
             with the new one.
         ReplaceRules prepend "a string to add before the result"
         ReplaceRules append "a string to add after the result"

     Study the sample configuration file in Appendix C and try things
     out. You'll find that by having swish return URLs instead of
     pathnames, you can create interfaces to swish that can allow users
     to get to the search results over the World-Wide Web.
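
     As a worked sketch, suppose (hypothetically) that the files under
     /usr/local/www are served as http://www.example.com/~site/. Going
     by the syntax above, two rules applied in order would turn a
     pathname into a URL:

     ```text
     # Step 1: swap the filesystem prefix for the web path.
     ReplaceRules replace "/usr/local/www" "/~site"
     # Step 2: put the server name in front of the result.
     ReplaceRules prepend "http://www.example.com"

     # Pathname in the index:  /usr/local/www/docs/index.html
     # After the replace rule: /~site/docs/index.html
     # After the prepend rule: http://www.example.com/~site/docs/index.html
     ```

     Because replace changes all occurrences of the old string, it is
     safest to choose a string that appears only once in each pathname.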

     ------------------------------------------------------------------

     Using FileRules

     You can specify certain file directives in the configuration
     file - any files or directories matching these criteria will be
     ignored and will not be indexed. Append all of these operations to
     a FileRules directive:

        * pathname contains string1 string2 string3 ...

          Any path names containing exactly these strings, whether they
          be paths to directories or paths to files, will be ignored.
          Using this you can avoid indexing temporary directories or
          private material.

        * filename is filename

          Any file name exactly matching the specified file name will
          be ignored (this is case-sensitive). This cannot be a path.

        * filename contains string1 string2 string3 ...

          Any file name containing these strings will be ignored (this
          is not case-sensitive). This cannot be a path.

        * title contains string1 string2 string3 ...

          Any HTML file with a title that contains these strings will
          be ignored (this is case-insensitive).

        * directory contains string1 string2 string3 ...

          Any directory that contains any of these specified file names
          will be ignored (this is case-insensitive).
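
     A sketch of how these directives might appear in a configuration
     file; the strings are illustrative examples only and should be
     replaced with names that occur at your site.

     ```text
     # Skip any path containing one of these strings.
     FileRules pathname contains admin testing
     # Skip files named exactly index.html (case-sensitive).
     FileRules filename is index.html
     # Skip backup copies by file name.
     FileRules filename contains .bak .orig
     # Skip HTML files whose title mentions construction.
     FileRules title contains construction
     # Skip any directory that holds a file with this name.
     FileRules directory contains .htaccess
     ```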

     ------------------------------------------------------------------

     Advanced indexed variables

     These variables affect indexing by either changing the weight of a
     word, or by changing what constitutes a word. The defaults should
     be fine for most people. See the FAQ for suggestions on using
     these to fine tune your configuration.

        * EmphasizeComments yes/no

          Defining this value as yes tells swish to emphasize words
          found in HTML comments when indexing. This means that words
          in comments will have a heavier weight when determining a
          file's score for the word[s]. Defining this value as no tells
          swish to treat words in comments the same as any other words.

          The default is no.

        * EmphasizeMetaTags yes/no

          Defining this value as yes tells swish to emphasize words
          found in META tags when indexing. This means that words in
          META tags will have a heavier weight when determining a
          file's score for the word[s]. Defining this value as no tells
          swish to treat words in META tags the same as any other
          words.

          The default is yes.

        * DefaultRule and/or

          This defines the default rule (and or or) to apply when
          multiple search words are given with no boolean operator.

          The default is and.

        * MinWordLimit value

          This defines the minimum word length in characters.

          The default is 3.

        * MaxWordLimit value

          This defines the maximum word length in characters.

          The default is 20.

        * TitleTopLines value

          This defines how deeply (in lines) swish will search for a
          TITLE tag.

          The default is 8.

        * AsciiEntities yes/no

          If this is set to yes swish converts HTML ASCII entity names
          to their closest ASCII equivalent (for instance,
          resum&eacute; would become resume).

          Regardless of this setting, swish will convert numeric
          entities to their closest ASCII equivalents (for instance,
          resum&#233; would become resume).

          The default is yes.

        * IndexTags yes/no

          Normally, all data in tags in HTML files (except for words in
          comments and META tags) is ignored. If you want to index HTML
          files with the text within tags and all, define this to be
          yes. Only in rare cases should this be set to yes.

          The default is no.

        * IgnoreAllV yes/no

          If set to yes swish will ignore words consisting only of
          vowels.

          The default is yes.

        * IgnoreAllC yes/no

          If set to yes swish will ignore words consisting only of
          consonants.

          The default is yes.

        * IgnoreAllN yes/no

          If set to yes swish will ignore words consisting only of
          numbers.

          The default is yes.

        * IgnoreRowV value

          SWISH will ignore words with more than value consecutive
          vowels.

          The default is 3.

        * IgnoreRowC value

          SWISH will ignore words with more than value consecutive
          consonants.

          The default is 4.

        * IgnoreRowN value

          SWISH will ignore words with more than value consecutive
          numbers.

          The default is 3.

        * IgnoreSame value

          SWISH will ignore words with more than value consecutive,
          identical characters.

          The default is 3.
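
     For reference, the sketch below writes out the defaults listed
     above as explicit settings; a real configuration file would
     normally include only the lines being changed. The sample word
     in the comments illustrates the rule as described and is not
     captured swish output.

     ```text
     DefaultRule and
     MinWordLimit 3
     MaxWordLimit 20
     TitleTopLines 8

     # Skip words made up only of vowels, consonants or numbers.
     IgnoreAllV yes
     IgnoreAllC yes
     IgnoreAllN yes

     # Skip words with more than this many consecutive vowels,
     # consonants, numbers or identical characters. With IgnoreRowC 4,
     # a word such as "strengths" (five consonants in a row, "ngths")
     # would be skipped.
     IgnoreRowV 3
     IgnoreRowC 4
     IgnoreRowN 3
     IgnoreSame 3
     ```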

     ------------------------------------------------------------------

     Building configuration files

     There are three basic ways to create configuration files:

       1. the ez-swish web page interface (basic, user-level,
          configuration files)
       2. the mkswishconf script (prompts for all configuration
          parameters)
       3. roll your own (copy a sample or in-use config file, tweak it)

     Each of these has a targeted user base, as explained in the
     descriptions below.

       1. EZ swish web page interface

          The first method is useful when users want to create their
          own indexes, and don't care about heavy customization of the
          indexing algorithm. The web page scripts provide reasonable
          defaults for most things, and let the user fill in a minimum
          of information. Since the web server cannot, by default,
          write into a user's directory, the config file is written to
          a known location (/tmp/$LOGNAME-swish.conf), and the user is
          prompted to move it into the proper directory. If the user
          later desires, they can read this (self-documenting) file and
          play with the various configuration parameters.

          To use this, the ez-swish*cgi scripts should be installed in
          a known, public place, such as http://wherever/cgi-bin/ . The
          user points a web browser at
          http://wherever/cgi-bin/ez-swish.cgi and fills in the few
          blanks there. To get reasonable defaults, the user actually
          accesses http://wherever/cgi-bin/ez-swish.cgi?username where
          username is the user's login name. This script, as delivered,
          assumes the user's web pages will be in $HOME/public_html .

       2. mkswishconf

          The mkswishconf script guides a user through creation of a
          fully customized swish configuration file. The script is
          invoked with a single parameter, the path of the config file
          to be created. Examples include:

              mkswishconf /www/httpd/conf/swish/acctg.conf
              mkswishconf ./swish.conf

          This is most useful for people who don't know, or lack
          confidence in, any local UNIX-based editor, or people who
          need extra handholding.

       3. Roll your own

          Most people simply copy one of the example files (in
          swish/Conf/) and edit that. Since the files are fairly well
          documented, this will work for most people, especially after
          reading this manual.

     ------------------------------------------------------------------

     Command line options

     Running SWISH with the -z, -h or -? options shows us the
     command-line options (and a couple of other things we aren't
     worried about here). This chapter briefly covers configuration
     options; other options devoted only to searching or indexing are
     covered in the Users Guide.

     Ways to invoke swish:

         swish [-i dir file ... ] [-c file] [-f file] [-l] [-v [num]] [-e] [-E]
         swish -w word1 word2 ... [-f file1 file2 ...] [-m num] [-t str]
         swish -M index1 index2 ... outputfile
         swish -D file
         swish -V

     Options (defaults are in brackets):

         -i : create an index from the specified files
         -w : search for words "word1 word2 ..."
         -t : tags to search in - specify as a string
              "HBthec" - in head, body, title, header,
              emphasized, or comments
         -f : index file to create or search from [index.swish]
         -c : configuration file to use for indexing
         -l : follow symbolic links when indexing
         -v : verbosity level (0 to 3) [0]
         -e : emphasize words in comments when indexing
         -E : emphasize words in META tags when indexing
         -m : the maximum number of results to return [50]
         -M : merges index files
         -D : decodes an index file
         -V : prints the current version

     -m (number) (number of results)

     While searching, this specifies the maximum number of results to
     return. The default is 50. If no numerical value is given, the
     default is assumed. If the value is 0 or the string all, there
     will be no limit to the number of results. The corresponding
     configuration file parameter is MaxHits.

     -i directory file ... (files to index)

     This specifies the directories and/or files to index. Directories
     will be indexed recursively. This overrides IndexDir in the
     configuration file.

     There is no default.

     -c configfile ... (configuration file)

     This specifies the configuration file to use for indexing or
     searching. You can use this as the only option to swish to do
     automatic indexing, if all the necessary variables are set in the
     configuration file.

     Many parameters in the configuration file may also be overridden
     by other command line options.

     You can specify multiple configuration files in order to split up
     common preferences. For instance, you might store a file with the
     stopwords in it and have multiple other files that have different
     index file information.

       example 1: swish -c swish.conf
       example 2: swish -i /usr/local/www -f index.swish -v -c swish.conf
       example 3: swish -c swish.conf stopwords.conf

     Notes on examples:

       1. The settings in swish.conf will be used to index the pages
          defined therein. Therefore all necessary parameters must be
          defined in swish.conf .
       2. The command-line options override the corresponding
          parameters in the configuration file.
       3. The variables in swish.conf will be read, then the variables
          in stopwords.conf will be read. Note that if the same
          variables occur in both files, older values may be added to,
          written over or ignored, depending on the parameter.
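
     To illustrate example 3, the two files might be divided as in the
     sketch below. The file contents are hypothetical; only the file
     names come from the example.

     ```text
     # --- swish.conf: settings specific to one index ---
     IndexDir /usr/local/www
     IndexFile index.swish

     # --- stopwords.conf: preferences shared by several indexes ---
     IgnoreWords SwishDefault
     IgnoreLimit 80 256
     ```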

     You can also use the same configuration file for multiple indexes,
     by specifying common parameters in the file and differing
     parameters (such as directories to index and the index file
     location) on the command line.

       example :
           swish -c /www/httpd/conf/swish/users.conf \
               -f /u/meo/public_html/index.swish  -i /u/meo/public_html
           swish -c /www/httpd/conf/swish/users.conf \
               -f /u/sbo/public_html/index.swish  -i /u/sbo/public_html

     These commands generate indexes for two separate users' webs, with
     all other parameters in common. This requires, of course, an
     IndexName nebulous enough to work for both. Again, you could also
     use multiple configuration files - one per user and a common file
     with overall settings.

     -f indexfile1 (index file to create)
     -f indexfile1 indexfile2 ... (index file[s] to search)

     If you are indexing, this specifies the file to save the generated
     index in, and you can only specify one file. If you are searching,
     this specifies the index files (one or more) to search from. The
     default index file is index.swish in the current directory.

     -l (symbolic links)

     Specifying this option tells swish to follow symbolic links when
     indexing. This overrides the FollowSymLinks configuration file
     parameter. The default is no.

     -e (emphasize words in comments)

     Specifying this option tells swish to emphasize words found in
     HTML comments when indexing. This means that words in comments
     will have a heavier weight when determining a file's score for the
     word[s]. This overrides the EmphasizeComments configuration file
     parameter. The default is no.

     -E (emphasize words in META tags)

     Specifying this option tells swish to emphasize words found in
     META tags when indexing. This means that words in META tags will
     have a heavier weight when determining a file's score for the
     word[s]. This overrides the EmphasizeMetaTags configuration file
     parameter. The default is yes.

     -M indexfile1 indexfile2 indexfile3... (index merging)

     This allows you to merge two or more index files - the last file
     you specify on the list will be the output file. Merging removes
     all redundant file and word data. To estimate how much memory the
     operation will need, sum up the sizes of the files to be merged
     and divide by two. That's about the maximum amount of memory that
     will be used. You can use the -v option to produce feedback while
     merging and the -c option with a configuration file to include new
     administrative information in the new index file.

     There are no defaults for this option.

     -D [indexfile] (decode)

     This option is provided so you can check the word, file, and
     maintenance information in index files. You can specify multiple
     files to decode.

     There is no default for this option.

     -v [number] (verbosity option)

     The -v option can take a numerical value from 0 to 3. Specify 0
     for completely silent operation and 3 for detailed reports. This
     overrides the IndexReport configuration file parameter. If no
     number is specified, the verbosity is set to 0.

     verbosity 0 - silent running

     Level 0 is silent for normal operation. Only errors are reported.

     verbosity 1 - normal details

     Level 1 lets you know the bare statistics, as in the following
     (real) example from http://www.rru.com/ :

     Removing very common words... no words removed.
     Writing main index... 10611 unique words indexed.
     Writing file index... 408 files indexed.
     Running time: 15 seconds.
     Indexing done!

     verbosity 2 - directory-level commentary

     Level 2 is the same as level 1, except that as swish traverses the
     directories to be indexed, it reports each directory it enters.
     Here's a partial, real-life example from the same website:

     Checking dir "/www/pages/rru"...
     Checking dir "/www/pages/rru/Images"...
     Checking dir "/www/pages/rru/RNN"...
     Checking dir "/www/pages/rru/RNN/Morgue"...
     Checking dir "/www/pages/rru/RNN/NS"...
     Checking dir "/www/pages/rru/RNN/Propoganda"...

     verbosity 3 - detailed commentary

     Level 3 gives all the information of level 2, plus commentary as
     it indexes each file, telling how many unique words it found in
     that file. Here's another partial, real-life example from the same
     website:

     In dir "/www/pages/rru/RNN":
       headlines.html (168 words)
       index.html (202 words)
       morgue.html (270 words)

     In dir "/www/pages/rru/RNN/Spumoni":
       current.html (624 words)
       current.txt (740 words)

     In dir "/www/pages/rru/RNN/Spumoni/Extra":
       staff.html (51 words)
       subs.html (59 words)

     In dir "/www/pages/rru/RNN/Spumoni/Morgue":
       001.html (755 words)
       002.html (288 words)
       003.html (705 words)
       004.html (309 words)
       005.html (1124 words)
       005s.html (397 words)
       007a.html (425 words)
       008.html (559 words)
       009.html (811 words)

     -V (version option)

     The -V option causes swish to spit out its version number. The
     result looks like this:

       swish 1.2.1

     ------------------------------------------------------------------
     Last update: 18/Aug/1998
