SWISH 1.2.1
Configuration
SWISH configuration is done at three levels - compile time, the configuration file, and with command line options. This chapter focuses on the configuration file, and touches on the command line options. For information on compile time configuration, see the Installation Guide.
Future releases of SWISH will hopefully provide configuration capability to override all compile-time options.
The SWISH configuration file
You can specify variables and values in the configuration file by typing the variable name (it's not case sensitive), a space (tabs are OK), and the value you want for the variable. If the value has spaces, you can enclose it in quotes to keep the spaces. If you want to specify multiple values, separate the values with a single space. In the configuration file, lines beginning with a hash mark (#) and blank lines are ignored.
The configuration variables are:
- AsciiEntities
- DefaultRule
- DocUrl
- EmphasizeComments
- EmphasizeMetaTags
- FileRules
- FollowSymLinks
- IgnoreAllC
- IgnoreAllN
- IgnoreAllV
- IgnoreLimit
- IgnoreRowC
- IgnoreRowN
- IgnoreRowV
- IgnoreSame
- IgnoreWords
- IndexAdmin
- IndexDescription
- IndexDir
- IndexFile
- IndexName
- IndexOnly
- IndexPointer
- IndexReport
- IndexTags
- MaxHits
- MaxWordLimit
- MinWordLimit
- NoContents
- ReplaceRules
- TitleTopLines
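The syntax rules described above (case-insensitive variable names, whitespace-separated values, quoted values that keep their spaces, and ignored comment and blank lines) can be illustrated with a small Python sketch. This is not swish's own parser; the function name parse_config is hypothetical, and using Python's shlex module for quote handling is an assumption that only approximates how swish treats quotes.

```python
import shlex


def parse_config(text):
    """Parse configuration text into a list of (variable, values) pairs.

    Illustrative only: this mimics the rules in this chapter, not swish's
    actual parsing code.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines beginning with a hash mark are ignored.
        if not line or line.startswith("#"):
            continue
        # shlex.split keeps the contents of a quoted value together,
        # so a quoted value retains its spaces.
        tokens = shlex.split(line)
        # Variable names are not case sensitive, so normalise to lower case.
        name, values = tokens[0].lower(), tokens[1:]
        entries.append((name, values))
    return entries


sample = (
    "# which files to index\n"
    "\n"
    "IndexDir /usr/local/www /src/code.html\n"
    'indexname "My Web Index"\n'
    "IndexReport\t3\n"
)
parsed = parse_config(sample)
```

Here parsed holds one pair per directive; the same variable may appear more than once (as with IndexDir), which is why a list of pairs is used instead of a dictionary.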
Basic index variables
- IndexDir
directory
The IndexDir variable tells swish what directories and files to index. Each specified directory will be indexed recursively. You can use more than one of these directives - here are some examples:
IndexDir /usr/local/www /src/code.html
IndexDir /users/tony/public_html/home.html /web
There is no default.
- IndexFile
indexfile
The IndexFile variable tells swish what file to save the indexed results in. Indexes generated by swish should have a suffix of .swish.
The default is (in some cases) index.swish, but you should generally specify a file either via this directive or on the command line (see below).
- IndexOnly
.suffix1 .suffix2 .suffix3 ...
Only files with these suffixes will be indexed. If you omit this variable, swish will index every file it comes across. Suffix checking is not case sensitive.
- IndexReport
value
This variable specifies the verbosity level while SWISH is indexing. It can take a numerical value from 0 to 3. Specify 0 for completely silent operation and 3 for detailed reports. If no value is given then 0 is assumed.
verbosity 0 - silent running
Level 0 is silent for normal operation. Only errors are reported.
verbosity 1 - normal details
Level 1 lets you know the bare statistics, as in the following (real) example from http://www.rru.com/ :
Removing very common words... no words removed.
Writing main index... 10611 unique words indexed.
Writing file index... 408 files indexed.
Running time: 15 seconds.
Indexing done!
verbosity 2 - directory-level commentary
Level 2 is the same as level 1, except that as swish traverses the directories to be indexed, it reports each directory it enters. Here's a partial, real-life example from the same website:
Checking dir "/www/pages/rru"...
Checking dir "/www/pages/rru/Images"...
Checking dir "/www/pages/rru/RNN"...
Checking dir "/www/pages/rru/RNN/Morgue"...
Checking dir "/www/pages/rru/RNN/NS"...
Checking dir "/www/pages/rru/RNN/Propoganda"...
verbosity 3 - detailed commentary
Level 3 gives all the information of level 2, plus commentary as it indexes each file, telling how many unique words it found in that file. Here's another partial, real-life example from the same website:
In dir "/www/pages/rru/RNN":
headlines.html (168 words)
index.html (202 words)
morgue.html (270 words)
In dir "/www/pages/rru/RNN/Spumoni":
current.html (624 words)
current.txt (740 words)
In dir "/www/pages/rru/RNN/Spumoni/Extra":
staff.html (51 words)
subs.html (59 words)
In dir "/www/pages/rru/RNN/Spumoni/Morgue":
001.html (755 words)
002.html (288 words)
003.html (705 words)
004.html (309 words)
005.html (1124 words)
005s.html (397 words)
007a.html (425 words)
008.html (559 words)
009.html (811 words)
- MaxHits
value
SWISH can return any number of hits, but MaxHits defines the default cutoff limit for a given index.
The built-in default is 50.
- FollowSymLinks
yes/no
Normally swish ignores symbolic links to files when indexing. If you want it to follow such links, define this value as yes, else define it as no.
The default is no.
- NoContents
.suffix1 .suffix2 .suffix3 ...
This variable lets you control what files will have their contents indexed. If a file with a suffix in this list is indexed, only its file name (and not any words in the file) will be indexed. This is useful because normally SWISH will try to index the contents of every file, even files without words (such as images or movies). Suffix checking is case-insensitive.
- IgnoreWords
word1 word2 ...
Here you can specify words to ignore when searching. Usually these words (called stopwords) are words that occur so often in your data that indexing them is not worthwhile. If you specify a word as SwishDefault, it will be replaced with swish's default list - a few hundred very common English words.
- IgnoreLimit
number1 number2
After indexing, swish can automatically tell which words are the most common and omit them from the index according to these parameters. Here are some examples:
1. IgnoreLimit 80 256
Swish will ignore all words that occur in over 80% of the files and that also occur in over 256 different files.
2. IgnoreLimit 50 50
Swish will ignore all words that occur in over 50% of the files and that also occur in over 50 different files.
Using IgnoreLimit and IgnoreWords can help trim the size of your index files considerably - experiment with parameters to see what works best at your site. You can also use IgnoreLimit to limit the CPU resources that searches take.
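The two-part IgnoreLimit test described above (a percentage threshold and an absolute file-count threshold, both of which must be exceeded) can be written out as a short Python sketch. The function name is_too_common and its arguments are hypothetical; this only restates the rule as documented and is not taken from the swish source.

```python
def is_too_common(files_with_word, total_files, percent_limit, file_limit):
    """Return True if a word would be dropped under
    'IgnoreLimit percent_limit file_limit' as described in this chapter.

    Both conditions must hold: the word occurs in more than percent_limit
    percent of the files AND in more than file_limit different files.
    """
    if total_files == 0:
        return False
    percent = 100.0 * files_with_word / total_files
    return percent > percent_limit and files_with_word > file_limit


# With "IgnoreLimit 80 256": a word found in 300 of 350 files is dropped,
# but a word found in 90 of 100 files is kept, because 90 is not over 256.
dropped = is_too_common(300, 350, 80, 256)
kept = not is_too_common(90, 100, 80, 256)
```

The second number matters mostly for small collections: without it, a word appearing in most of a handful of files would be discarded even though it costs little to index.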
- IndexName
"value"
- IndexDescription
"value"
- IndexPointer
"value"
- IndexAdmin
"value"
- DocUrl
"value"
These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.
DocUrl overrides the default URL of the docs. If you install them locally, you can set the URL here.
Using ReplaceRules
When results are returned from swish searches, you may get a bunch of funny pathnames to files that you can't access. Using ReplaceRules, you can specify a series of operations to perform on the pathname result to change it into a URL and other things if you desire.
There are three operations you can specify: replace, append, and prepend. They will be applied to the pathname in the order you've typed these commands. More than one command and its arguments can appear on the same line, but it's easier to read when commands are broken up over a few lines. You can't put a command and its argument(s) on different lines, however.
Here's the syntax:
ReplaceRules replace "the string you want replaced" "what to change it to"
This replaces all occurrences of the old string with the new one.
ReplaceRules prepend "a string to add before the result"
ReplaceRules append "a string to add after the result"
Study the sample configuration file in Appendix C and try things out. You'll find that by having swish return URLs instead of pathnames, you can create interfaces to swish that can allow users to get to the search results over the World-Wide Web.
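To make the effect of the three operations concrete, here is a minimal Python sketch that applies replace, prepend, and append to a pathname in the order given. It is an illustration of the documented behaviour, not swish's code; the function apply_rules, the tuple representation of the rules, and the host name www.example.com are all assumptions made for the example.

```python
def apply_rules(path, rules):
    """Apply (operation, argument...) tuples to a pathname, in order."""
    for op, *args in rules:
        if op == "replace":
            old, new = args
            # All occurrences of the old string are replaced.
            path = path.replace(old, new)
        elif op == "prepend":
            path = args[0] + path
        elif op == "append":
            path = path + args[0]
        else:
            raise ValueError(f"unknown operation: {op}")
    return path


# Hypothetical rules: strip the local document root, then add a host name.
rules = [
    ("replace", "/usr/local/www/", "/"),
    ("prepend", "http://www.example.com"),
]
url = apply_rules("/usr/local/www/docs/index.html", rules)
```

Because the operations run in order, placing the prepend before the replace would give a different result whenever the replaced string also occurs in the prepended text.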
Using FileRules
You can specify certain file directives in the configuration file - any files or directories matching these criteria will be ignored and will not be indexed. Append all of these operations to a FileRules directive:
- pathname contains
string1 string2 string3 ...
Any path names containing exactly these strings, whether they be paths to directories or paths to files, will be ignored. Using this you can avoid indexing temporary directories or private material.
- filename is
filename
Any file name exactly matching the specified file name will be ignored (this is case-sensitive). This cannot be a path.
- filename contains
string1 string2 string3 ...
Any file name containing these strings will be ignored (this is not case-sensitive). This cannot be a path.
- title contains
string1 string2 string3 ...
Any HTML file with a title that contains these strings will be ignored (this is case-insensitive).
- directory contains
string1 string2 string3 ...
Any directory that contains any of these specified file names will be ignored (this is case-insensitive).
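The five FileRules tests listed above can be summarised in one Python sketch. It is only a reading of the descriptions in this section, not swish's implementation: the function is_ignored, the dictionary layout of the rules, and the siblings argument (the names of the entries in the file's directory) are hypothetical, and treating "pathname contains" as case-sensitive is an assumption, since the text does not say.

```python
import posixpath


def is_ignored(path, rules, title="", siblings=()):
    """Return True if any rule says the file at `path` should be skipped.

    rules maps a rule name (such as "filename is") to a list of strings.
    """
    filename = posixpath.basename(path)
    # pathname contains: substring match on the whole path
    # (case sensitivity assumed here).
    if any(s in path for s in rules.get("pathname contains", [])):
        return True
    # filename is: exact, case-sensitive match on the file name only.
    if filename in rules.get("filename is", []):
        return True
    # filename contains: case-insensitive substring match on the file name.
    if any(s.lower() in filename.lower()
           for s in rules.get("filename contains", [])):
        return True
    # title contains: case-insensitive substring match on the HTML title.
    if any(s.lower() in title.lower()
           for s in rules.get("title contains", [])):
        return True
    # directory contains: skip if the directory holds any listed file name.
    wanted = {s.lower() for s in rules.get("directory contains", [])}
    return any(name.lower() in wanted for name in siblings)


# Hypothetical rule set for illustration.
rules = {
    "pathname contains": ["/tmp/"],
    "filename is": ["index.html"],
    "filename contains": [".bak"],
    "title contains": ["construction"],
    "directory contains": [".htaccess"],
}
```

For example, is_ignored("/web/docs/page.BAK", rules) is true because of the case-insensitive "filename contains" test, while is_ignored("/web/docs/Index.html", rules) is false because "filename is" is case-sensitive.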
Advanced index variables
These variables affect indexing by either changing the weight of a word, or by changing what constitutes a word. The defaults should be fine for most people. See the FAQ for suggestions on using these to fine tune your configuration.
- EmphasizeComments
yes/no
Defining this value as yes tells swish to emphasize words found in HTML comments when indexing. This means that words in comments will have a heavier weight when determining a file's score for the word[s]. Defining this value as no tells swish to treat words in comments the same as any other words.
The default is no.
- EmphasizeMetaTags
yes/no
Defining this value as yes tells swish to emphasize words found in META tags when indexing. This means that words in META tags will have a heavier weight when determining a file's score for the word[s]. Defining this value as no tells swish to treat words in META tags the same as any other words.
The default is yes.
- DefaultRule
and/or
This defines the default rule (and or or) to apply when multiple search words are given with no boolean operator.
The default is and.
- MinWordLimit
value
This defines the minimum word length in characters.
The default is 3.
- MaxWordLimit
value
This defines the maximum word length in characters.
The default is 20.
- TitleTopLines
value
This defines how deeply (in lines) swish will search for a TITLE tag.
The default is 8.
- AsciiEntities
yes/no
If this is set to yes swish converts HTML ASCII entity names to their closest ASCII equivalent (for instance, resum&eacute; would become resume). Regardless of this setting, swish will convert numeric entities to their closest ASCII equivalents (for instance, resum&#233; would become resume).
The default is yes.
- IndexTags
yes/no
Normally, all data in tags in HTML files (except for words in comments and META tags) is ignored. If you want to index HTML files with the text within tags and all, define this to be yes. Only in rare cases should this be set to yes.
The default is no.
- IgnoreAllV
yes/no
If set to yes swish will ignore words consisting only of vowels.
The default is yes.
- IgnoreAllC
yes/no
If set to yes swish will ignore words consisting only of consonants.
The default is yes.
- IgnoreAllN
yes/no
If set to yes swish will ignore words consisting only of numbers.
The default is yes.
- IgnoreRowV
value
SWISH will ignore words with more than value consecutive vowels.
The default is 3.
- IgnoreRowC
value
SWISH will ignore words with more than value consecutive consonants.
The default is 4.
- IgnoreRowN
value
SWISH will ignore words with more than value consecutive numbers.
The default is 3.
- IgnoreSame
value
SWISH will ignore words with more than value consecutive, identical characters.
The default is 3.
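The word-filtering variables above combine into a single accept-or-reject decision per word. The following Python sketch shows one way to read them together, using the documented defaults. It is not swish's code: the function names, the choice of "aeiou" as the vowel set (so that y counts as a consonant), and the restriction to ASCII letters and digits are all assumptions made for illustration.

```python
VOWELS = "aeiou"  # assumption: 'y' is treated as a consonant
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
DIGITS = "0123456789"


def longest_run(word, charset):
    """Length of the longest run of consecutive characters from charset."""
    best = current = 0
    for ch in word:
        current = current + 1 if ch in charset else 0
        best = max(best, current)
    return best


def longest_same_run(word):
    """Length of the longest run of one identical character."""
    best = current = 0
    prev = None
    for ch in word:
        current = current + 1 if ch == prev else 1
        prev = ch
        best = max(best, current)
    return best


def is_indexable(word, min_len=3, max_len=20, all_v=True, all_c=True,
                 all_n=True, row_v=3, row_c=4, row_n=3, same=3):
    """Return True if the word passes every filter described above."""
    w = word.lower()
    # MinWordLimit / MaxWordLimit
    if not (min_len <= len(w) <= max_len):
        return False
    # IgnoreAllV / IgnoreAllC / IgnoreAllN
    if all_v and all(ch in VOWELS for ch in w):
        return False
    if all_c and all(ch in CONSONANTS for ch in w):
        return False
    if all_n and all(ch in DIGITS for ch in w):
        return False
    # IgnoreRowV / IgnoreRowC / IgnoreRowN: "more than value" in a row
    if longest_run(w, VOWELS) > row_v:
        return False
    if longest_run(w, CONSONANTS) > row_c:
        return False
    if longest_run(w, DIGITS) > row_n:
        return False
    # IgnoreSame: more than `same` identical characters in a row
    return longest_same_run(w) <= same
```

With these defaults, is_indexable("swish") is true, while "strengths" is rejected for its run of five consonants and "1998" for consisting only of numbers, which suggests why sites that need to search on years or part numbers may want to adjust these settings.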
Building configuration files
There are three basic ways to create configuration files:
- the ez-swish web page interface (basic, user-level configuration files)
- the mkswishconf script (prompts for all configuration parameters)
- roll your own (copy a sample or in-use config file, tweak it)
Each of these has a targeted user base, as explained in the descriptions below.
EZ swish web page interface
The first method is useful when users want to create their own indexes, and don't care about heavy customization of the indexing algorithm. The web page scripts provide reasonable defaults for most things, and let the user fill in a minimum of information. Since the web server cannot, by default, write into a user's directory, the config file is written to a known location (/tmp/$LOGNAME-swish.conf), and the user is prompted to move it into the proper directory. If the user later desires, they can read this (self-documenting) file and play with the various configuration parameters.
To use this, the ez-swish*cgi scripts should be installed in a known, public place, such as http://wherever/cgi-bin/ . The user points a web browser at http://wherever/cgi-bin/ez-swish.cgi and fills in the few blanks there. To get reasonable defaults, the user actually accesses http://wherever/cgi-bin/ez-swish.cgi?username where username is the user's login name. This script, as delivered, assumes the user's web pages will be in $HOME/public_html.
mkswishconf
The mkswishconf script guides a user through creation of a fully customized swish configuration file. The script is invoked with a single parameter, the path of the config file to be created. Examples include:
mkswishconf /www/httpd/conf/swish/acctg.conf
mkswishconf ./swish.conf
This is most useful for people who don't know, or don't have confidence in, any local, UNIX-based editor, or people who need extra handholding.
Roll your own
Most people simply copy one of the example files (in swish/Conf/) and edit that. Since the files are fairly well documented, this will work for most people, especially after reading this manual.
Command line options
Running SWISH with the -z, -h or -? options shows us the command-line options (and a couple of other things we aren't worried about here). This chapter briefly covers configuration options; other options devoted only to searching or indexing are covered in the Users Guide.
Ways to invoke swish:
swish [-i dir file ... ] [-c file] [-f file] [-l] [-v [num]] [-e] [-E]
swish -w word1 word2 ... [-f file1 file2 ...] [-m num] [-t str]
swish -M index1 index2 ... outputfile
swish -D file
swish -V
Options (defaults are in brackets):
-i : create an index from the specified files
-w : search for words "word1 word2 ..."
-t : tags to search in - specify as a string "HBthec" - in head, body, title, header, emphasized, or comments
-f : index file to create or search from [index.swish]
-c : configuration file to use for indexing
-l : follow symbolic links when indexing
-v : verbosity level (0 to 3) [0]
-e : emphasize words in comments when indexing
-E : emphasize words in META tags when indexing
-m : the maximum number of results to return [50]
-M : merges index files
-D : decodes an index file
-V : prints the current version
-m (number) (number of results)
While searching, this specifies the maximum number of results to return. The default is 50. If no numerical value is given, the default is assumed. If the value is 0 or the string all, there will be no limit to the number of results. There is no corresponding MaxHits parameter in the configuration file.
-i directory file ... (files to index)
This specifies the directories and/or files to index. Directories will be indexed recursively. This overrides IndexDir in the configuration file.
There is no default.
-c configfile ... (configuration file)
This specifies the configuration file to use for indexing or searching. You can use this as the only option to swish to do automatic indexing, if all the necessary variables are set in the configuration file.
Many parameters in the configuration file may also be overridden by other command line options.
You can specify multiple configuration files in order to split up common preferences. For instance, you might store a file with the stopwords in it and have multiple other files that have different index file information.
example 1: swish -c swish.conf
example 2: swish -i /usr/local/www -f index.swish -v -c swish.conf
example 3: swish -c swish.conf stopwords.conf
Notes on examples:
1. The settings in swish.conf will be used to index the pages defined therein. Therefore all necessary parameters must be defined in swish.conf.
2. The command-line options override the corresponding parameters in the configuration file.
3. The variables in swish.conf will be read, then the variables in stopwords.conf will be read. Note that if the same variables occur in both files, older values may be added to, written over or ignored, depending on the parameter.
You can also use the same configuration file for multiple indexes, by specifying common parameters in the file and differing parameters (such as directories to index and the index file location) on the command line.
example :
swish -c /www/httpd/conf/swish/users.conf \
      -f /u/meo/public_html/index.swish -i /u/meo/public_html
swish -c /www/httpd/conf/swish/users.conf \
      -f /u/sbo/public_html/index.swish -i /u/sbo/public_html
These commands generate indexes for two separate users' webs, with all other parameters in common. This requires, of course, an IndexName nebulous enough to work for both. Again, you could also use multiple configuration files - one per user and a common file with overall settings.
-f indexfile1 (index file to create)
-f indexfile1 indexfile2 ... (index file[s] to search)
If you are indexing, this specifies the file to save the generated index in, and you can only specify one file. If you are searching, this specifies the index files (one or more) to search from. The default index file is index.swish in the current directory.
-l (symbolic links)
Specifying this option tells swish to follow symbolic links when indexing. This overrides the FollowSymLinks configuration file parameter. The default is no.
-e (emphasize words in comments)
Specifying this option tells swish to emphasize words found in HTML comments when indexing. This means that words in comments will have a heavier weight when determining a file's score for the word[s]. This overrides the EmphasizeComments configuration file parameter. The default is no.
-E (emphasize words in META tags)
Specifying this option tells swish to emphasize words found in META tags when indexing. This means that words in META tags will have a heavier weight when determining a file's score for the word[s]. This overrides the EmphasizeMetaTags configuration file parameter. The default is yes.
-M indexfile1 indexfile2 indexfile3... (index merging)
This allows you to merge two or more index files - the last file you specify on the list will be the output file. Merging removes all redundant file and word data. To estimate how much memory the operation will need, sum up the sizes of the files to be merged and divide by two. That's about the maximum amount of memory that will be used. You can use the -v option to produce feedback while merging and the -c option with a configuration file to include new administrative information in the new index file.
There are no defaults for this option.
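The memory rule of thumb given above for merging (add up the sizes of the index files to be merged, then divide by two) is simple enough to check with a few lines of Python. The function names below are hypothetical, and the result is only the rough estimate the text describes, not a measured figure.

```python
import os


def estimate_merge_memory(sizes_in_bytes):
    """Rough maximum memory for a merge: half the combined input size."""
    return sum(sizes_in_bytes) / 2


def estimate_for_files(paths):
    """Same estimate, reading the sizes of existing index files from disk."""
    return estimate_merge_memory(os.path.getsize(p) for p in paths)


# Two index files of 4,000,000 and 2,000,000 bytes: about 3,000,000 bytes.
estimate = estimate_merge_memory([4_000_000, 2_000_000])
```

Only the input files count toward the estimate; the output file named last on the -M command line does not exist yet and should be left out of the list.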
-D [indexfile] (decode)
This option is provided so you can check the word, file, and maintenance information in index files. You can specify multiple files to decode.
There is no default for this option.
-v [number] (verbosity option)
The -v option can take a numerical value from 0 to 3. Specify 0 for completely silent operation and 3 for detailed reports. This overrides the IndexReport configuration file parameter. If no number is specified, the verbosity is set to 0.
verbosity 0 - silent running
Level 0 is silent for normal operation. Only errors are reported.
verbosity 1 - normal details
Level 1 lets you know the bare statistics, as in the following (real) example from http://www.rru.com/ :
Removing very common words... no words removed.
Writing main index... 10611 unique words indexed.
Writing file index... 408 files indexed.
Running time: 15 seconds.
Indexing done!
verbosity 2 - directory-level commentary
Level 2 is the same as level 1, except that as swish traverses the directories to be indexed, it reports each directory it enters. Here's a partial, real-life example from the same website:
Checking dir "/www/pages/rru"...
Checking dir "/www/pages/rru/Images"...
Checking dir "/www/pages/rru/RNN"...
Checking dir "/www/pages/rru/RNN/Morgue"...
Checking dir "/www/pages/rru/RNN/NS"...
Checking dir "/www/pages/rru/RNN/Propoganda"...
verbosity 3 - detailed commentary
Level 3 gives all the information of level 2, plus commentary as it indexes each file, telling how many unique words it found in that file. Here's another partial, real-life example from the same website:
In dir "/www/pages/rru/RNN":
headlines.html (168 words)
index.html (202 words)
morgue.html (270 words)
In dir "/www/pages/rru/RNN/Spumoni":
current.html (624 words)
current.txt (740 words)
In dir "/www/pages/rru/RNN/Spumoni/Extra":
staff.html (51 words)
subs.html (59 words)
In dir "/www/pages/rru/RNN/Spumoni/Morgue":
001.html (755 words)
002.html (288 words)
003.html (705 words)
004.html (309 words)
005.html (1124 words)
005s.html (397 words)
007a.html (425 words)
008.html (559 words)
009.html (811 words)
-V (version option)
The -V option causes swish to print its version number. The result looks like this:
swish 1.2.1
Last update: 18/Aug/1998