        SWISH 1.2.1

     FAQ (Frequently Asked Questions)

     ------------------------------------------------------------------

     Questions ...

       1. Swish crashes and burns on a certain file. What can I do?
       2. How do I allow users on the Web to search my indexes?
       3. I want to make my own gateway program.
       4. How can I index all my compressed files?
       5. Can I index 8-bit text?
       6. How can I index phrases?
       7. How can I implement keywords in my documents?
       8. I want to generate a list of files to be indexed and pass it
          to swish.
       9. I run out of memory trying to index my files.
      10. How can I speed up indexing and/or shrink my index file size?
      11. When should I consider merging indexes?
      12. What other features are planned?

     ... and Answers

       1. Swish crashes and burns on a certain file. What can I do?

          You can use a FileRules operation to exclude the particular
          file name, or pathname, or its title. If there are serious
          problems in indexing certain types of files, they may not
          have valid text in them (they may be binary files, for
          instance). You can use NoContents to exclude that type of
          file.

       2. How do I allow users on the Web to search my indexes?

          Good question. You will need a gateway CGI program that
          presents users with a search form and options, calls swish
          with these options, and returns the data to them in a nice
          HTML format. Swish is not meant to do this. One
          swish-compatible gateway you can currently use is W4AIS
          available at http://www.rru.com/~meo/useful/www.html#w4ais.
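
          As a rough illustration of what such a gateway has to do,
          here is a minimal Python sketch. It is not part of the swish
          distribution. It assumes a swish binary on the PATH, a
          hypothetical index file called index.swish, and result lines
          of the form: rank, path, quoted title, size. The -f (index
          file) and -m (maximum results) options are used as I
          understand them from swish's usage message, so check them
          against your copy. All function and constant names are made
          up for this example.

```python
import html
import re
import subprocess

SWISH_BINARY = "swish"       # assumption: swish is on the PATH
INDEX_FILE = "index.swish"   # hypothetical index file name

# Assumed result line layout: rank, path, "title", size in bytes.
RESULT_LINE = re.compile(r'^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)$')


def build_search_command(words, max_results=20):
    """Build an argument list for swish; no shell ever sees user input."""
    # Drop anything that looks like an option so users cannot inject flags.
    terms = [w for w in words.split() if not w.startswith("-")]
    if not terms:
        raise ValueError("no search words given")
    return [SWISH_BINARY, "-f", INDEX_FILE, "-m", str(max_results), "-w"] + terms


def parse_results(output):
    """Return (rank, path, title, size) tuples, skipping all other lines."""
    hits = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            rank, path, title, size = match.groups()
            hits.append((int(rank), path, title, int(size)))
    return hits


def render_html(hits):
    """Format hits as an HTML list, escaping everything that came from swish."""
    if not hits:
        return "<p>No results.</p>"
    items = [
        '<li><a href="%s">%s</a> (rank %d, %d bytes)</li>'
        % (html.escape(path, quote=True), html.escape(title), rank, size)
        for rank, path, title, size in hits
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def search(words, max_results=20):
    """Run swish and return the parsed hits."""
    command = build_search_command(words, max_results)
    finished = subprocess.run(command, capture_output=True, text=True)
    return parse_results(finished.stdout)
```

          A CGI script would read the search words from the submitted
          form, call search(), and print render_html() after a
          Content-type header. Passing an argument list instead of a
          shell command line keeps user input away from the shell.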

       3. I want to make my own gateway program.

          Great! Good gateways can be made that take advantage of
          swish's features. If you do make one, even a simple one,
          please let me know and I can include it in the distribution.

       4. How can I index all my compressed files?

          Swish doesn't currently have the capability to do on-the-fly
          filtering of files. In the meantime, first index the
          uncompressed data, compress it, and using a ReplaceRules
          operation, change the suffix of indexed files to .Z or
          whatever is appropriate. That way users can retrieve the
          compressed information.

       5. Can I index 8-bit text?

          Yes, if the text uses the HTML equivalents for the
          ISO-Latin-1 (ISO8859-1) character set. Upon indexing, swish
          will convert all numbered entities it finds (such as &#169;)
          to named entities (such as &copy;). To search for words
          including these codes, type the named entity (if it exists)
          in place of the 8-bit character. Swish will also convert
          entities to ASCII equivalents, so words that might look like
          this in HTML: resum&eacute; can be searched as this: resume.
          Please read the README file included with the distribution
          for information on changing these options.
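
          To make the two conversions concrete, here is a small Python
          sketch of the same idea. It is only an illustration and is
          not how swish itself is implemented: it leans on Python's
          standard entity tables, and the function names are invented
          for this example.

```python
import re
import unicodedata
from html.entities import codepoint2name, name2codepoint

NUMBERED = re.compile(r"&#(\d+);")
NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numbered_to_named(text):
    """Rewrite &#NNN; as &name; when a named entity exists for that code."""
    def repl(match):
        name = codepoint2name.get(int(match.group(1)))
        return "&%s;" % name if name else match.group(0)
    return NUMBERED.sub(repl, text)


def entities_to_ascii(text):
    """Replace a named entity with its unaccented ASCII form, if it has one."""
    def repl(match):
        code = name2codepoint.get(match.group(1))
        if code is None:
            return match.group(0)          # unknown entity: leave it alone
        # Split an accented letter into base letter + accent, keep the ASCII part.
        decomposed = unicodedata.normalize("NFKD", chr(code))
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return ascii_only or match.group(0)
    return NAMED.sub(repl, text)
```

          With these definitions, numbered_to_named("&#169;") gives
          "&copy;" and entities_to_ascii("resum&eacute;") gives
          "resume". Entities with no ASCII equivalent, such as &copy;,
          are left unchanged by the second function.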

       6. How can I index phrases?

          Currently the only way to do this is to use the HTML entity
          &#32; or &nbsp; (non-breaking space) to represent a space in
          your HTML. It will then be indexed with a space. To search
          for the phrase, you'd have to enter &#32; to represent a
          space also.
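
          If you have a gateway program, it can do this substitution
          for the user. A minimal Python sketch follows; the function
          name is hypothetical:

```python
def phrase_to_query_term(phrase):
    """Join the words of a phrase with &#32; so swish sees one search term."""
    return "&#32;".join(phrase.split())
```

          For example, phrase_to_query_term("ice cream cone") returns
          "ice&#32;cream&#32;cone". If you type such a term on a shell
          command line yourself, quote it, since & and ; are special
          characters to the shell.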

       7. How can I implement keywords in my documents?

          In your HTML files you can put keywords in comments, such as:

            <!-- keywords computer camera -->

          ...then when you search, swish should be called with the -t c
          option, such as:

            swish -t c -w keywords computer

          All documents that contain the words keywords and computer
          in their comments will then be returned. Swish has an option
          in the source code that you can define to give more relevance
          to the words inside comments; if you're doing keywords in
          this fashion, you may want to use that option.

       8. I want to generate a list of files to be indexed and pass it
          to swish.

          One thing you can do is make a simple script to generate a
          configuration file full of IndexDir directives. For instance,
          make a separate file called files.conf and put something like
          this in it:

            IndexDir /this_is_file_1/file.html
            IndexDir /usr/local/www
            IndexDir file2.html /some/directory/
            ...

          Then call swish like this (assuming you're using a main
          swish.conf file):

            swish -c swish.conf files.conf
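
          The "simple script" could look something like the following
          Python sketch. The function name, the default suffix list and
          the files.conf default are assumptions made for this example;
          adjust them to your site.

```python
import os


def write_index_conf(roots, conf_path="files.conf",
                     suffixes=(".html", ".htm", ".txt")):
    """Write one IndexDir line per matching file found under the given roots.

    Returns the number of lines written. File names containing spaces are
    not handled specially, since IndexDir separates its arguments by spaces.
    """
    suffixes = tuple(suffixes)
    count = 0
    with open(conf_path, "w") as conf:
        for root in roots:
            for dirpath, dirnames, filenames in os.walk(root):
                dirnames.sort()                 # walk directories in a stable order
                for name in sorted(filenames):
                    if name.lower().endswith(suffixes):
                        conf.write("IndexDir " + os.path.join(dirpath, name) + "\n")
                        count += 1
    return count
```

          Calling write_index_conf(["/usr/local/www"]) writes
          files.conf in the current directory, ready to be passed to
          swish along with your main configuration file.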

       9. I run out of memory trying to index my files.

          It's true that indexing can take up a lot of memory! One
          thing you can do is make many indices of smaller content
          instead of trying to do everything at once. You can then
          merge all the smaller pieces together.
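
          Here is a hedged Python sketch of that plan. It only builds
          the command lines and does not run them. It assumes that -i
          names the directories to index, -f names the index file, and
          that -M takes the input indexes followed by the output file;
          please confirm these against swish's usage message. The
          partN.swish names and the function name are hypothetical.

```python
def plan_batched_index(directories, batch_size,
                       config="swish.conf", final_index="index.swish"):
    """Return swish argument lists: one indexing run per batch, then a merge."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = [directories[i:i + batch_size]
               for i in range(0, len(directories), batch_size)]
    if len(batches) == 1:
        # Everything fits in one run, so no merge step is needed.
        return [["swish", "-c", config, "-i", *batches[0], "-f", final_index]]
    commands, partials = [], []
    for number, batch in enumerate(batches, start=1):
        partial = "part%d.swish" % number       # hypothetical partial index name
        partials.append(partial)
        commands.append(["swish", "-c", config, "-i", *batch, "-f", partial])
    if partials:
        commands.append(["swish", "-M", *partials, final_index])
    return commands
```

          Each returned list can be handed to subprocess.run() in turn.
          Smaller batches lower the peak memory of each indexing run at
          the cost of more runs and a longer merge.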

      10. How can I speed up indexing and/or shrink my index file size?

          Go through your installation and configuration with a
          fine-toothed comb. Look at the settings in your runtime
          configuration file (these may also be compiled in as defaults
          by modifying config.h):

             o Are you indexing file types you don't really care about?
               (IndexOnly)
             o Are you indexing only the names of files whose contents
               you don't care about, such as binary files, images, etc?
               (NoContents)
             o Are you skipping files and/or directories which you
               might prefer to ignore? (FileRules)
             o Are your limits for words to ignore because they are too
               frequent low enough? (IgnoreLimit) For instance, if your
               site involves heavy duty science or any topic where you
               are primarily interested in items which appear on only a
               few pages, you might set this very low.
             o Are you ignoring words you know should be ignored?
               (IgnoreWords)
             o Are you unnecessarily following symbolic links?
               (FollowSymLinks)
             o Are you searching too deep for TITLE tags?
               (TitleTopLines) For instance, if you know that TITLE
               tags are never more than 4 lines deep, set this value to
               4.
             o Can you eliminate smaller words? Larger words?
               (MinWordLimit, MaxWordLimit) The minimum word size is
               usually most helpful. For instance, do you really need
               to index three letter words? Four letter words? But
               consider both.
             o Can you ignore things that aren't really words, such as
               those which are all vowels (may not apply in Hawaii),
               all consonants (may not apply in Wales), or all digits?
               What about things with long strings of vowels,
               consonants or digits? (IgnoreAllV, IgnoreAllC,
               IgnoreAllN, IgnoreRowV, IgnoreRowC, IgnoreRowN) What
               about the number of times single character can repeat?
               (IgnoreSame)
             o Are you unnecessarily indexing HTML tags? (IndexTags)
             o Are you unnecessarily indexing ASCII entities such as
               &amp;? (AsciiEntities)

          Finally, a few parameters may currently be configured only by
          modifying config.h:
             o Have you defined the characters that comprise a "word"
               in your index narrowly enough? (WORDCHARS, BEGINCHARS,
               ENDCHARS) For instance, if you don't care about strings
               with special characters such as &#233; in them, you can
               eliminate the characters "#;0123456789" (and possibly
               "&") from WORDCHARS, and probably "&" from BEGINCHARS
               and ";" from ENDCHARS. If you don't care about variable
               names in code or other names that end in digits, you
               could remove "0123456789" from ENDCHARS as well.

          Change these to minimize the words indexed, recompile, and
          test by creating a new index and searching against it. (You
          may wish to create this as a separate index from your
          production index until you settle on final parameters.) Do not
          install a newly configured swish binary until you are
          satisfied that it works correctly; test in the build
          directory.
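
          To get a feel for how much a narrower word definition trims,
          you can model the rule outside swish before recompiling. The
          Python sketch below assumes a simplified rule: a word is kept
          when every character is in WORDCHARS, the first is in
          BEGINCHARS, the last is in ENDCHARS, and its length is within
          the word limits. Swish's real logic may differ in detail, and
          the names and the default maximum length of 40 are invented
          for this example.

```python
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits


def is_indexable(word, wordchars, beginchars, endchars, min_len=1, max_len=40):
    """Apply the simplified word rule described above to one candidate word."""
    word = word.lower()
    if not word or not (min_len <= len(word) <= max_len):
        return False
    if word[0] not in beginchars or word[-1] not in endchars:
        return False
    return all(ch in wordchars for ch in word)


def count_indexable(words, **settings):
    """Count how many candidate words survive the given settings."""
    return sum(1 for word in words if is_indexable(word, **settings))


# A permissive setup that lets entities such as &#233; through ...
WIDE = dict(wordchars=LETTERS + DIGITS + "&#;",
            beginchars=LETTERS + DIGITS + "&",
            endchars=LETTERS + DIGITS + ";")
# ... and a narrow one with the entity characters and digits removed.
NARROW = dict(wordchars=LETTERS, beginchars=LETTERS, endchars=LETTERS,
              min_len=3)
```

          Running count_indexable() over a sample of words from your
          own pages with both settings gives a quick estimate of how
          many entries the narrower definition would drop.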

      11. When should I consider merging indexes?

          The most obvious time is when you need sub-indexes. For
          instance, each user might want to have their own index, but
          you might also want an index spanning all users.

          Merging can also help reduce indexing time.

          Merging is necessary when you have sets of pages with
          differing indexing requirements. For instance, a global index
          spanning user pages (words of all sizes), introductory
          material (primarily small words) and sophisticated technical
          material (longer words) would best be implemented by creating
          an index for each area, then merging them.

      12. What other features are planned?

          These are things I have been thinking about (some of these
          came from Kevin). They are highly dependent on my schedule.
             o Parse improvements
             o More optimization
             o A proximity search feature
             o Regular expressions in searches
             o Stemming and soundex matches
             o File filtering
             o A server implementation
             o Distributed server implementation, including a swish:
               scheme, or an implementation of the wais: scheme, or
               perhaps work on a global index: scheme
             o Search META tags
             o Interact with other indexing and meta-indexing systems
             o Easier building, installation and configuration

     ------------------------------------------------------------------
     Last update: 18/Aug/1998
