Simple word indexer (11)
. Continued from the previous.
Combining
The program siworin-makelst.c
   takes as input sorted lines, each of which contains a word plus
   the place where it occurs: file and byte-offset. The program
   writes a file, called siworin.words by default, which
   has one line per unique word, followed by one or more locations.
   An example of what this looks like was
   already
   given in the previous article.
The byte offset of each line written is recorded in a second
   file (default name is siworin.wordoffs), which has
   a fixed-length line for each word. The only thing in that line
   is the hexadecimal offset of the start of the line in the other
   file. This is meant to ease and simplify the searching algorithm:
   thus the word list can be thought of as an array.
It makes no sense to look up English words like ‘the’, ’and’
   and ‘that’, or a Dutch, Portuguese, Spanish, French or Interlingua
   word such as ‘de’. Instead of working with a stop list, the
   program siworin-makelst.c counts and limits the number
   of occurrences, assuming that the more often a word occurs, the less
   interesting it becomes to look for it.
The hard limit is #defined as SIWORIN_MAX_OCCURR in
   siworin.h, so it is compiled in. This should perhaps
   come from a configuration file. Or it could be calculated as
   a percentage of the total number of words.
Once it is determined that a word has more occurrences than the
   maximum allowed, the previous locations have already been written.
   To undo those writes, the file pointer must be put back to the start
   of the current line, so the data for the next word will overwrite it.
   That data may be shorter than the maximum, in which case a remnant
   of data for the word to be suppressed may still be there. For that
   reason a POSIX-defined ftruncate is done.
At first I did that after every suppressed entry, but that proved
   slow under FreeBSD, although not under Linux Mint and Ubuntu Server.
   To make it fast under any OS, I decided to do the ftruncate
   only once, at the end of the output file. This is in fact enough.
   Anything else must have been already overwritten by data for earlier
   words.
Copyright © 2021 by R. Harmsen, all rights reserved.