Simple word indexer (12)
14 and , 19 September 2021. Continued from the previous.
Binary search on disk
The standard C library functions lfind,
   lsearch, bsearch, and
   qsort are similar in that they operate
   on an array. The caller should specify the start of
   the array, the number of elements, and the size of
   an element.
In my case of a simple word indexer, I do not have an array in core (core is an old-fashioned expression for internal working memory, or RAM), but disk files instead. I don’t want to read any disk files into memory, I prefer to rely on automatic buffering or disk caching by the operating system.
I soon found that even then I can use the standard
   bsearch. As a base pointer I pass a character
   pointer that I did declare, but that doesn’t point to an array.
   nmemb is the number of lines in each of my index
   files. One file has a fixed number of bytes in each line
   (namely the length of a hexadecimal offset, plus a newline),
   so I can efficiently calculate the number of lines by asking
   fstat for the number of bytes in the file.
As a size, I specify sizeof char,
   which is effectively 1. See the functions
   siworin_handle_srchwrds and
   siworin_calcnumwords in
   siworin-srchind.c.
Now isn’t that dangerous? Passing a pointer to a non-existent
   array? The bsearch function will repeatedly
   calculate a pointer to an array element based on that, and pass
   it to the specified function for comparison with the key: is the
   key greater than the element, less than the element, or equal to
   it? See function siworin_compar in source file
   siworin-srchind.c.
Yes, that is dangerous and illegal if you ever dereference such a pointer. That would almost certainly cause a segmentation fault or a similar catastrophic error. Because the pointer points to an array element that doesn’t exist, an element for which no storage has ever been allocated.
But I never do that. Instead of dereferencing the pointer,
   i.e. access the element, I use the pointer only to calculate
   the difference with the base pointer. In other words, although
   it is a pointer that is passed to the comparison function, I
   use it as an index number. With that index number, I can access
   my disk file (first siworin.wordoffs, then
   siworin.words), and make the comparison between the
   key and the word found on disk.
Strange though it may seem, I think this is safe, and portable, i.e. it works correctly regardless of processor architecture.
Because I allocate siworin_fakebase
   as a global variable, not an automatic
   stack-based variable, it is automatically initialised to all zero
   bits, so the pointer is NULL. Is there any architecture, in which
   you cannot add an integer value up to a few hundred thousands,
   or even millions, to a NULL pointer to obtain something that could
   be a valid pointer? It doesn’t point to validly allocated memory,
   but that is a different matter.
In fact, bsearch adds a proportion of size
   to base, and in the comparison function, I subtract
   base from that result, the reverse operation. How
   could that go wrong? I don’t see it.
So yes, the standard bsearch can be
   used to search an array in memory, but also a disk file.
An array, not a pointer
On second thought (15 August), a pointer being NULL may be seen as a special status, meaning the pointer is invalid for dereferencing. Then doing pointer arithmetic with NULL is rather strange, and potentially dangerous, even if in practice it usually isn’t.
To improve this, instead of a pointer I now declare a character
   array, initialised with an empty string, which of course then
   consists of one character, a null byte. The address of that array
   is then passed to bsearch, and used to calculate
   the index in the comparison function siworin_compar.
This cast pointer points to legitimately allocated storage, even if it has a size of just one byte. So adding an integer to that seems more sensible and more responsible than adding it to NULL.
Test: it works flawlessly. But that doesn’t prove that it always will, in any environment. Yet, it is my opinion that this is safe and portable.
Copyright © 2021 by R. Harmsen, all rights reserved.