Simple word indexer (19)
Ranking search results
Yesterday and today, I implemented a result ranking for my siworin search engine. Before that change, search results were presented in the order in which they were found. That often wasn’t optimal. So now I collect results before displaying them, assign a ranking to each, sort on that ranking, and then present the result, highest ranking first.
There is a new module consisting of siworin-ranking.h
   and siworin-ranking.c, that is in
   between siworin-match.c and siworin-displ.c.
   Where in siworin-match.c in function find_matches
   I used to call function siworin_displaymatches, instead
   I now call function siworin_ranking_add1hit of source file
   siworin-ranking.c.
There, the search results are collected in a malloc’ed and
   realloc’ed table, which was prepared by the function
   siworin_ranking_init. Once all results for the current search
   are there, the table is qsorted on rank, then function
   siworin_ranking_finalise traverses the table and repeatedly
   calls siworin_displaymatches to display the search results.
The ranking algorithm is rather simple:
- More words are better than fewer words.
- Infrequent words are special and therefore more relevant than trivial, frequent words.
- Adjacent words provide better hits than more distant words.
Memory management
In this C source siworin-ranking.c, I do something that may
   seem strange and dangerous (but I am convinced that it isn’t): I write
   binary values, of pointers and integers, into records that are not defined
   in C as those data types, but as a malloc’ed area of unsigned
   characters instead. I use the standard library function memcpy
   to do that.
To read back the data, in the statement
thisone = (struct hfile_wordvector_s *)
   (hitstab + i * binrecsize + sizeof (rank_t));
I cast a calculated pointer to unsigned char to a pointer to
   a structure struct hfile_wordvector_s, defined in
   siworin-match.h.
   That pointer I pass to function siworin_displaymatches, which
   then uses the data it points to (directly and indirectly) as if nothing strange
   had happened.
The reason for doing it the way I did it, is in the data structure
   for the search words and the matches, as created by function
   setup_matchvectors in source file siworin-match.c.
   The data structure can be mentally depicted by the words being next to each
   other horizontally.
   Each word has a pointer to a vertical array of locations in the HTML files,
   encoded as file number and byte offset. Each word also has an index number
   associated with it, into that locations array, so as to encode the match
   currently under investigation.
The number of words is variable between searches, but within one search it is always the same. Perhaps under newer standards of C, you could solve this with variable-size arrays. But I never understood how that works, and I’m not really interested in learning it.
The order of the words etc. can change even within a single search, because
   of the repeated qsort done in siworin-match.c,
   which means the index increment to find new potential matches needs only be
   done on the very first element.
I think my solution is legal in C, safe and portable, because the binary
   data is written, cast and read within the same C module (source file),
   so also within the same program and the same hardware platform. By
   consistently using sizeofs and a typedef, no
   assumptions are made about the size of pointers, floating point numbers
   and integers, nor about data alignment, nor about endianness. So nothing
   can ever go wrong.
I think. But do feel free to disagree.
Copyright © 2022 by R. Harmsen, all rights reserved.