Simple word indexer (9)
12 and . Continued from the previous.
Word separation (3)
Function ftell and efficiency
The purpose of the word extractor is to record each word that is found in an HTML file, together with its location, i.e., the byte count just before having read the first character of the word. Whether a character is the start of a new word, can be determined only after reading and examining it. Then it’s too late. The position must have been noted before the read.
Originally, I simply called ftell before
   reading each character. For single byte characters, that
   is quite fast: ftell simply returns the
   value of an internal counter that keeps track of the
   current byte position in the stdio buffer.
When I switched
   to wide characters, suddenly the program became a lot
   slower, from seconds to a full minute, with the fan audibly
   struggling to avert processor overheating. This happened
   under Linux Mint and Ubuntu Server, but not under FreeBSD.
   Different C library implementation? Different compiler?
   GNU versus clang?
By excluding possible causes step by step, I found out
   with certainty that the cause was really in ftell.
   I wrote a little
   test program
   that for every character repeatedly called ftell
   one thousand times.
   By calling gettimeofday before and after,
   with microsecond resolution, I could determine how long it
   took.
I found that in every 4 kB of input (4096 bytes), the time
   needed for one call of ftell increases from
   about 100 ns to over 4 µs! Up by a factor 40!
   With the start of a new block, every time the time suddenly
   fell back to the low value. A fragment of the output from
   the test program:
c=<, pos=4090 dur 4155 ns c=/, pos=4091 dur 4164 ns c=a, pos=4092 dur 4205 ns c=>, pos=4093 dur 4157 ns c=?, pos=4094 dur 4158 ns c=<, pos=4095 dur 4160 ns c=b, pos=4096 dur 4160 ns c=r, pos=4097 dur 109 ns c=>, pos=4098 dur 102 ns c=?, pos=4099 dur 104 ns c=<, pos=4100 dur 104 ns c=a, pos=4101 dur 106 ns c= , pos=4102 dur 106 ns c=h, pos=4103 dur 107 ns c=r, pos=4104 dur 108 ns c=e, pos=4105 dur 110 ns c=f, pos=4106 dur 110 ns
My attempt to explain this, is that the GNU implementation
   of stdio, when dealing with wide characters,
   maintains a character position counter in the buffer, but
   not a byte position counter. Because on disk, character length
   in bytes can vary between 1 and 4, that is not the same thing.
   Now every time someone wants to know the byte position,
   not the character position, by calling ftell,
   the implementation apparently loops through half the characters
   in the buffer, on average, calculating and counting their
   length in bytes.
I tried to verify this by looking at the code, but I didn’t manage to spot it. Now 4 microseconds is still a very short time, but when this is done for millions of bytes that are read from files, it certainly becomes significant.
My solution was to avoid calling ftell (still
   there as a comment, search: beforeread = ftell(fpi)
   in the
   source file), and replace it with
   wctomb (wide character to multibyte character
   conversion), which as a bonus returns the length in bytes,
   which I use to update my own byte counter. Much more efficient.
   Search: beforeread += wctomb(s, c);. That indeed
   made the program
   fast again.
Update 27 August 2021: test program published, bug reported.
Copyright © 2021 by R. Harmsen, all rights reserved.