Simple word indexer (9)

12 and 26/27 August 2021. Continued from the previous.

Word separation (3)

Function ftell and efficiency

The purpose of the word extractor is to record each word that is found in an HTML file, together with its location, i.e., the byte count just before having read the first character of the word. Whether a character is the start of a new word, can be determined only after reading and examining it. Then it’s too late. The position must have been noted before the read.

Originally, I simply called ftell before reading each character. For single byte characters, that is quite fast: ftell simply returns the value of an internal counter that keeps track of the current byte position in the stdio buffer.

When I switched to wide characters, suddenly the program became a lot slower, from seconds to a full minute, with the fan audibly struggling to avert processor overheating. This happened under Linux Mint and Ubuntu Server, but not under FreeBSD. Different C library implementation? Different compiler? GNU versus clang?

By excluding possible causes step by step, I found out with certainty that the cause was really in ftell. I wrote a little test program that for every character repeatedly called ftell one thousand times. By calling gettimeofday before and after, with microsecond resolution, I could determine how long it took.

I found that in every 4 kB of input (4096 bytes), the time needed for one call of ftell increases from about 100 ns to over 4 µs! Up by a factor 40! With the start of a new block, every time the time suddenly fell back to the low value. A fragment of the output from the test program:

c=<, pos=4090 dur  4155 ns
c=/, pos=4091 dur  4164 ns
c=a, pos=4092 dur  4205 ns
c=>, pos=4093 dur  4157 ns
c=?, pos=4094 dur  4158 ns
c=<, pos=4095 dur  4160 ns
c=b, pos=4096 dur  4160 ns
c=r, pos=4097 dur   109 ns
c=>, pos=4098 dur   102 ns
c=?, pos=4099 dur   104 ns
c=<, pos=4100 dur   104 ns
c=a, pos=4101 dur   106 ns
c= , pos=4102 dur   106 ns
c=h, pos=4103 dur   107 ns
c=r, pos=4104 dur   108 ns
c=e, pos=4105 dur   110 ns
c=f, pos=4106 dur   110 ns

My attempt to explain this, is that the GNU implementation of stdio, when dealing with wide characters, maintains a character position counter in the buffer, but not a byte position counter. Because on disk, character length in bytes can vary between 1 and 4, that is not the same thing. Now every time someone wants to know the byte position, not the character position, by calling ftell, the implementation apparently loops through half the characters in the buffer, on average, calculating and counting their length in bytes.

I tried to verify this by looking at the code, but I didn’t manage to spot it. Now 4 microseconds is still a very short time, but when this is done for millions of bytes that are read from files, it certainly becomes significant.

My solution was to avoid calling ftell (still there as a comment, search: beforeread = ftell(fpi) in the source file), and replace it with wctomb (wide character to multibyte character conversion), which as a bonus returns the length in bytes, which I use to update my own byte counter. Much more efficient. Search: beforeread += wctomb(s, c);. That indeed made the program fast again.

Update 27 August 2021: test program published, bug reported.