Simple word indexer (14)

20–22 August 2021. Continued from the previous.

Positioning to a character (2)

As described in the previous episode, this is about pointing into a file that contains UTF-8 text data. From the known start of a word (which of course is the start of a valid character), we want to go back some number of bytes. Because UTF-8 characters can be 1, 2, 3 or 4 bytes long, it cannot be known in advance whether the new position is also the start of a valid character. It might just as well point to the 2nd, 3rd or 4th byte of one. That is not a problem, because it is detectable by trying to read a character from that position.

However, in a test program I wrote to assess the behaviour, I never got that far: the library function fseek went into an infinite loop. This bug was one of my reasons for opting for a more traditional, byte-oriented approach.
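To illustrate what a byte-oriented approach can look like, here is a minimal sketch. It is not the code from the indexer itself: the function name char_start_near and the choice to skip forward (rather than backward) over continuation bytes are assumptions made for this example. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, and it uses the stream purely with byte functions.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from siworin14.c.
 * Seeks to (word_start - nbytes) on a stream that is only ever used with
 * byte functions, then skips forward over UTF-8 continuation bytes.
 * Returns the offset of the first byte that can start a character
 * (or the end-of-file offset), or -1 on error. */
long char_start_near(FILE *fp, long word_start, long nbytes)
{
    long pos = word_start - nbytes;
    int c;

    if (pos < 0)
        pos = 0;
    if (fseek(fp, pos, SEEK_SET) != 0)
        return -1;

    /* A continuation byte has its top two bits set to 10. */
    while ((c = fgetc(fp)) != EOF && (c & 0xC0) == 0x80)
        pos++;

    if (ferror(fp))
        return -1;

    /* Leave the stream positioned at the character start. */
    if (fseek(fp, pos, SEEK_SET) != 0)
        return -1;
    return pos;
}
```

A caller would pass the byte offset of a known word start and the number of bytes to go back, and then continue reading from the returned offset. Note that this only resynchronises on a character boundary; it does not validate the bytes that follow.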

The endless loop occurred under:

Linux Mint 20.1 with kernel 5.4.0-81-generic, gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.34, glibc version 2.31.
Ubuntu Server 20.04 with kernel 5.4.0-48-generic, gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.34, glibc version 2.31.
Ubuntu Server 20.10 with kernel 5.11.0-31-generic, gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.36.1, glibc version 2.33.

In the release notes to version 2.32 and version 2.33, I noticed descriptions that may be akin to this bug, so I was hoping 2.33 would not have it. In the release notes to version 2.34 (which I did not test; UPDATE: I did!) I see nothing similar, so chances are slim that the bug has been fixed in that one.

My test program siworin14.c is in the subdirectory src. It defines a string containing the Portuguese for ‘there is’, Há, with an accented a that takes 2 bytes in UTF-8. Then follows the Greek word for ‘yes’, ναι, 3 characters of 2 bytes each; then the Georgian word for ‘no’, არა (ara), 3 characters of 3 bytes each.

First I test character- and string-based conversions from multibyte to wide character. This works well under all operating systems and libraries. Then I write the string to a disk file named in. That file is closed, and reopened in two different streams, one in character mode (by using traditional functions) and one in wide-character mode (by using the more modern wide-character functions).
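The character-based conversion step can be sketched as follows. This is an illustration, not an excerpt from siworin14.c: it uses mbrtowc with a fresh conversion state, which may differ from the call the test program makes, and the names select_utf8_locale and char_len_at are made up for this example. Which UTF-8 locale names exist is system-dependent, so the two tried here are only guesses.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns nonzero if one of them could be selected. */
int select_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: length in bytes of the multibyte character that
 * starts at s[off], or -1 if no valid character starts there.
 * Assumes a UTF-8 locale is active and off <= strlen(s). */
int char_len_at(const char *s, size_t off)
{
    mbstate_t st;
    wchar_t wc;
    size_t n;

    memset(&st, 0, sizeof st);          /* fresh conversion state */
    n = mbrtowc(&wc, s + off, strlen(s + off) + 1, &st);

    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                      /* invalid or incomplete sequence */
    return (int)n;                      /* 0 for the terminating NUL */
}
```

Calling char_len_at for every offset of the test string should give the same pattern of 1, 2, 3 and -1 values as the single-character lines in the output further down, provided a UTF-8 locale could be selected.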

The fact that the same file is opened in two streams simultaneously is not the cause of the problem. I know because I also tested it with only the wide-character stream. (See also the simpler program described in the next episode.)

The program does an fseek to all byte positions, whether a valid UTF-8 character starts there or not. The infinite loop in fseek starts at an invalid sequence, although it shouldn’t. Correct output, obtained from FreeBSD 12.2, which does not have the bug, is as follows:

retval =  1, cw = 'H'
retval = 13, chars = Há, ναι, არა
retval =  2, cw = 'á'
retval = 12, chars = á, ναι, არა
retval = -1, cw = 'á'
retval = -1, chars = á, ναι, არა
retval =  1, cw = ','
retval = 11, chars = , ναι, არა
retval =  1, cw = ' '
retval = 10, chars =  ναι, არა
retval =  2, cw = 'ν'
retval =  9, chars = ναι, არა
retval = -1, cw = 'ν'
retval = -1, chars = ναι, არა
retval =  2, cw = 'α'
retval =  8, chars = αι, არა
retval = -1, cw = 'α'
retval = -1, chars = αι, არა
retval =  2, cw = 'ι'
retval =  7, chars = ι, არა
retval = -1, cw = 'ι'
retval = -1, chars = ι, არა
retval =  1, cw = ','
retval =  6, chars = , არა
retval =  1, cw = ' '
retval =  5, chars =  არა
retval =  3, cw = 'ა'
retval =  4, chars = არა
retval = -1, cw = 'ა'
retval = -1, chars = არა
retval = -1, cw = 'ა'
retval = -1, chars = არა
retval =  3, cw = 'რ'
retval =  3, chars = რა
retval = -1, cw = 'რ'
retval = -1, chars = რა
retval = -1, cw = 'რ'
retval = -1, chars = რა
retval =  3, cw = 'ა'
retval =  2, chars = ა
retval = -1, cw = 'ა'
retval = -1, chars = ა
retval = -1, cw = 'ა'
retval = -1, chars = ა
retval =  1, cw = '?'
retval =  1, chars =

Pos  0, char 48-H, wide char 00000048-H
Pos  1, char c3-?, wide char 000000e1-á
Line 69, error 86 Illegal byte sequence
Pos  2, char a1-?, wide char ffffffff-?
Pos  3, char 2c-,, wide char 0000002c-,
Pos  4, char 20- , wide char 00000020-
Pos  5, char ce-?, wide char 000003bd-ν
Line 69, error 86 Illegal byte sequence
Pos  6, char bd-?, wide char ffffffff-?
Pos  7, char ce-?, wide char 000003b1-α
Line 69, error 86 Illegal byte sequence
Pos  8, char b1-?, wide char ffffffff-?
Pos  9, char ce-?, wide char 000003b9-ι
Line 69, error 86 Illegal byte sequence
Pos 10, char b9-?, wide char ffffffff-?
Pos 11, char 2c-,, wide char 0000002c-,
Pos 12, char 20- , wide char 00000020-
Pos 13, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 14, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 15, char 90-?, wide char ffffffff-?
Pos 16, char e1-?, wide char 000010e0-რ
Line 69, error 86 Illegal byte sequence
Pos 17, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 18, char a0-?, wide char ffffffff-?
Pos 19, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 20, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 21, char 90-?, wide char ffffffff-?
Pos 22, char 0a-?, wide char 0000000a-?

Under glibc 2.31 and 2.33, after dealing with the plain ASCII ‘H’ and the multibyte ‘á’, the program never reaches line 68 to try to read the invalid UTF-8 follow-up byte on its own. Already at the fseek at line 63, it starts looping without end. Fans start kicking in, for fear of an overheated processor chip.

GNU’s debugger gdb let me debug the program even into the library. That way I found that the loop happens in source file libio/wfileops.c, function adjust_wide_data starting at line 547, the do while loop. Notable source lines that I kept seeing were: libio/wfileops.c line 576, libio/iofwide.c line 189, and iconv/skeleton.c line 399.

I don’t really see why there should be a loop in the first place. And why is iconv involved? I can imagine that when reading or writing characters, when filling a buffer, conversions between multibyte and wide chars need to take place, and that is what iconv is for. But fseek is just positioning a file pointer to a byte position, as a preparation for future (often imminent) reads or writes.

Isn’t
fseek(stream, offset, whence)
always the same as
lseek(fileno(stream), offset, whence)?
Going from man 3 C library functions to man 2 system calls?

But I know it’s easy to say that without having fully studied, or even done, the implementation of the whole of stdio. Things that seem simple, and are simple, can still become complicated when considering all the details. I know that from experience.
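One detail that already keeps fseek from being a bare lseek is stdio's own buffer: after a read, the position the stream reports and the offset of the underlying file descriptor usually differ, so the library has some bookkeeping to do. The sketch below makes that visible on a POSIX system. The function name stdio_vs_fd_offset is made up for this example, and the exact descriptor offset depends on the library's buffer size, so no particular value should be assumed. It says nothing about why glibc loops in the wide-character case.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and lseek() */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical demo helper: read one byte through stdio from the start of
 * the file, then report where stdio says the stream is and where the
 * kernel says the file descriptor is. Returns 0 on success, -1 on error. */
int stdio_vs_fd_offset(FILE *fp, long *stream_pos, off_t *fd_pos)
{
    if (fseek(fp, 0L, SEEK_SET) != 0 || fgetc(fp) == EOF)
        return -1;

    *stream_pos = ftell(fp);                    /* logical position: 1 */
    *fd_pos = lseek(fileno(fp), 0, SEEK_CUR);   /* often further along,
                                                   because stdio read ahead */
    return (*stream_pos < 0 || *fd_pos < 0) ? -1 : 0;
}
```

On a fully buffered stream over a small file, the descriptor offset is typically the file size while ftell reports 1. An fseek therefore has to reconcile the two, which is at least part of the reason it cannot simply forward its arguments to lseek.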

To end this article, an observation about ftell/fseek versus fgetpos/fsetpos. The first two functions work with an exact byte position, which can be awkward in a modern environment that can deal with multibyte characters of variable length. But the functions should nevertheless work correctly.

The other two functions work with a data type fpos_t, the internal contents of which are system-defined, so the programmer should not make assumptions about them. fsetpos should only be called with an fpos_t value validly obtained from an fgetpos call. That ensures that we’re always starting at the start of a multibyte character. It is preferable to work like that whenever possible. But that isn’t always possible, as in the simple word indexer that this series of articles is about.
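The fgetpos/fsetpos pattern looks like this in its simplest form. The helper name peek_byte is made up for this example, and for simplicity it reads a single byte from a byte-oriented stream; the same save-and-restore calls are what one would use around wide-character reads.

```c
#include <stdio.h>

/* Hypothetical helper: read one byte, then return to the position we
 * started from, using an fpos_t that was obtained from fgetpos.
 * Returns the byte read, or EOF at end of file or on error. */
int peek_byte(FILE *fp)
{
    fpos_t here;
    int c;

    if (fgetpos(fp, &here) != 0)    /* remember the current position */
        return EOF;
    c = fgetc(fp);
    if (fsetpos(fp, &here) != 0)    /* go back to exactly that position */
        return EOF;
    return c;
}
```

Because the fpos_t value is never inspected or computed, only stored and handed back, the code makes no assumption about what is inside it.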

Pinpointing the bug