Simple word indexer (14)
20–. Continued from the previous.
Positioning to a character (2)
As described in the previous episode, this is about pointing into a file that contains UTF-8 text data. From the known start of a word (which of course we know starts with a valid character), we want to go back some number of bytes. Because UTF-8 text can contain characters that are 1, 2, 3 or 4 bytes long, it cannot be known in advance whether that new position is also the start of a valid character. It might just as well point to the 2nd, 3rd or 4th byte of one. That is not a problem, because it is detectable by trying to read a character from that position.
However, before I could get that far, in a test program I wrote
   to assess the behaviour, the library function fseek
   went into an infinite loop. This bug was one of my reasons for
   opting for a more traditional, byte-oriented approach.
The endless loop occurred under:
- Linux Mint 20.1 with kernel 5.4.0-81-generic, gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.34, glibc version 2.31.
- Ubuntu Server 20.04 with kernel 5.4.0-48-generic, gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.34, glibc version 2.31.
- Ubuntu Server 20.10 with kernel 5.11.0-31-generic, gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1), target: x86_64-linux-gnu, GNU ld (GNU Binutils for Ubuntu) 2.36.1, glibc version 2.33.
In the release notes to version 2.32 and version 2.33, I noticed descriptions that may be akin to this bug, so I was hoping 2.33 would not have it. In the release notes to version 2.34 (which I did not test; UPDATE: I did!) I see nothing similar, so chances are slim that the bug has been fixed in that one.
My test program siworin14.c is in the
   subdirectory src.
   It defines a string containing the Portuguese for ‘there is’,
   Há, with an accented a that takes 2 bytes in UTF-8.
   Then follows the Greek word for ‘yes’, ναι, 3 characters of 2 bytes each;
   then the Georgian word for ‘no’, ara, 3 characters of 3
   bytes each.
First I test character- and string-based conversions from multibyte
   to wide character. This works well under all the OSes and libraries tested.
   Then I write the string to a disk file named in. That
   file is closed, and reopened in two different streams, one in
   character mode (by using traditional functions) and one in
   wide-character mode (by using the more modern wide-character
   functions).
The fact that the same file is opened in two streams simultaneously is not the cause of the problem. I know because I also tested it with only the wide-character stream. (See also the simpler program described in the next episode.)
The program does an fseek to every byte position, whether
   a valid UTF-8 character starts there or not. The infinite loop in
   fseek starts at an invalid sequence, although it shouldn’t.
   Correct output, obtained from FreeBSD 12.2, which does not have the bug,
   is as follows:
retval = 1, cw = 'H' retval = 13, chars = Há, ναι, არა
retval = 2, cw = 'á' retval = 12, chars = á, ναι, არა
retval = -1, cw = 'á' retval = -1, chars = á, ναι, არა
retval = 1, cw = ',' retval = 11, chars = , ναι, არა
retval = 1, cw = ' ' retval = 10, chars = ναι, არა
retval = 2, cw = 'ν' retval = 9, chars = ναι, არა
retval = -1, cw = 'ν' retval = -1, chars = ναι, არა
retval = 2, cw = 'α' retval = 8, chars = αι, არა
retval = -1, cw = 'α' retval = -1, chars = αι, არა
retval = 2, cw = 'ι' retval = 7, chars = ι, არა
retval = -1, cw = 'ι' retval = -1, chars = ι, არა
retval = 1, cw = ',' retval = 6, chars = , არა
retval = 1, cw = ' ' retval = 5, chars = არა
retval = 3, cw = 'ა' retval = 4, chars = არა
retval = -1, cw = 'ა' retval = -1, chars = არა
retval = -1, cw = 'ა' retval = -1, chars = არა
retval = 3, cw = 'რ' retval = 3, chars = რა
retval = -1, cw = 'რ' retval = -1, chars = რა
retval = -1, cw = 'რ' retval = -1, chars = რა
retval = 3, cw = 'ა' retval = 2, chars = ა
retval = -1, cw = 'ა' retval = -1, chars = ა
retval = -1, cw = 'ა' retval = -1, chars = ა
retval = 1, cw = '?' retval = 1, chars =

Pos 0, char 48-H, wide char 00000048-H
Pos 1, char c3-?, wide char 000000e1-á
Line 69, error 86 Illegal byte sequence
Pos 2, char a1-?, wide char ffffffff-?
Pos 3, char 2c-,, wide char 0000002c-,
Pos 4, char 20- , wide char 00000020-
Pos 5, char ce-?, wide char 000003bd-ν
Line 69, error 86 Illegal byte sequence
Pos 6, char bd-?, wide char ffffffff-?
Pos 7, char ce-?, wide char 000003b1-α
Line 69, error 86 Illegal byte sequence
Pos 8, char b1-?, wide char ffffffff-?
Pos 9, char ce-?, wide char 000003b9-ι
Line 69, error 86 Illegal byte sequence
Pos 10, char b9-?, wide char ffffffff-?
Pos 11, char 2c-,, wide char 0000002c-,
Pos 12, char 20- , wide char 00000020-
Pos 13, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 14, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 15, char 90-?, wide char ffffffff-?
Pos 16, char e1-?, wide char 000010e0-რ
Line 69, error 86 Illegal byte sequence
Pos 17, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 18, char a0-?, wide char ffffffff-?
Pos 19, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 20, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 21, char 90-?, wide char ffffffff-?
Pos 22, char 0a-?, wide char 0000000a-?
Under glibc 2.31 and 2.33, after dealing with the plain ASCII
   ‘H’ and the multibyte ‘á’, the program never reaches line
   68 to try to read the invalid UTF-8 continuation byte on its
   own. Already at the fseek at line 63, it starts
   looping without end. Fans start kicking in, for fear of an
   overheated processor chip.
GNU’s debugger gdb let me debug the program even
   into the library. That way I found that the loop happens in
   source file
   libio/wfileops.c, function
   adjust_wide_data starting at line 547, the
   do while loop. Notable source lines that I kept
   seeing were:
   libio/wfileops.c line 576,
   libio/iofwide.c line 189, and
   iconv/skeleton.c line 399.
I don’t really see why there should be a loop in the
   first place. And why is iconv involved? I can imagine
   that when reading or writing characters, when filling a buffer,
   conversions between multibyte and wide chars need to take place,
   and that is what iconv is for. But fseek
   is just positioning a file pointer to a byte position, as a
   preparation for future (often imminent) reads
   or writes.
Isn’t
   fseek(stream, offset, whence)
always the same as
   lseek(fileno(stream), offset, whence)?
That is, just going from the man 3 C library functions to
   the man 2 system calls?
But I know it’s easy to
   say that without having fully studied, or even done, the
   implementation of the whole of stdio. Things
   that seem simple, and are simple, can still become complicated
   when considering all the details. I know that from experience.
To end this article, an observation about
   ftell/fseek versus
   fgetpos/fsetpos. The first two functions work
   with an exact byte position, which can be awkward in a modern
   environment that can deal with multibyte characters of
   variable length. But the functions should nevertheless work
   correctly.
The latter two functions work with a data type
   fpos_t, the internal contents of which are
   system-defined, so the programmer should not make assumptions
   about them. fsetpos should only be called with an
   fpos_t value validly obtained from an
   fgetpos call. That ensures that we’re always
   starting at the start of a multibyte character. It is
   preferable to work like that whenever possible. But it
   isn’t always, as in my simple word indexer example that
   this series of articles is about.
Copyright © 2021 by R. Harmsen, all rights reserved.