Simple word indexer (8)
1, 10 and 19 September 2021. Continued from the previous.
Word separation (2)
Tags
My word separator or extractor siworin-wordsep.c does not fully parse HTML, XML, XHTML or whatever, but only takes the crude measure of ignoring anything that is between < and >. Most of the time, that works well. However, some text will be missed, although it contains words that should have been found. Examples:
- <meta name="description" content=
 "Text of the short description, some 160 chars max.">
- <meta name="keywords" content="Here is a comma-separated list of keywords">
- <img src="source_of_image_file.jpg" alt="What is there to see in this picture?">
An improvement would be to do real parsing, e.g. using Google’s Gumbo library (which I do already use elsewhere), and handle all the text elements that it finds. However, I have no plans to make such a change.
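The crude measure described above can be sketched in a few lines of C. This is only an illustration under my own assumptions, not the code from siworin-wordsep.c: the function name strip_tags and the choice to put a single space where each tag was are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from siworin-wordsep.c.
   Copies `in` to `out`, replacing everything from '<' up to and
   including the next '>' by one space, so that words on either side
   of a tag do not run together. Returns the length of the result. */
size_t strip_tags(const char *in, char *out, size_t outsize)
{
    size_t n = 0;
    int in_tag = 0;

    assert(in != NULL && out != NULL && outsize > 0);

    for (; *in != '\0' && n + 1 < outsize; in++) {
        if (in_tag) {
            if (*in == '>')
                in_tag = 0;      /* tag ends, resume copying */
        } else if (*in == '<') {
            in_tag = 1;          /* tag starts, ignore its contents */
            out[n++] = ' ';
        } else {
            out[n++] = *in;      /* ordinary text */
        }
    }
    out[n] = '\0';
    return n;
}
```

Applied to <img src="pic.jpg" alt="An old bridge">, this leaves only a single space: the words of the alt text are gone, which is exactly the shortcoming mentioned above.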
Addition 27 January 2022: Later on I made this behaviour switchable by config file, see SIWORIN_DONT_REMOVE_TAGS_WORDSEP and SIWORIN_DONT_REMOVE_TAGS_CONTEXT in siworin-config.h, and dont_remove_tags_wordsep and dont_remove_tags_context in the example config file siworin.conf.
Entities
&shy;
An HTML entity is a sequence that starts with an ampersand (&) and ends with a semicolon (;), with simple ASCII in between. It can be used to encode more complicated characters. This can be done symbolically – &Kcy; is a Cyrillic K, К; &Kappa; is the Greek K, Κ – or by Unicode scalar number: &#8209; or &#x2011; encodes a non-breaking hyphen.
The only entity I still use within a word is &shy;, for an optional hyphen, a suggested place to hyphenate a word if the need arises. I know a controversy exists as to the true nature of &shy;, also known as the ISO 8859-1 and Unicode character 0xAD (U+00AD). Some say it is a soft hyphen, used by layout software to indicate visible hyphens that result from the hyphenation algorithm, so they were not put there by the author of the text. In my opinion though, that makes no sense in HTML. I use &shy; as an optional hyphen.
I normally have my web texts aligned both left and right, with any automatic hyphenation turned off. If the columns have a reasonable width, that works well even for languages that write composite words together, like Dutch and German, so they tend to have some rather long words. However, there are also a lot of short words to make up for that.
Occasionally, but surprisingly rarely, this approach creates text lines in which the words have excessive amounts of white space between them. Then, and only then, I place &shy; hyphenation suggestions, preferably on morphological boundaries, and such that the resulting parts do not differ in length too much. Regardless of the true meaning of &shy;, so far all the browsers that I have used have understood my intention and hyphenated accordingly, clearing the lines of undesirably long stretches of white space.
Now although I consider the string &shy; part of the word, I do not want it to appear in the word as extracted from the text. For example, when in various places on my website I have the words:
verbrandingsmotoren
ver&shy;brandings&shy;motoren
verbrandings&shy;motoren
verbran&shy;dings&shy;motoren
verbran&shy;dingsmotoren
(Dutch for ‘combustion engines’) I do not want all five of those to
   appear as index words, to be found only when entered exactly like
   they are written in the text. Instead, I want just the single word
   ‘verbrandingsmotoren’ to be included, which finds me all
   the other occurrences as well.
This can be achieved by accepting the entity or entities as part of the word, but removing them before writing the word, and its location in HTML, to the file of words. Of course this requires special care when displaying search results with context: the word in the index and the word in the actual HTML may have a different length.
(See function CalcWrdLen, and its call in function
   HighlightWords, in source file
   siworin-displ.c, for a rather
   rudimentary solution.)
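To make the length difference concrete, here is a minimal sketch in C. It is not the actual code of the word separator or of CalcWrdLen; the name remove_shy and its interface are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHY     "&shy;"
#define SHY_LEN 5

/* Hypothetical helper for illustration.
   Copies `raw` (the word as it appears in the HTML) to `clean`,
   leaving out every "&shy;". Returns the length of the cleaned word;
   the length of the word in the HTML is simply strlen(raw). */
size_t remove_shy(const char *raw, char *clean, size_t cleansize)
{
    size_t n = 0;

    assert(raw != NULL && clean != NULL && cleansize > 0);

    while (*raw != '\0' && n + 1 < cleansize) {
        if (strncmp(raw, SHY, SHY_LEN) == 0)
            raw += SHY_LEN;        /* skip the entity */
        else
            clean[n++] = *raw++;   /* keep the character */
    }
    clean[n] = '\0';
    return n;
}
```

For ver&shy;brandings&shy;motoren the index word comes out as verbrandingsmotoren, 19 characters long, whereas the stretch to be highlighted in the HTML is 29 characters. Any code that displays context has to bridge that difference.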
&auml;, &atilde;, &ccedil;, etc.
In HTML, languages like German, French, Spanish, Portuguese and Italian can be written using just plain 7-bit ASCII, and still have all the correct accented letters. Words like überhaupt, Köln, großartig, français, élève, élevé, España, Ibáñez, coração, canções, and pietà would then be encoded as the cumbersome and ugly &uuml;berhaupt, K&ouml;ln, gro&szlig;artig, fran&ccedil;ais, &eacute;l&egrave;ve, &eacute;lev&eacute;, Espa&ntilde;a, Ib&aacute;&ntilde;ez, cora&ccedil;&atilde;o, can&ccedil;&otilde;es and piet&agrave;.
I largely skipped that stage. With the exception of some remnants that I
   now find using my own prototype Siworin search engine, early
   on I made it a habit to write in ISO 8859-1, an encoding that covers
   all the languages I would ever write in, and most that I might ever cite
   a word from. They are spoken in a cross on the map of Europe, from Iceland
   to Albania, and from Finland to the Azores. Or that’s what I
   always said, but
   I now notice that the Canary Islands farther south are better suited.
But this too is a thing of the past: Unicode and UTF-8 now rule.
In the aforementioned earlier version of the word separator, I removed most entities, but kept those for accented letters and their like intact. In the new program, I remove only &shy;, and leave everything else. As a result, the other entities appear in the word list, and they can be found, but only by searching for part of the code. For example, searching for atilde; finds Covilhã in my photo page, where I find it is still written as Covilh&atilde;. But any occurrences of Covilhã properly written in UTF-8 are not found that way. Incorrect, but intended behaviour, because entities are simply not supported.
Full support would mean converting the entities to put them in the word list as UTF-8, and perhaps adding an unaccented version too – as does Hyperestraier, and it does only that. But I’m not gonna implement anything like that, sorry. Too much work, and it violates my simplicity design criterion.
&ndash;, &nbsp;, &#8209;
I often encode n-dashes and m-dashes as entities, &ndash; and &mdash;. But they are rarely adjacent to alphabetic characters. If they are, siworin-wordsep.c will interpret them as part of the word.
The same is true of a non-breaking space, &nbsp;. Examples in Dutch: à&nbsp;charge, ’t&nbsp;kofschip.
A special case is the non-breaking hyphen, &#8209;, which I sometimes use at the start of a suffix I mention, instead of a normal hyphen, which some unfortunate hyphenation algorithm might then put at the end of the line, all on its own. The non-breaking hyphen appears before the first letter (i.e. alphabetic character) of the suffix, so it won’t be included in the word list, because entity inclusion starts only after the first letter.
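The rule that entities count as part of a word only once a first letter has been seen can be sketched as follows. This is a simplified illustration, not the logic of siworin-wordsep.c itself: the names entity_len and first_word are hypothetical, only ASCII letters are treated as alphabetic, and anything shaped like &...; is accepted as an entity without checking that it is a known one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: if s points at something shaped like an entity
   ('&', then '#', letters or digits, then ';'), return its length in
   bytes, otherwise 0. */
static size_t entity_len(const char *s)
{
    size_t i = 1;

    if (s[0] != '&')
        return 0;
    while (s[i] == '#' || isalnum((unsigned char)s[i]))
        i++;
    return (i > 1 && s[i] == ';') ? i + 1 : 0;
}

/* Hypothetical helper: find the first word in `text`. A word starts at
   a letter and continues over letters and entities. Sets *start to the
   first byte of the word and returns its length in bytes, or 0 if the
   text contains no word. */
size_t first_word(const char *text, const char **start)
{
    const char *p = text;
    const char *q;
    size_t e;

    assert(text != NULL && start != NULL);

    /* Before the first letter, entities are skipped like other non-letters. */
    while (*p != '\0' && !isalpha((unsigned char)*p)) {
        e = entity_len(p);
        p += (e > 0) ? e : 1;
    }

    /* From the first letter on, entities are accepted as part of the word. */
    *start = p;
    q = p;
    while (*q != '\0') {
        if (isalpha((unsigned char)*q))
            q++;
        else if ((e = entity_len(q)) > 0)
            q += e;
        else
            break;
    }
    return (size_t)(q - p);
}
```

With a suffix written as &#8209;heid (a made-up example), the word found is heid, without the hyphen entity. With op&ndash;en, the whole stretch of 11 bytes is taken as one word, dash entity included.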
More comments about the word extractor.
Copyright © 2021 by R. Harmsen, all rights reserved.