[BUG?] Re: The Lua utf8 library (Was: Issues: Character 160 ...)

Subject: [BUG?] Re: The Lua utf8 library (Was: Issues: Character 160 ...)
From: Viacheslav Usov <via.usov@ ... >
Date: Thu, 12 Jul 2018 12:59:01 +0200

On Wed, Jul 11, 2018 at 10:59 PM Gregg Reynolds <dev@mobileink.com> wrote:

On Wed, Jul 11, 2018, 1:43 AM Dirk Laurie <dirk.laurie@gmail.com> wrote:
...
From the point of view of the utf8 library, UTF-8 is a reversible way
of mapping a certain subset of strings (which I here call "codons",
borrowing a term from DNA theory) onto a certain subset of 32-bit
integers.

Not even wrong. https://en.m.wikipedia.org/wiki/Not_even_wrong. Utf8 has nothing to do with "a certain subset of 32 bit integers".

Part of the claim that you are trying to refute was "UTF-8 is a reversible way of mapping X onto a certain subset of 32-bit integers." That part is certainly true. The set of all Unicode codepoints is isomorphic to a certain subset of 32-bit integers, [0, 0x10ffff] to be exact, and the whole point of any Unicode encoding, UTF-8 included, is by definition to be a reversible mapping onto the set of Unicode codepoints.
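To make the "reversible mapping" concrete, here is a minimal sketch in C. It is not the code of Lua's utf8 library; the names sketch_encode, sketch_decode and sketch_roundtrips are made up for illustration, and the decoder assumes it is handed a sequence the encoder produced (no validation, and no special handling of surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode codepoint cp into buf.
   Returns the number of bytes written (1..4), or 0 if cp > 0x10FFFF. */
static size_t sketch_encode(uint32_t cp, unsigned char buf[4]) {
    if (cp < 0x80) {                      /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside [0, 0x10FFFF] */
}

/* Hypothetical helper: decode an n-byte sequence produced by sketch_encode.
   Assumes well-formed input; performs no validation. */
static uint32_t sketch_decode(const unsigned char *s, size_t n) {
    assert(n >= 1 && n <= 4);
    if (n == 1)
        return s[0];
    /* Payload mask of the lead byte: n=2 -> 0x1F, n=3 -> 0x0F, n=4 -> 0x07. */
    uint32_t cp = s[0] & (0xFFu >> (n + 1));
    for (size_t i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);  /* append 6 payload bits per byte */
    return cp;
}

/* Returns 1 if cp survives encode followed by decode, 0 otherwise. */
static int sketch_roundtrips(uint32_t cp) {
    unsigned char buf[4];
    size_t n = sketch_encode(cp, buf);
    return n != 0 && sketch_decode(buf, n) == cp;
}
```

Under these assumptions, sketch_roundtrips holds for every value in [0, 0x10ffff], and character 160 (0xA0), the one that started this thread, comes out as the two bytes 0xC2 0xA0.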

Another part was "from the point of view of the utf8 library". utf8 uses int (mixed with unsigned int) internally (see utf8_decode) to represent Unicode codepoints. On most modern platforms int is a 32-bit integer, and there the entire statement is correct as it stands. On some platforms it is longer than 32 bits, but in that case the statement "a certain subset of 32-bit integers" still trivially applies. The Lua integer that utf8 uses externally has either 32 or 64 bits, so "a certain subset of 32-bit integers" is still correct.

utf8 will work correctly if int has at least 22 bits, assuming a signed two's-complement representation. While one could argue that some platforms might have int longer than 21 bits but shorter than 32, I am afraid that, practically, utf8 is broken if int is shorter than 32 bits. I do not think this is intentional, so it probably needs to be fixed.
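The 22-bit figure is just arithmetic: 0x10ffff needs 21 value bits, plus one sign bit for a signed int. A small C sketch of that count follows. The names SKETCH_MAX_CODEPOINT, sketch_bits_needed, sketch_signed_width_needed and sketch_holds_codepoints are hypothetical and not part of Lua; the sketch only checks the range, it says nothing about what else in utf8 might depend on a 32-bit int.

```c
#include <assert.h>
#include <limits.h>

/* Largest Unicode codepoint; long is guaranteed to be at least 32 bits. */
#define SKETCH_MAX_CODEPOINT 0x10FFFFL

/* Number of value bits needed to represent the non-negative value v. */
static int sketch_bits_needed(long v) {
    assert(v >= 0);
    int bits = 0;
    while (v > 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Width of the narrowest signed type that can hold v: value bits + sign bit. */
static int sketch_signed_width_needed(long v) {
    return sketch_bits_needed(v) + 1;
}

/* Can a signed type whose maximum is type_max hold every codepoint?
   Pass INT_MAX to ask the question about int on the current platform. */
static int sketch_holds_codepoints(long type_max) {
    return type_max >= SKETCH_MAX_CODEPOINT;
}
```

For example, sketch_signed_width_needed(SKETCH_MAX_CODEPOINT) gives 22, and sketch_holds_codepoints(32767L) is false, which is the case of a 16-bit int.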

Cheers,