On Thu, Jul 12, 2018, 6:00 AM Viacheslav Usov <via.usov@gmail.com> wrote:
On Wed, Jul 11, 2018 at 10:59 PM Gregg Reynolds <dev@mobileink.com> wrote:
On Wed, Jul 11, 2018, 1:43 AM Dirk Laurie <dirk.laurie@gmail.com> wrote:
...
From the point of view of the utf8 library, UTF-8 is a reversible way
of mapping a certain subset of strings (which I here call "codons",
borrowing a term from DNA theory) onto a certain subset of 32-bit
integers.

Not even wrong (https://en.m.wikipedia.org/wiki/Not_even_wrong). UTF-8 has nothing to do with "a certain subset of 32-bit integers".

Part of the claim that you are trying to refute was "UTF-8 is a reversible way of mapping X onto a certain subset of 32-bit integers." That part is certainly true. The set of all Unicode codepoints is isomorphic with a certain subset of 32 bit integers, [0, 0x10ffff] to be exact, and the whole point of any Unicode encoding, including UTF-8, by definition, is a reversible mapping onto the set of Unicode codepoints.
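
To make the "reversible mapping" concrete, here is a minimal sketch in C of the two directions. The names cp_to_utf8 and utf8_to_cp are hypothetical helpers made up for illustration, not the Lua utf8 library's API, and the decoder assumes it is only fed output of the encoder (no validation of malformed input, and surrogates are not rejected, for brevity).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp (0..0x10FFFF) into out.
   Returns the number of bytes written, or 0 if cp is out of range.
   Simplification: surrogates (0xD800..0xDFFF) are not rejected. */
size_t cp_to_utf8(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: inverse of cp_to_utf8 for a sequence of n bytes.
   Assumes well-formed input; returns 0xFFFFFFFF for an unsupported length. */
uint32_t utf8_to_cp(const unsigned char *s, size_t n) {
    switch (n) {
    case 1:
        return s[0];
    case 2:
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    case 3:
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
    case 4:
        return ((uint32_t)(s[0] & 0x07) << 18) |
               ((uint32_t)(s[1] & 0x3F) << 12) |
               ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
    default:
        return 0xFFFFFFFF;
    }
}
```

For example, cp_to_utf8(0x20AC, buf) writes the three bytes E2 82 AC, and utf8_to_cp(buf, 3) gives back 0x20AC. uint32_t is used here only so that the value range is wide enough regardless of how wide int is on the platform.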

What can I say? I am an incorrigibly pedantic weenie, after all. Heh. There are no "32 bit integers" (altho there are base 2 representations of ints with 32 places). UTF-8 is not a mapping from octet seqs to UTF-32 bitstrings. It's just another representation of ints, mutually isomorphic with any other.

Pedantic? Sure, but Unicode itself is very fastidious about this kinda stuff. Codepoints are numbers (abstract); bit patterns are code units.  Unicode expresses codepoints in hex notation, not code units (i.e. they are not "32 bit integers"). Etc.
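
A small illustration of the codepoint/code-unit distinction, as a sketch in C: the same abstract number U+1F600 comes out as the four 8-bit code units F0 9F 98 80 in UTF-8, but as two 16-bit code units in UTF-16. The function name cp_to_utf16 is made up for this example and does not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-16 code units.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not encodable (above 0x10FFFF, or in the surrogate range). */
size_t cp_to_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one code unit equal to the code point. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

With this, cp_to_utf16(0x1F600, u) yields the pair D83D DE00. The codepoint is the same number throughout; only the code units differ between encodings.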

If we had no legacy encodings we prolly would not need this kinda fastidiousness, but since we do, the precision is helpful.
...

While one could argue that some platforms might have int longer than 21 bits but shorter than 32, I am afraid that, practically, utf8 is broken if int is shorter than 32 bits. I do not think this is intentional, so it probably needs to be fixed.
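
As a rough sketch of the width question (this is not taken from the Lua source, and the names below are invented for illustration): the largest codepoint, 0x10FFFF, needs 21 value bits, while the C standard only guarantees that int can hold values up to 32767. A check along these lines would tell you whether plain int is wide enough to hold every codepoint on a given platform. It says nothing about which expressions in the utf8 library actually depend on a 32-bit int.

```c
#include <limits.h>

/* Largest Unicode code point, written as a long constant because
   plain int is not guaranteed to be wide enough to hold it. */
#define MAX_CODEPOINT 0x10FFFFL

/* Hypothetical helper: number of value bits needed to represent v. */
unsigned bits_needed(unsigned long v) {
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Hypothetical helper: 1 if plain int can represent every code point
   on this platform, 0 otherwise. */
int int_holds_all_codepoints(void) {
    return INT_MAX >= MAX_CODEPOINT;
}
```

bits_needed(0x10FFFFUL) is 21, so an int of 22 bits or more (counting the sign bit) would in principle suffice for storing codepoints. Code that shifts or masks on the assumption of 32 bits is a separate matter.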

Nice catch!