Re: The Lua utf8 library (Was: Issues: Character 160 ...)

On Wed, Jul 11, 2018, 1:43 AM Dirk Laurie <dirk.laurie@gmail.com> wrote:

...

From the point of view of the utf8 library, UTF-8 is a reversible way
of mapping a certain subset of strings (which I here call "codons",
borrowing a term from DNA theory) onto a certain subset of 32-bit
integers.

Not even wrong (https://en.m.wikipedia.org/wiki/Not_even_wrong). UTF-8 has nothing to do with "a certain subset of 32-bit integers".

If you're talking about UTF-8, but you're not talking about Unicode, then what are you talking about? I'm not against it, I just don't see what you're after.

Everything else about UTF-8, including its relation to
Unicode and its representation as glyphs, is totally irrelevant.

The two basic functions of the UTF-8 library are

utf8.char -- maps from one or more valid integers to a
concatenation of codons
utf8.codepoint -- maps from a valid concatenation of codons to
one or more integers
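
A minimal sketch of that round trip, assuming Lua 5.3 or later (where
the utf8 library is standard). The three integers are arbitrary
examples chosen for illustration, not values from this thread.

```lua
-- Three integers that encode to 1, 2 and 3 bytes respectively.
local s = utf8.char(72, 228, 8364)

print(#s)           -- 6  (length in bytes)
print(utf8.len(s))  -- 3  (length in codons)

-- i = 1, j = -1 asks for every codon in the string.
local a, b, c = utf8.codepoint(s, 1, -1)
print(a, b, c)      -- 72  228  8364

-- The mapping is reversible: encoding the integers again gives s back.
assert(utf8.char(utf8.codepoint(s, 1, -1)) == s)
```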

From the point of view of the string library, encoding is a reversible
way of mapping one-byte strings (commonly called "characters") onto
the integers 0 to 255. Everything else about strings, including their
representation as glyphs, is totally irrelevant.

The two basic functions of the string library are

string.char -- maps from one or more valid integers to a
concatenation of characters
string.byte -- maps from a concatenation of characters to one or
more integers
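
The same round trip at the byte level, as a small sketch. The sample
values are illustrative only; the last lines assume Lua 5.3 or later
for utf8.char.

```lua
-- Three integers in 0..255, one per character.
local s = string.char(76, 117, 97)
print(s)                         -- Lua

local a, b, c = string.byte(s, 1, -1)
print(a, b, c)                   -- 76  117  97

-- Reversible, just like the utf8 pair.
assert(string.char(string.byte(s, 1, -1)) == s)

-- The string library sees one codon as several characters:
local euro = utf8.char(8364)
print(string.byte(euro, 1, -1))  -- 226  130  172
```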

There is an obvious analogy between codons and characters, already
exploited in the names of the functions utf8.char and utf8.len. The
analogy defines what the (presently non-existent) functions utf8.find,
utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
utf8.gmatch should mean.
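
To make the analogy concrete, here is one possible reading of two of
those functions, built only on the existing Lua 5.3 utf8 library. The
names utf8_sub and utf8_reverse are hypothetical helpers invented for
this sketch; they are not part of the standard library, and other
designs are possible.

```lua
-- Hypothetical analogue of string.sub that counts codons, not bytes.
local function utf8_sub(s, i, j)
  local n = utf8.len(s)
  if not n then error("invalid UTF-8 string") end
  j = j or -1
  -- Negative indices count from the end, as in string.sub.
  if i < 0 then i = n + i + 1 end
  if j < 0 then j = n + j + 1 end
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  local first = utf8.offset(s, i)         -- byte where codon i starts
  local last = utf8.offset(s, j + 1) - 1  -- byte just before codon j+1
  return string.sub(s, first, last)
end

-- Hypothetical analogue of string.reverse that keeps codons intact.
local function utf8_reverse(s)
  local parts = {}
  for _, cp in utf8.codes(s) do
    parts[#parts + 1] = utf8.char(cp)
  end
  local lo, hi = 1, #parts
  while lo < hi do
    parts[lo], parts[hi] = parts[hi], parts[lo]
    lo, hi = lo + 1, hi - 1
  end
  return table.concat(parts)
end

local s = utf8.char(72, 228, 8364)
assert(utf8_sub(s, 2, 3) == utf8.char(228, 8364))
assert(utf8_sub(s, -1) == utf8.char(8364))
assert(utf8_reverse(s) == utf8.char(8364, 228, 72))
-- string.reverse works on bytes, so its result is no longer valid UTF-8.
assert(utf8.len(string.reverse(s)) == nil)
```

Under this reading each function behaves on codons exactly as its
string counterpart behaves on characters, and knows nothing about
glyphs.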

[1] http://lua-users.org/wiki/ZenOfLua
[2] Most modern systems have a way of graphically representing codons,
and even some pairs of codons, as a sequence of glyphs. In many cases
(including the one that sparked off the original thread) the mapping
from glyphs to codons is not unique. This, too, is irrelevant.