2018-07-11 0:31 GMT+02:00 Gregg Reynolds <dev@mobileink.com>:
> On Tue, Jul 10, 2018, 5:17 PM Sean Conner <sean@conman.org> wrote:
>> It was thus said that the Great Gregg Reynolds once stated:
>> > On Tue, Jul 10, 2018, 4:44 PM Dirk Laurie <dirk.laurie@gmail.com> wrote:
>> > > I am merely asking for extra functions along the lines of what the
>> > > utf8 library already does.
>> > > E.g. Sam's examples:
>> > >
>> > > > s1 = "Hélène"
>> > > > s2 = "Hélène"
>>
>>   They look similar, but they are constructed differently.
>>
>> > FYI these look identical on Android.
>> > > If you really do not understand what I mean, I can elaborate.
>> > Please do.
>> > What does "len" mean? Number of Unicode chars or number of bytes?
>>   The number of Unicode code points.  The second one has a letter 'e'
>> followed by a combining accent (I'm not sure which accent is the combining
>> one), thus the different number of Unicode code points.
> Ok, we have "codepoints", "chars", bytes, and heaven knows what else. Is a
> Unicode "codepoint" a byte? No. Is "Unicode codepoint" even meaningful?

The Lua manual refers to Unicode twice.

"The UTF-8 encoding of a Unicode character can be inserted in a
literal string with the escape sequence \u{XXX} (note the mandatory
enclosing brackets), where XXX is a sequence of one or more
hexadecimal digits representing the character code point."

"This library does not provide any support for Unicode other than the
handling of the encoding. Any operation that needs the meaning of a
character, such as character classification, is outside its scope."
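A small sketch (Lua 5.3 or later) makes both statements concrete. It
assumes, as suggested in the quoted text above, that the two strings in
Sam's example differ only in using precomposed versus combining
accents; the exact code points chosen here are an assumption, not taken
from the original messages.

```lua
-- \u{XXX} inserts the UTF-8 encoding of a code point into the literal.
assert("\u{E9}" == "\xC3\xA9")   -- U+00E9 is stored as two bytes

-- Assumed reconstruction of the two strings from the thread:
local precomposed = "H\u{E9}l\u{E8}ne"      -- accented letters as single code points
local combining   = "He\u{301}le\u{300}ne"  -- plain 'e' followed by a combining accent

-- The utf8 library only handles the encoding: it counts code points
-- and knows nothing about the two forms "meaning" the same thing.
print(#precomposed, utf8.len(precomposed))  --> 8   6
print(#combining,   utf8.len(combining))    --> 10  8
print(precomposed == combining)             --> false
```

Deciding that the two strings are "the same name" would need Unicode
normalization, which is exactly the kind of operation the manual places
outside the library's scope.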

OK, that's Unicode out of the way. I am not talking about it.

From the point of view of the utf8 library, UTF-8 is a reversible way
of mapping a certain subset of strings (which I here call "codons",
borrowing a term from DNA theory) onto a certain subset of 32-bit
integers. Everything else about UTF-8, including its relation to
Unicode and its representation as glyphs, is totally irrelevant.

The two basic functions of the UTF-8 library are

    utf8.char       -- maps from one or more valid integers to a
                       concatenation of codons
    utf8.codepoint  -- maps from a valid concatenation of codons to
                       one or more integers
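
As an illustration of that reversibility, here is a minimal round trip
(Lua 5.3 or later). The sample integers are chosen for this example
only.

```lua
-- Integers -> concatenation of codons
local s = utf8.char(72, 233, 108, 232, 110, 101)
print(#s)                          --> 8   (two of the codons are 2 bytes long)

-- Concatenation of codons -> integers (i = 1, j = -1 covers the whole string)
local ints = { utf8.codepoint(s, 1, -1) }
print(table.concat(ints, " "))     --> 72 233 108 232 110 101

-- The mapping is reversible.
assert(utf8.char(table.unpack(ints)) == s)

-- Not every string is a valid concatenation of codons:
-- utf8.len reports the position of the first invalid byte.
print(utf8.len("\xff"))            --> nil 1
```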

From the point of view of the string library, encoding is a reversible
way of mapping one-byte strings (commonly called "characters") onto
the integers 0 to 255. Everything else about strings, including their
representation as glyphs, is totally irrelevant.

The two basic functions of the string library are

    string.char  -- maps from one or more valid integers to a
                    concatenation of characters
    string.byte  -- maps from a concatenation of characters to one or
                    more integers
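
The corresponding round trip for the string library, followed by the
same two bytes seen through both libraries, can be sketched as follows
(sample values are illustrative only):

```lua
-- Integers 0..255 -> concatenation of characters
local s = string.char(72, 105, 33)
print(s)                                   --> Hi!

-- Concatenation of characters -> integers
local bytes = { string.byte(s, 1, -1) }
assert(string.char(table.unpack(bytes)) == s)

-- One codon, two characters: the libraries disagree on the unit.
local e = "\u{E9}"
print(string.byte(e, 1, -1))               --> 195 169
print(utf8.codepoint(e))                   --> 233
print(#e, utf8.len(e))                     --> 2   1
```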

There is an obvious analogy between codons and characters, already
exploited in the names of the functions utf8.char and utf8.len. The
analogy defines what the (presently non-existent) functions utf8.find,
utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
utf8.gmatch should mean.
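
To show what the analogy would imply, here is a rough sketch of two of
those functions written in plain Lua 5.3 on top of the existing utf8
library. The names utf8_sub and utf8_reverse are hypothetical; nothing
like them exists in the standard library, and this is only one possible
reading of "index by codon instead of by character".

```lua
-- Hypothetical helper: like string.sub, but i and j count codons.
local function utf8_sub(s, i, j)
  j = j or -1
  local n = utf8.len(s)
  if not n then error("string is not a valid concatenation of codons") end
  -- Normalize indices the way string.sub does.
  if i < 0 then i = math.max(n + i + 1, 1) elseif i == 0 then i = 1 end
  if j < 0 then j = n + j + 1 elseif j > n then j = n end
  if i > j then return "" end
  local first = utf8.offset(s, i)          -- byte where codon i starts
  local last  = utf8.offset(s, j + 1) - 1  -- byte just before codon j+1
  return s:sub(first, last)
end

-- Hypothetical helper: like string.reverse, but keeps each codon intact.
local function utf8_reverse(s)
  local codons = {}
  for _, code in utf8.codes(s) do
    codons[#codons + 1] = utf8.char(code)
  end
  local n = #codons
  for k = 1, n // 2 do
    codons[k], codons[n - k + 1] = codons[n - k + 1], codons[k]
  end
  return table.concat(codons)
end

local s = "H\u{E9}l\u{E8}ne"
print(utf8_sub(s, 2, 4))        -- codons 2..4
print(utf8_sub(s, -2))          --> ne
print(utf8_reverse(s))          -- codons in reverse order
print(utf8.len(s:reverse()))    --> nil 3  (string.reverse splits codons)
```

The patterns-based functions (find, match, gsub, gmatch) would need
considerably more work than this, since character classes and
quantifiers would have to operate on codons as well.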

[1] http://lua-users.org/wiki/ZenOfLua
[2] Most modern systems have a way of graphically representing codons,
and even some pairs of codons, as a sequence of glyphs. In many cases
(including the one that sparked off the original thread) the mapping
from glyphs to codons is not unique. This, too, is irrelevant.