

There are three levels to keep in mind when dealing with Unicode strings:

1 - The bytes, according to the encoding used (UTF-8, UTF-16 big endian, UTF-16 little endian, UTF-32).
2 - The Unicode code points. One or more bytes combine to form each code point.
3 - And the trickiest of them all, the glyphs. One or more Unicode code points combine to form a single glyph.

Example: This flag "🏴󠁧󠁢󠁥󠁮󠁧󠁿" is composed of 7 Unicode code points; encoded as UTF-8, these code points occupy 28 bytes (4 bytes each).
So a single glyph (the flag) is made up of 7 Unicode code points, or 28 UTF-8 bytes (in UTF-16 it would be 14 16-bit code units).
Many emojis are a combination of more than one code point. And then there are the combining code points: A + ´ (2 Unicode code points) may be presented as "Á" by text editors and other text renderers.

I think utf8.len() returns the number of Unicode code points, not the number of glyphs.
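
A small sketch to check the difference between bytes and code points, assuming Lua 5.3 or later (for the utf8 library and the \u{XXX} escapes). The variable names are only for illustration, and the flag is spelled out with escapes on the assumption that it is the England flag tag sequence. For the "Á" case I use U+0301 COMBINING ACUTE ACCENT, which is the combining form of the accent. Note that the standard utf8 library has nothing that counts glyphs.

```lua
-- Requires Lua 5.3+ (utf8 library and \u{XXX} string escapes).

-- England flag: U+1F3F4 WAVING BLACK FLAG, five tag characters
-- spelling "gbeng", then U+E007F CANCEL TAG. One glyph on screen.
local flag = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}"

local flag_bytes = #flag                 -- length in bytes (UTF-8)
local flag_codepoints = utf8.len(flag)   -- length in code points

-- "A" followed by U+0301 COMBINING ACUTE ACCENT. Also one glyph.
local a_acute = "A\u{0301}"
local a_bytes = #a_acute
local a_codepoints = utf8.len(a_acute)

print(flag_bytes, flag_codepoints)   -- 28 bytes, 7 code points
print(a_bytes, a_codepoints)         -- 3 bytes, 2 code points

-- List each code point with the byte position where it starts.
for pos, cp in utf8.codes(flag) do
  print(pos, string.format("U+%04X", cp))
end
```

Both strings show up as a single glyph, but neither #s nor utf8.len(s) returns 1 for them.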

PS: In Delphi, I wrote my own library to handle glyphs, code points and bytes.

On Tue, Jul 10, 2018 at 6:56 PM Gregg Reynolds <dev@mobileink.com> wrote:


On Tue, Jul 10, 2018, 4:44 PM Gregg Reynolds <dev@mobileink.com> wrote:

 (e.g. numbers in ltr scripts).

Correction: numbers in rtl scripts. Unicode says that numbers in e.g. Arabic are ltr. This is complete BS, but it is also a fact on the ground that cannot be fixed. Extra credit: estimate the cost of this very fundamental mistake.


--
Alysson Cunha / AlyssonRPG
http://www.rrpg.com.br - Play traditional tabletop RPGs online