HyperHacker wrote:
[...]
> I do think a simple UTF-8 library would be quite a good thing to have
> - basically just have all of Lua's string methods, but operating on
> characters instead of bytes.

What do you mean by a 'character'? A Unicode code point? A grapheme
cluster?

If you split the string on code points you'll end up breaking grapheme
clusters in the middle, which will break any combining characters. If
you split the string on grapheme clusters you'll preserve the ability to
do random access into the string, but your string manipulation library
now becomes hideously heavyweight: grapheme clusters can be *any length*
(although the Stream-Safe Text Format in UAX #15 does cap runs of
non-starters, i.e. combining marks, at 30 code points).
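
To make the combining-character problem concrete, here is a rough sketch
in Python, used only because its str type already indexes by code point,
which is roughly what a code-point-based string library would give you.
The function names are made up for illustration.

```python
import unicodedata

def split_first_code_point(s):
    """Split a string after its first code point, as a naive
    code-point-based sub() or slice would."""
    return s[:1], s[1:]

def nfc_length(s):
    """Number of code points left after NFC normalisation."""
    return len(unicodedata.normalize("NFC", s))

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
# but two code points.
e_acute = "e\u0301"
head, tail = split_first_code_point(e_acute)
# head is now a bare "e" and tail is an orphaned combining accent.

# NFC normalisation rescues this particular case, because a precomposed
# U+00E9 exists...
composed_len = nfc_length(e_acute)

# ...but Unicode has no precomposed "q with acute", so this cluster stays
# two code points even after normalisation.
q_acute_len = nfc_length("q\u0301")
```

So normalising first only hides the problem for the clusters that happen
to have a precomposed form.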

The standard intuition that strings are made up of an array of
characters is, unfortunately, not really true in Unicode. It's basically
not possible to do random access into a Unicode string without jumping
through painful hoops.
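
As an illustration of the hoops, and only for the easier code point case,
here is a minimal Python sketch of finding the N-th code point in a UTF-8
byte string. codepoint_offset is a hypothetical helper, and it assumes the
input is already valid UTF-8.

```python
def codepoint_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the index-th code point (0-based).

    Assumes `data` is valid UTF-8. There is no way to jump straight to
    the answer: every byte before it has to be inspected.
    """
    seen = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a
        # new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("code point index out of range")

# "naive cafe" with a diaeresis on the i and an acute on the final e:
# 10 code points, 12 bytes.
text = "na\u00efve caf\u00e9".encode("utf-8")
v_offset = codepoint_offset(text, 3)      # the "v", after the 2-byte i
e_offset = codepoint_offset(text, 9)      # the final 2-byte e-acute
```

That is a linear scan per lookup, and it still only finds code points.
Doing the same for grapheme clusters needs the segmentation rules and
character property tables on top.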

-- 
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup
