On 09/02/12 18:37, Roberto Ierusalimschy wrote:
[...]
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (It might be useful to have another function that
> returns a table with those code points; {utf8.codepoint(s, 1, -1)}
> may be too heavy.)
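
If I read the proposed semantics right, that means, for example:

  utf8.codepoint("héllo", 1, -1)  --> 104, 233, 108, 108, 111

(é is U+00E9, encoded as two bytes, so byte offsets and code point
counts diverge after it.)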

The primitive I use most when dealing with Unicode is 'given a byte
offset i, get me the next code point and advance i accordingly'. This
nearly does that, but not quite: it returns the code point, but not
the byte offset where the next one starts.
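
For concreteness, here's a rough sketch of that primitive in plain Lua
(the name 'nextcodepoint' is mine, just for illustration). It does no
validation at all: it assumes well-formed UTF-8 and that i sits on a
proper frontier.

-- Given a byte offset i, return the byte offset of the following
-- code point plus the code point starting at i, or nil at the end
-- of the string.
local function nextcodepoint(s, i)
  local b = s:byte(i)
  if b == nil then return nil end
  local len, cp
  if b < 0x80 then len, cp = 1, b              -- ASCII
  elseif b < 0xE0 then len, cp = 2, b % 0x20   -- two-byte sequence
  elseif b < 0xF0 then len, cp = 3, b % 0x10   -- three-byte sequence
  else len, cp = 4, b % 0x08                   -- four-byte sequence
  end
  for k = i + 1, i + len - 1 do                -- fold in continuation bytes
    cp = cp * 64 + s:byte(k) % 64
  end
  return i + len, cp
end

-- typical loop:
local s = "héllo"
local i = 1
while i <= #s do
  local nexti, cp = nextcodepoint(s, i)
  print(i, cp)
  i = nexti
end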

In hindsight, what I *should* have been using was 'given a byte offset
i, get me the next *grapheme cluster* (as a string) and advance i
accordingly'. Unfortunately, while I do know there are rules for
automatically determining grapheme cluster boundaries (UAX #29,
"Unicode Text Segmentation"), I suspect they're too heavy for this
kind of low-level stuff.
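
Just to show the shape of the primitive, though, here's a toy sketch
on top of nextcodepoint() above. It is emphatically not UAX #29 (real
segmentation needs the full Unicode property tables); the hard-coded
"extend" ranges below cover only combining diacritics and the Tibetan
marks used in the example that follows.

-- Toy approximation: a cluster is a base code point plus any
-- immediately following marks from a few hard-coded ranges. The
-- real rules live in UAX #29 and need Unicode property data.
local function iscombining(cp)
  return (cp >= 0x0300 and cp <= 0x036F)   -- combining diacritics
      or (cp >= 0x0F71 and cp <= 0x0F84)   -- Tibetan vowel signs etc.
      or (cp >= 0x0F90 and cp <= 0x0FBC)   -- Tibetan subjoined letters
end

-- Given a byte offset i, return the byte offset of the next cluster
-- plus the cluster itself as a string, or nil at the end of the string.
local function nextgrapheme(s, i)
  local j = nextcodepoint(s, i)            -- consume the base code point
  if j == nil then return nil end
  while j <= #s do
    local k, cp = nextcodepoint(s, j)      -- peek at the next code point
    if not iscombining(cp) then break end
    j = k                                  -- it extends the cluster
  end
  return j, s:sub(i, j - 1)
end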

Incidentally, just for fun, I recently found this grapheme cluster,
which appears to be the longest one usable in real life:

U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82

It's the Tibetan symbol HAKṢHMALAWARAYAṀ, and looks like this (if you're
lucky):

ཧྐྵྨླྺྼྻྂ
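
For what it's worth, the toy nextgrapheme() above does treat that
whole sequence as a single nine-code-point cluster:

local cluster = "ཧྐྵྨླྺྼྻྂ"      -- the sequence above, as UTF-8
local i, g = nextgrapheme(cluster, 1)
print(i, #g)                     --> 28  27 (one 27-byte cluster)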

-- 
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ "Never attribute to malice what can be adequately explained by
│ stupidity." --- Nick Diamos (Hanlon's Razor)
