- Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: David Given <dg@...>
- Date: Thu, 09 Feb 2012 20:48:08 +0000
On 09/02/12 18:37, Roberto Ierusalimschy wrote:
[...]
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (It might be useful another function to return a table
> with those code points; {utf8.codepoint(s, 1, -1)} may be too heavy.)
The primitive I use most when dealing with Unicode is 'given a byte
offset i, get me the next code point and advance i accordingly'. This
nearly does that, but not quite.
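For illustration, here is a minimal sketch of that primitive. It is written in Python rather than Lua purely so it can be run and checked, and it is not the proposed utf8.codepoint. The name next_codepoint and the helper decode_all are made up for this example, and offsets are 0-based here, unlike Lua's 1-based ones.

```python
from typing import List, Tuple

# Smallest code point that legitimately needs 2, 3 and 4 bytes
# (used to reject overlong encodings).
_MIN_FOR_LENGTH = (0x80, 0x800, 0x10000)


def next_codepoint(buf: bytes, i: int) -> Tuple[int, int]:
    """Decode the code point that starts at byte offset i.

    Returns (code_point, offset_of_the_following_code_point).
    Raises ValueError if i does not point at well-formed UTF-8.
    """
    b0 = buf[i]
    if b0 < 0x80:                       # ASCII: a single byte
        return b0, i + 1
    if 0xC2 <= b0 <= 0xDF:              # lead byte of a 2-byte sequence
        extra, cp = 1, b0 & 0x1F
    elif 0xE0 <= b0 <= 0xEF:            # lead byte of a 3-byte sequence
        extra, cp = 2, b0 & 0x0F
    elif 0xF0 <= b0 <= 0xF4:            # lead byte of a 4-byte sequence
        extra, cp = 3, b0 & 0x07
    else:                               # continuation byte or invalid lead
        raise ValueError("offset %d is not the start of a code point" % i)
    if i + extra >= len(buf):
        raise ValueError("truncated sequence at offset %d" % i)
    for k in range(i + 1, i + extra + 1):
        b = buf[k]
        if (b & 0xC0) != 0x80:          # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte at offset %d" % k)
        cp = (cp << 6) | (b & 0x3F)
    if (cp < _MIN_FOR_LENGTH[extra - 1] or cp > 0x10FFFF
            or 0xD800 <= cp <= 0xDFFF):
        raise ValueError("overlong or out-of-range sequence at offset %d" % i)
    return cp, i + extra + 1


def decode_all(buf: bytes) -> List[int]:
    """Walk a buffer with next_codepoint, collecting every code point."""
    out, i = [], 0
    while i < len(buf):
        cp, i = next_codepoint(buf, i)
        out.append(cp)
    return out
```

The point of returning the new offset alongside the code point is that a caller can loop without ever recomputing how many bytes the last code point took, which is what decode_all does.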
In hindsight, what I *should* have been using was 'given a byte offset
i, get me the next *grapheme cluster* (as a string) and advance i
accordingly'. Unfortunately, while I do know there are rules for
automatically determining grapheme cluster boundaries, I suspect they're
too heavy for this kind of low-level stuff.
Incidentally, just for fun, I recently found this grapheme cluster,
which appears to be the longest one usable in real life:
U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82
It's the Tibetan symbol HAKṢHMALAWARAYAṀ, and looks like this (if you're
lucky):
ཧྐྵྨླྺྼྻྂ
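For anyone who wants to poke at that sequence, the following Python sketch lists each code point with its general category and name, using the standard unicodedata module. The variable name STACK and the function describe are made up for this example, and the output depends on the Unicode database shipped with your Python build.

```python
import unicodedata

# The nine code points quoted above, written with escapes so the
# source stays readable even without a Tibetan font.
STACK = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"


def describe(s):
    """Return (U+XXXX, general category, character name) for each code point."""
    return [
        ("U+%04X" % ord(ch),
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


if __name__ == "__main__":
    for code, category, name in describe(STACK):
        print(code, category, name)
    print(len(STACK), "code points,",
          len(STACK.encode("utf-8")), "UTF-8 bytes")
```

All nine code points lie between U+0800 and U+FFFF, so each takes three bytes in UTF-8. The categories should show a single base letter followed by combining marks, which is why the whole stack behaves as one cluster.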
--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ "Never attribute to malice what can be adequately explained by
│ stupidity." --- Nick Diamos (Hanlon's Razor)