On 09/02/12 18:37, Roberto Ierusalimschy wrote:
[...]
> utf8.codepoint(s, i, j) -> code points in s from *byte* offset i to j
> (default i=1, j=i); i adjusts backward and j adjusts forward until a
> proper frontier. (It might be useful to have another function that
> returns a table with those code points; {utf8.codepoint(s, 1, -1)}
> may be too heavy.)
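
If I read the proposed semantics right, that means, for example:

  utf8.codepoint("héllo", 1, -1)  --> 104, 233, 108, 108, 111

(é is U+00E9, encoded as two bytes, so byte offsets and code point
counts diverge after it.)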

The primitive I use most when dealing with Unicode is 'given a byte
offset i, get me the next code point and advance i accordingly'. This
nearly does that, but not quite: it returns the code point, but not
the byte offset where the next one starts.
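
For concreteness, here's a rough sketch of that primitive in plain Lua
(the name 'nextcodepoint' is mine, just for illustration). It does no
validation at all: it assumes well-formed UTF-8 and that i sits on a
proper frontier.

-- Given a byte offset i, return the byte offset of the following
-- code point plus the code point starting at i, or nil at the end
-- of the string.
local function nextcodepoint(s, i)
  local b = s:byte(i)
  if b == nil then return nil end
  local len, cp
  if b < 0x80 then len, cp = 1, b              -- ASCII
  elseif b < 0xE0 then len, cp = 2, b % 0x20   -- two-byte sequence
  elseif b < 0xF0 then len, cp = 3, b % 0x10   -- three-byte sequence
  else len, cp = 4, b % 0x08                   -- four-byte sequence
  end
  for k = i + 1, i + len - 1 do                -- fold in continuation bytes
    cp = cp * 64 + s:byte(k) % 64
  end
  return i + len, cp
end

-- typical loop:
local s = "héllo"
local i = 1
while i <= #s do
  local nexti, cp = nextcodepoint(s, i)
  print(i, cp)
  i = nexti
end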

In hindsight, what I *should* have been using was 'given a byte offset
i, get me the next *grapheme cluster* (as a string) and advance i
accordingly'. Unfortunately, while I do know there are rules for
automatically determining grapheme cluster boundaries (UAX #29,
"Unicode Text Segmentation"), I suspect they're too heavy for this
kind of low-level stuff.
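
Just to show the shape of the primitive, though, here's a toy sketch
on top of nextcodepoint() above. It is emphatically not UAX #29 (real
segmentation needs the full Unicode property tables); the hard-coded
"extend" ranges below cover only combining diacritics and the Tibetan
marks used in the example that follows.

-- Toy approximation: a cluster is a base code point plus any
-- immediately following marks from a few hard-coded ranges. The
-- real rules live in UAX #29 and need Unicode property data.
local function iscombining(cp)
  return (cp >= 0x0300 and cp <= 0x036F)   -- combining diacritics
      or (cp >= 0x0F71 and cp <= 0x0F84)   -- Tibetan vowel signs etc.
      or (cp >= 0x0F90 and cp <= 0x0FBC)   -- Tibetan subjoined letters
end

-- Given a byte offset i, return the byte offset of the next cluster
-- plus the cluster itself as a string, or nil at the end of the string.
local function nextgrapheme(s, i)
  local j = nextcodepoint(s, i)            -- consume the base code point
  if j == nil then return nil end
  while j <= #s do
    local k, cp = nextcodepoint(s, j)      -- peek at the next code point
    if not iscombining(cp) then break end
    j = k                                  -- it extends the cluster
  end
  return j, s:sub(i, j - 1)
end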

Incidentally, just for fun, I recently found this grapheme cluster,
which appears to be the longest one usable in real life:

U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82

It's the Tibetan symbol HAKṢHMALAWARAYAṀ, and looks like this (if you're
lucky):

ཧྐྵྨླྺྼྻྂ
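
For what it's worth, the toy nextgrapheme() above does treat that
whole sequence as a single nine-code-point cluster:

local cluster = "ཧྐྵྨླྺྼྻྂ"      -- the sequence above, as UTF-8
local i, g = nextgrapheme(cluster, 1)
print(i, #g)                     --> 28  27 (one 27-byte cluster)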

-- 
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ "Never attribute to malice what can be adequately explained by
│ stupidity." --- Nick Diamos (Hanlon's Razor)
