- Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
- From: William Ahern <william@...>
- Date: Fri, 10 Feb 2012 11:25:33 -0800
On Fri, Feb 10, 2012 at 02:53:31PM +0100, Bernd Eggink wrote:
<snip>
> For short strings I find a different approach more convenient: Transform
> the Lua string into an array of strings, where each element contains a
> complete UTF-8 sequence, and then operate on that array. This may be
> more expensive with regard to memory, but IMO it's easier to handle, and
> probably also faster (no need to iterate through the string to find the
> n-th character, etc.). Except for the pattern matching functions, most
> string functions can easily be re-written for this data type, often as
> one-liners. After editing, a simple table.concat() transforms this
> structure back into a Lua string.
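For reference, the approach quoted above might look roughly like this. It is only a sketch: the names utf8_explode and utf8_reverse are made up for illustration, and the input is assumed to be well-formed UTF-8 (nothing is validated).

```lua
-- Split a UTF-8 string into an array of strings, one complete sequence each.
-- Assumption: the input is valid UTF-8. A lead byte is anything that is not
-- a continuation byte (0x80-0xBF); continuation bytes follow it.
local function utf8_explode(s)
  local chars = {}
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

-- Example of a string function rewritten for the array form:
-- reverse by character rather than by byte, then join with table.concat().
local function utf8_reverse(s)
  local chars = utf8_explode(s)
  local n = #chars
  for i = 1, math.floor(n / 2) do
    chars[i], chars[n - i + 1] = chars[n - i + 1], chars[i]
  end
  return table.concat(chars)
end
```

Length in characters is then just #utf8_explode(s), and the n-th character is a plain table index.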
Why not an array of numbers? Perl 6 concocted a "Normalization Form Grapheme",
NFG, that reduces every grapheme cluster to a single codepoint.
Not unlike the way strings are interned in Lua, each unique grapheme
cluster is dynamically assigned a codepoint at runtime, so that clusters can
be easily compared.
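The dynamic assignment could be sketched along these lines. This is not how Perl does it internally; new_interner and its methods are hypothetical names, and starting the synthetic numbers just above U+10FFFF is my own assumption, chosen so they cannot collide with real codepoints.

```lua
-- Hypothetical interning table: every distinct cluster (given as a string of
-- bytes) is assigned one synthetic number the first time it is seen.
local function new_interner(base)
  local ids, clusters = {}, {}
  -- Assumption: synthetic IDs start above the Unicode range (U+10FFFF).
  local next_id = base or 0x110000
  local self = {}

  -- Return the number for a cluster, allocating a new one if needed.
  function self.intern(cluster)
    local id = ids[cluster]
    if not id then
      id = next_id
      next_id = next_id + 1
      ids[cluster] = id
      clusters[id] = cluster
    end
    return id
  end

  -- Map a synthetic number back to the bytes of its cluster.
  function self.lookup(id)
    return clusters[id]
  end

  return self
end
```

Two clusters are then equal exactly when their numbers are equal, much as two interned Lua strings are equal when they are the same object.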
Rather than exploding a string into a huge list or table, an iterator
over grapheme clusters would be pretty nifty all by itself. And fast,
as you're just returning a number each time.
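A rough sketch of such an iterator follows. Big caveat: real grapheme cluster segmentation needs the Unicode property tables; here a cluster is simply a base codepoint followed by any marks in U+0300..U+036F, which is a deliberate simplification. The names decode, is_combining and clusters are invented for the example, the input is assumed to be well-formed UTF-8, and no normalization is done (so a precomposed character and its decomposed form get different numbers).

```lua
-- Decode the UTF-8 sequence starting at byte i.
-- Returns the codepoint and the index of the next sequence.
local function decode(s, i)
  local b = s:byte(i)
  local n, cp
  if b < 0x80 then n, cp = 0, b
  elseif b < 0xE0 then n, cp = 1, b - 0xC0
  elseif b < 0xF0 then n, cp = 2, b - 0xE0
  else n, cp = 3, b - 0xF0 end
  for k = 1, n do
    cp = cp * 64 + (s:byte(i + k) - 0x80)
  end
  return cp, i + n + 1
end

-- Simplification: only the Combining Diacritical Marks block counts.
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

-- Shared table so the same cluster gets the same number in every string.
-- Assumption: synthetic numbers start just above U+10FFFF.
local ids, next_id = {}, 0x110000

-- Iterate over s, yielding one number per cluster: the codepoint itself for
-- a single-codepoint cluster, a synthetic number otherwise.
local function clusters(s)
  local i = 1
  return function()
    if i > #s then return nil end
    local start = i
    local cp, nxt = decode(s, i)
    local count = 1
    i = nxt
    while i <= #s do
      local c2, n2 = decode(s, i)
      if not is_combining(c2) then break end
      i, count = n2, count + 1
    end
    if count == 1 then return cp end
    local key = s:sub(start, i - 1)
    local id = ids[key]
    if not id then
      id = next_id
      next_id = next_id + 1
      ids[key] = id
    end
    return id
  end
end
```

Usage is the obvious `for g in clusters(str) do ... end`; nothing is allocated per step except when a cluster is seen for the first time.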