Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)


Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: William Ahern <william@...>
Date: Fri, 10 Feb 2012 11:25:33 -0800

On Fri, Feb 10, 2012 at 02:53:31PM +0100, Bernd Eggink wrote:
<snip>
> For short strings I find a different approach more convenient: Transform 
> the Lua string into an array of strings, where each element contains a 
> complete UTF-8 sequence, and then operate on that array. This may be 
> more expensive with regard to memory, but IMO it's easier to handle, and 
> probably also faster (no need to iterate through the string to find the 
> n-th character, etc.). Except for the pattern matching functions, most 
> string functions can easily be re-written for this data type, often as 
> one-liners. After editing, a simple table.concat() transforms this 
> structure back into a Lua string.
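
For reference, the array-of-sequences approach quoted above might look roughly like the sketch below. The helper name `explode` is made up for illustration, and the pattern assumes the input is already valid UTF-8: it treats every non-continuation byte as the start of a new sequence and does no validation.

```lua
-- Split a UTF-8 string into a table with one complete sequence per element.
-- Assumption: s is valid UTF-8; stray continuation bytes are silently skipped.
local function explode(s)
  local chars = {}
  -- A lead (non-continuation) byte followed by any continuation bytes 0x80-0xBF.
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

local chars = explode("na\195\175ve")  -- "naive" with U+00EF (two bytes) as 3rd char
print(#chars)                          -- 5
chars[3] = "i"                         -- edit by character index, not byte offset
print(table.concat(chars))             -- naive
```

Indexing is then by character rather than by byte, at the cost of one small string per element.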

Why not an array of numbers? Perl 6 defines a "grapheme normalization form",
NFG, in which every grapheme cluster is represented by a single codepoint.

Not unlike the way strings are interned in Lua, each unique grapheme
cluster is dynamically assigned a codepoint at runtime, so that clusters can
be compared as easily as numbers.
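
A minimal sketch of that interning idea, not Perl's actual implementation: the names `intern`, `ids` and `clusters`, and the choice to start numbering just past U+10FFFF, are assumptions made here for illustration.

```lua
-- Hypothetical interning table: cluster bytes -> number assigned at runtime.
local ids, clusters = {}, {}
local next_id = 0x110000  -- assumption: first value beyond the Unicode range

local function intern(cluster)
  local id = ids[cluster]
  if not id then
    id = next_id
    next_id = next_id + 1
    ids[cluster] = id
    clusters[id] = cluster  -- reverse map, to recover the original bytes
  end
  return id
end

local a = intern("e\204\129")  -- "e" + U+0301 COMBINING ACUTE ACCENT
local b = intern("e\204\129")
local c = intern("o\204\136")  -- "o" + U+0308 COMBINING DIAERESIS
print(a == b, a == c)          -- true    false
```

Comparing two clusters is then a plain number comparison, and `clusters[id]` gives the bytes back when the string has to be rebuilt.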

Rather than exploding a string into a huge list or table, an iterator
over grapheme clusters would be pretty nifty all by itself. And fast,
as you're just returning a number each time.
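
Such an iterator could be sketched as follows. This is a deliberately crude approximation: real grapheme cluster boundaries need the Unicode segmentation tables, whereas this sketch only attaches combining marks in U+0300..U+036F to the preceding character. It also assumes valid UTF-8, and all function names are hypothetical.

```lua
-- Decode the UTF-8 bytes of one code point into a number (no validation).
local function decode(ch)
  local b1 = ch:byte(1)
  if b1 < 0x80 then return b1 end
  local cp
  if b1 >= 0xF0 then cp = b1 - 0xF0
  elseif b1 >= 0xE0 then cp = b1 - 0xE0
  else cp = b1 - 0xC0 end
  for i = 2, #ch do
    cp = cp * 64 + (ch:byte(i) - 0x80)  -- 6 payload bits per continuation byte
  end
  return cp
end

-- Simplification: only the Combining Diacritical Marks block extends a cluster.
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

local ids, next_id = {}, 0x110000  -- synthetic numbers start past U+10FFFF

-- Iterator returning one number per (approximate) grapheme cluster.
local function grapheme_ids(s)
  local nextchar = s:gmatch("[^\128-\191][\128-\191]*")
  local pending = nextchar()
  return function()
    if not pending then return nil end
    local cluster, count = pending, 1
    pending = nextchar()
    while pending and is_combining(decode(pending)) do
      cluster, count = cluster .. pending, count + 1
      pending = nextchar()
    end
    if count == 1 then return decode(cluster) end  -- plain code point
    local id = ids[cluster]
    if not id then
      id, next_id = next_id, next_id + 1
      ids[cluster] = id
    end
    return id
  end
end

local out = {}
for n in grapheme_ids("ae\204\129z") do out[#out + 1] = n end
print(table.concat(out, " "))  -- 97 1114112 122
```

Single code points come back as themselves, so only multi-codepoint clusters ever touch the interning table.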

