On Fri, Feb 10, 2012 at 02:53:31PM +0100, Bernd Eggink wrote:
<snip>
> For short strings I find a different approach more convenient: Transform 
> the Lua string into an array of strings, where each element contains a 
> complete UTF-8 sequence, and then operate on that array. This may be 
> more expensive with regard to memory, but IMO it's easier to handle, and 
> probably also faster (no need to iterate through the string to find the 
> n-th character, etc.). Except for the pattern matching functions, most 
> string functions can easily be re-written for this data type, often as 
> one-liners. After editing, a simple table.concat() transforms this 
> structure back into a Lua string.
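
A minimal sketch of the array-of-strings approach quoted above, for
illustration only. The names utf8_explode and utf8_sub are hypothetical, and
the pattern assumes valid UTF-8 input with no embedded NUL bytes:

```lua
-- Split a UTF-8 string into an array of strings, one element per encoded
-- codepoint. Assumes valid UTF-8 without embedded NUL bytes.
local function utf8_explode(s)
  local chars = {}
  -- A lead byte (ASCII or 0xC2..0xF4) followed by any continuation bytes.
  for ch in s:gmatch("[\1-\127\194-\244][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

-- Character-indexed substring as a one-liner over the array.
-- Only positive indices are handled in this sketch.
local function utf8_sub(chars, i, j)
  return table.concat(chars, "", i, j or #chars)
end

local chars = utf8_explode("na\195\175ve")   -- "naïve", ï is two bytes
local third = utf8_sub(chars, 3, 3)          -- the ï as a whole sequence
local back  = table.concat(chars)            -- back to a plain Lua string
```

Indexing chars[n] gives the n-th character directly, without scanning the
string from the start.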

Why not an array of numbers? Perl 6 came up with a "grapheme normalization
form", NFG, which reduces each grapheme cluster to a single codepoint.

Not unlike the way strings are interned in Lua, each unique grapheme
cluster is dynamically assigned a codepoint at runtime, so that clusters can
be easily compared.
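
Roughly, the interning idea could look like the sketch below. This is not
Perl's actual implementation; the intern function, the two tables, and the
choice to start numbering at 0x110000 (just past the Unicode range) are all
assumptions made for illustration:

```lua
-- Map each distinct grapheme cluster (a byte string) to a number that is
-- assigned the first time the cluster is seen.
local ids = {}             -- cluster string -> assigned number
local clusters = {}        -- assigned number -> cluster string
local next_id = 0x110000   -- assumption: first value above U+10FFFF

local function intern(cluster)
  local id = ids[cluster]
  if not id then
    id = next_id
    next_id = next_id + 1
    ids[cluster] = id
    clusters[id] = cluster
  end
  return id
end

local a = intern("e\204\129")   -- "e" + U+0301 COMBINING ACUTE ACCENT
local b = intern("e\204\129")   -- same cluster, same number
local c = intern("o\204\136")   -- "o" + U+0308 COMBINING DIAERESIS
```

Comparing two clusters is then a plain number comparison, and the clusters
table recovers the original bytes when the string has to be rebuilt.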

Rather than exploding a string into a huge list or table, you could provide
an iterator over grapheme clusters, which would be pretty nifty all by
itself. And fast, as you're just returning a number each time.
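
A rough sketch of such an iterator follows. It is heavily simplified: a
cluster here is a base codepoint plus any following marks in U+0300..U+036F,
whereas real segmentation needs the Unicode grapheme break rules and data
tables. The names decode, is_combining, and graphemes are hypothetical, and
valid UTF-8 input without NUL bytes is assumed:

```lua
-- Decode one UTF-8 sequence (assumed valid) into its codepoint number.
local function decode(ch)
  local b = ch:byte(1)
  if b < 0x80 then return b end
  local cp = (b >= 0xF0 and b - 0xF0) or (b >= 0xE0 and b - 0xE0) or (b - 0xC0)
  for i = 2, #ch do
    cp = cp * 64 + (ch:byte(i) - 0x80)
  end
  return cp
end

-- Simplification: only the Combining Diacritical Marks block counts.
local function is_combining(cp)
  return cp >= 0x300 and cp <= 0x36F
end

local ids, next_id = {}, 0x110000   -- numbers for multi-codepoint clusters

-- Iterator yielding one number per cluster: the codepoint itself for a
-- lone character, otherwise a dynamically assigned number.
local function graphemes(s)
  local next_char = s:gmatch("[\1-\127\194-\244][\128-\191]*")
  local pending = next_char()
  return function()
    if not pending then return nil end
    local cluster, n = pending, 1
    pending = next_char()
    while pending and is_combining(decode(pending)) do
      cluster, n = cluster .. pending, n + 1
      pending = next_char()
    end
    if n == 1 then return decode(cluster) end
    local id = ids[cluster]
    if not id then
      id, next_id = next_id, next_id + 1
      ids[cluster] = id
    end
    return id
  end
end

local out = {}
for g in graphemes("e\204\129ae\204\129") do   -- e+acute, a, e+acute
  out[#out + 1] = g
end
```

Nothing is materialized up front; each step only reads as many bytes as the
next cluster needs.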