lua-l archive

On Thu, Jul 23, 2009 at 03:05:00PM +0100, David Given wrote:
...
> It turns out to be possible to programmatically split a Unicode string 
> up into its component grapheme clusters (what I was incorrectly 
> referring to as glyphs, and what most people think of as characters). 
> So, it ought to be fairly simple to do a Lua addon where you can say:
> 
> for c in s:graphemes() do
>   print(c)
> end

It is, and slnunicode actually does this, with a few caveats:

"
See http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
for default grapheme clusters.
Lazy westerners we are (and lacking the Hangul_Syllable_Type data),
we care for base char + Grapheme_Extend, but not for Hangul syllable sequences.

For http://unicode.org/Public/UNIDATA/UCD.html#Grapheme_Extend
we use Mn (NON_SPACING_MARK) + Me (ENCLOSING_MARK),
ignoring the 18 mostly south asian Other_Grapheme_Extend (16 Mc, 2 Cf) from
http://www.unicode.org/Public/UNIDATA/PropList.txt
"

It provides multiple string libs, one of which operates on graphemes,
meaning that length, substr, etc. all count in grapheme clusters.
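
To illustrate what "length counts grapheme clusters" means under the
simplified rule quoted above (base char + Grapheme_Extend), here is a minimal
C sketch. It is not slnunicode's code: decode_next, extend_demo and
grapheme_len are hypothetical names, input is assumed to be well-formed UTF-8,
and extend_demo only knows U+0300..U+036F instead of the full Mn + Me set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Grapheme_Extend table: it only covers the
 * Combining Diacritical Marks block U+0300..U+036F (all Mn). */
static int extend_demo(unsigned code)
{
    return code >= 0x0300 && code <= 0x036F;
}

/* Decode one code point from well-formed UTF-8 and advance *pp. */
static unsigned decode_next(const char **pp)
{
    const unsigned char *p = (const unsigned char *)*pp;
    unsigned code = *p++;
    int extra = code < 0x80 ? 0 : code < 0xE0 ? 1 : code < 0xF0 ? 2 : 3;

    if (extra)
        code &= 0x3Fu >> extra;            /* keep the lead byte's payload bits */
    while (extra-- > 0)
        code = (code << 6) | (*p++ & 0x3F); /* append 6 bits per trailing byte */
    *pp = (const char *)p;
    return code;
}

/* Count clusters in [s, end): every code point that is not an extender
 * starts a new cluster; a leading extender counts as a cluster of its own. */
static size_t grapheme_len(const char *s, const char *end)
{
    size_t n = 0;
    int first = 1;

    assert(s <= end);
    while (s < end) {
        unsigned code = decode_next(&s);
        if (first || !extend_demo(code))
            n++;
        first = 0;
    }
    return n;
}
```

With this, the 4-byte string "ae\xCC\x81" (a, e, U+0301 COMBINING ACUTE
ACCENT) is 3 code points but 2 grapheme clusters.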

> ...where c is a *string* containing a particular grapheme cluster (which 
> might be quite long; the link has an example of a four-code point 
> cluster). This would actually allow a string to be broken down into an 
> array of grapheme clusters to give true random access, which I'd 
> previously thought of as being impossible. It'd be expensive, though... 
> possibly it'd be worth doing lazily.

It boils down to snippets like:
      if (MODE_GRAPH == mode)
        while (Grapheme_Extend(code) && p>s) code = utf8_oced(&p, s);
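
For readers without the source at hand, here is a minimal, self-contained C
sketch of the same backward scan. It is not slnunicode's code: decode_prev,
is_extend and cluster_start are hypothetical names, input is assumed to be
well-formed UTF-8, and is_extend only covers U+0300..U+036F rather than the
full Mn + Me set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Grapheme_Extend lookup: it only covers the
 * Combining Diacritical Marks block U+0300..U+036F (all Mn). */
static int is_extend(unsigned code)
{
    return code >= 0x0300 && code <= 0x036F;
}

/* Decode the code point that ends just before *pp and move *pp back to its
 * first byte. Assumes well-formed UTF-8 and start < *pp. */
static unsigned decode_prev(const char **pp, const char *start)
{
    const unsigned char *p = (const unsigned char *)*pp;
    const unsigned char *s = (const unsigned char *)start;
    unsigned code = 0, shift = 0;

    assert(s < p);
    --p;
    while ((*p & 0xC0) == 0x80 && p > s) {   /* trailing (continuation) byte */
        code |= (unsigned)(*p & 0x3F) << shift;
        shift += 6;
        --p;
    }
    /* p is now at the lead byte; its payload mask depends on the length */
    code |= (unsigned)(*p & (shift ? 0x3Fu >> (shift / 6) : 0x7Fu)) << shift;
    *pp = (const char *)p;
    return code;
}

/* Return the start of the grapheme cluster that ends at `end`:
 * step back one code point, then keep stepping while it is an extender. */
static const char *cluster_start(const char *start, const char *end)
{
    const char *p = end;
    unsigned code;

    if (p == start)
        return start;
    code = decode_prev(&p, start);
    while (is_extend(code) && p > start)
        code = decode_prev(&p, start);
    return p;
}
```

For the string "ae\xCC\x81" (a, e, U+0301), calling cluster_start with end
pointing past the last byte returns the address of the 'e', so the final
cluster is the 3-byte sequence "e" + U+0301.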

It is not much more expensive than plain UTF-8,
which in turn is not more expensive than UTF-16 done right,
i.e. with checking for the surrogate pairs to encode characters
beyond the BMP.
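
As a rough idea of what "done right" costs on the UTF-16 side, here is a
minimal C sketch of a decoder that checks for surrogate pairs. utf16_next is a
hypothetical name, and passing unpaired surrogates through unchanged is an
assumption of this sketch, not something any particular library does.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one code point from UTF-16 code units in [*pp, end) and advance *pp.
 * A high surrogate followed by a low surrogate is combined into a code point
 * beyond the BMP; an unpaired surrogate is returned as-is (sketch assumption). */
static unsigned utf16_next(const unsigned short **pp, const unsigned short *end)
{
    const unsigned short *p = *pp;
    unsigned code;

    assert(p < end);
    code = *p++;
    if (code >= 0xD800 && code <= 0xDBFF      /* high surrogate ...      */
        && p < end
        && *p >= 0xDC00 && *p <= 0xDFFF) {    /* ... followed by a low one */
        code = 0x10000 + ((code - 0xD800) << 10) + (unsigned)(*p++ - 0xDC00);
    }
    *pp = p;
    return code;
}
```

The per-character work is one or two range checks plus a shift and an add,
which is comparable to looking at the lead byte and trailing bytes in UTF-8.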


enjoy