On 2/7/2012 4:13 AM, David Given wrote:
> What do you mean by a 'character'? A Unicode code point?

I can't speak to what HyperHacker means, but when I manipulated UTF-8 for handling internationalization on over 50 games, I never had to deal with a SINGLE instance of grapheme clusters breaking the code. I ignored them, and those 50 games were translated into at least 3 languages each, and some as many as 9 languages (including Chinese, Japanese, and Arabic).

I would suggest that a LOT of UTF-8 usage in the real world follows that pattern; not everyone is writing text entry fields.

If you're having to deal with completely arbitrary Unicode, then yes, you need to deal with grapheme clusters. An optimization we added for UTF-8 code points would actually be useful there, though: In each string we cached the last code point offset requested. If you asked for s[i], it would find the i'th code point, and remember the binary offset at that code point, so loops like:

for (int i=0; i<s.length(); ++i)
    // do something with s[i]

...would only need to "walk" the string from the last cached code point, which is an O(1) step forward or backward for sequential access. You could use the same cache to "walk" grapheme clusters, and then it's mostly O(1), unless the text is made up of crazy large clusters, at which point it's O(M), where M is the cluster length.
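To make the caching idea concrete, here is a minimal sketch of what such a string wrapper might look like. It is not the code from those games; the class name `Utf8String`, its members, and the `cached_byte_offset()` accessor are all made up for illustration, and it assumes the input is valid UTF-8 and that the index is in range.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper (not from the original post): a UTF-8 byte string
// plus a one-entry cache of (code point index, byte offset).
// Assumes the bytes are valid UTF-8; no error handling is attempted.
class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {
        // Count code points once: every byte that is not a continuation byte.
        for (unsigned char c : bytes_)
            if (!is_continuation(c)) ++length_;
    }

    std::size_t length() const { return length_; }

    // Return the i'th code point, walking from the cached position.
    // Sequential access (i+1 or i-1 after i) moves one code point only.
    // Precondition: i < length().
    char32_t operator[](std::size_t i) const {
        while (cached_index_ < i) {          // walk forward
            ++cached_offset_;
            while (is_continuation(byte_at(cached_offset_))) ++cached_offset_;
            ++cached_index_;
        }
        while (cached_index_ > i) {          // walk backward
            --cached_offset_;
            while (is_continuation(byte_at(cached_offset_))) --cached_offset_;
            --cached_index_;
        }
        return decode_at(cached_offset_);
    }

    // Exposed only so the cache can be inspected in examples.
    std::size_t cached_byte_offset() const { return cached_offset_; }

private:
    static bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

    unsigned char byte_at(std::size_t off) const {
        return static_cast<unsigned char>(bytes_[off]);
    }

    // Decode the code point whose lead byte is at byte offset `off`.
    char32_t decode_at(std::size_t off) const {
        unsigned char b0 = byte_at(off);
        if (b0 < 0x80) return b0;                        // 1-byte (ASCII)
        int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
        char32_t cp = b0 & (0x3F >> extra);              // payload bits of lead byte
        for (int k = 1; k <= extra; ++k)
            cp = (cp << 6) | (byte_at(off + k) & 0x3F);  // 6 bits per continuation
        return cp;
    }

    std::string bytes_;
    std::size_t length_ = 0;
    // The cache: code point index and byte offset of the last lookup.
    mutable std::size_t cached_index_ = 0;
    mutable std::size_t cached_offset_ = 0;
};
```

With something like this, the loop above touches each byte a bounded number of times, while a random jump to a distant index still costs a walk proportional to the distance from the cached position. A grapheme-cluster version could presumably keep the same index/offset pair and step by cluster boundary instead of by lead byte.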

In neither case does it necessarily make the library hideously heavyweight, though, unless adding an index/offset pair to each UTF-8 string is hideously overweight. The only obvious change is that, instead of returning a code point, which can fit in an int, the s[i] above would return a binary blob (effectively, an opaque string).