- Subject: Re: What do you miss most in Lua
- From: Tim Mensch <tim-lua-l@...>
- Date: Tue, 07 Feb 2012 10:55:51 -0700
On 2/7/2012 4:13 AM, David Given wrote:
> What do you mean by a 'character'? A Unicode code point?
I can't speak to what HyperHacker means, but when I manipulated UTF-8
for handling internationalization on over 50 games, I never had to deal
with a SINGLE instance of grapheme clusters breaking the code. I ignored
them, and those 50 games were translated into at least 3 languages each,
and some as many as 9 languages (including Chinese, Japanese, and Arabic).
I would suggest that a LOT of UTF-8 usage in the real world follows that
pattern; not everyone is writing text entry fields.
If you're having to deal with completely arbitrary Unicode, then yes,
you need to deal with grapheme clusters. An optimization we added for
UTF-8 code points would actually be useful there, though: In each string
we cached the last code point offset requested. If you asked for s[i],
it would find the i'th code point and remember the byte offset of
that code point, so loops like:
for (int i = 0; i < s.length(); ++i)
{
    // do something with s[i]
}
...would only need to "walk" the string from the last code point
requested, which is an O(1) step forward or backward (a UTF-8 code point
is at most 4 bytes). You could use the same cache to "walk" grapheme
clusters, and then each step is mostly O(1), unless the text is made up
of crazy large clusters, at which point it's O(M), where M is the
cluster length.
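
For anyone who wants to see the shape of that cache, here is a minimal
sketch. It is not the code from those games: the class name Utf8String,
the member names, and the cachedOffset() accessor are all made up for
illustration, and it assumes the bytes are already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class: a UTF-8 string that remembers the (index, byte offset)
// of the last code point requested. Assumes the input is valid UTF-8.
class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {
        // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        for (unsigned char c : bytes_)
            if (!isContinuation(c)) ++length_;
    }

    // Number of code points, not bytes.
    std::size_t length() const { return length_; }

    // Returns the i'th code point, walking from the cached position.
    char32_t operator[](std::size_t i) const {
        assert(i < length_);
        seek(i);
        return decodeAt(cachedOffset_);
    }

    // Byte offset of the last code point requested (exposed for illustration).
    std::size_t cachedOffset() const { return cachedOffset_; }

private:
    static bool isContinuation(unsigned char c) { return (c & 0xC0) == 0x80; }

    // Move the cache to code point i, one code point at a time.
    void seek(std::size_t i) const {
        while (cachedIndex_ < i) {  // step forward past the current code point
            ++cachedOffset_;
            while (cachedOffset_ < bytes_.size() && isContinuation(bytes_[cachedOffset_]))
                ++cachedOffset_;
            ++cachedIndex_;
        }
        while (cachedIndex_ > i) {  // step backward to the previous lead byte
            --cachedOffset_;
            while (cachedOffset_ > 0 && isContinuation(bytes_[cachedOffset_]))
                --cachedOffset_;
            --cachedIndex_;
        }
    }

    // Decode the code point whose lead byte is at byte offset off.
    char32_t decodeAt(std::size_t off) const {
        unsigned char lead = bytes_[off];
        if (lead < 0x80) return lead;  // ASCII
        int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
        for (int k = 1; k <= extra; ++k)
            cp = (cp << 6) | (bytes_[off + k] & 0x3F);
        return cp;
    }

    std::string bytes_;
    std::size_t length_ = 0;
    mutable std::size_t cachedIndex_ = 0;   // index of the last code point requested
    mutable std::size_t cachedOffset_ = 0;  // its byte offset in bytes_
};
```

With this, a loop over s[0], s[1], s[2], ... moves the cache by one code
point per iteration. A jump to a far-away index still costs a walk
proportional to the distance from the cached position.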
In neither case does it necessarily make the library hideously
heavyweight, though, unless adding an index/offset pair to each UTF-8
string is hideously overweight. The only obvious change for grapheme
clusters is that, instead of returning a code point, which can fit in an
int, the s[i] above would return a binary blob (effectively, an opaque
string).
Tim