lua-users home
lua-l archive



so far everybody seems to be OK with UTF-8 as the single encoding,
which notably does not require any changes to the core.

Please let me stress the importance of distinguishing between
several string-related functions
(not a top quote, but a lengthy intro):
a)	detecting code points (i.e. decoding the bytes to abstract
	Unicode numbers) is straightforward and locale-independent
b)	character classes in Unicode are completely independent of the locale
c)	grapheme boundaries are independent of the locale
	and, with a few exceptions, are easily detected
d)	casing is mostly independent of the locale,
	the main exception being the Turkish I/i issue
	(a design flaw in Unicode, adopted for practical reasons)
e)	normalization is a little less straightforward but,
	IIRC, locale-independent
f)	the single big bugger is collation.
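To illustrate how mechanical point a) really is, here is a minimal
sketch of UTF-8 code point decoding (in Python for brevity; this is
not the selene API, just the byte-level scheme itself):

```python
def decode_utf8(data: bytes):
    """Decode UTF-8 bytes into a list of code points (no error recovery)."""
    points, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 1-byte sequence: 0xxxxxxx
            cp, n = b, 1
        elif b >> 5 == 0b110:        # 2-byte sequence: 110xxxxx
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:       # 3-byte sequence: 1110xxxx
            cp, n = b & 0x0F, 3
        elif b >> 3 == 0b11110:      # 4-byte sequence: 11110xxx
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid lead byte at %d" % i)
        for j in range(1, n):        # continuation bytes: 10xxxxxx
            cp = (cp << 6) | (data[i + j] & 0x3F)
        points.append(cp)
        i += n
    return points

print(decode_utf8("€".encode("utf-8")))   # [8364], i.e. U+20AC
```

The lead byte alone tells you the sequence length, which is why this
needs no locale and no table lookups.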

Well, that's not quite true: things only start to get interesting
with issues related to the graphical representation, like
BiDi printing, Indic super/subscripts and more.
But that's not covered by libc anyway. See the sila project.

a)-d) are already implemented in the selene unicode package
(currently being upgraded to 5.1 string functions) for the
standard cases, and the exceptions can be added if necessary.
Adding e) is tedious but basically also straightforward
(AFAIK libc does not have normalization functions).

On Fri, Dec 30, 2005 at 10:32:05AM -0800, Jens Alfke wrote:
> (I've replied to a bunch of messages here rather than sending out six  
> separate replies...)
> On 29 Dec '05, at 9:17 AM, Chris Marrin wrote:
> >It allows you to add "incidental" characters without the need for a  
> >fully functional editor for that language. For instance, when I  
> >worked for Sony we had the need to add a few characters of Kanji on  
> >occasion. It's not easy to get a Kanji editor setup for a western  
> >keyboard, so adding direct unicode was more convenient. There are  
> >also some oddball symbols in the upper registers for math and  
> >chemistry and such that are easier to add using escapes.
> Also, in some projects there are guidelines that discourage the use  
> of non-ascii characters in source files (due to problems with  
> editors, source control systems, or other tools. In these situations  
> it's convenient to be able to use inline escapes to specify non-ascii  
> characters that commonly occur in human-readable text ... examples  
> would include ellipses, curly-quotes, emdashes, bullets, currency  
> symbols, as well as accented letters of course.
ok, several means have been devised for that,
including decimal byte escape sequences and a custom U function
(which you can even beef up to turn ... into ellipses)
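Such a U function is just the inverse of decoding: pack a code point
into UTF-8 bytes. A hypothetical sketch in Python (the actual Lua
helper discussed on the list may differ):

```python
def U(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

print(U(0x2026))   # the ellipsis character "…" as UTF-8 bytes
```

With this, escapes like U(0x2026) can live in pure-ascii source files
and still produce the right bytes at runtime.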

> wrote:
> >IMO, with globalization, languages that don't support Unicode won't  
> >make the cut in the long run.
> I find it ironic that the three non-Unicode-savvy languages I use  
> (PHP, Ruby, Lua) all come from countries whose native languages use  
> non-ascii characters :)
> I'm keeping an eye on this thread because, if I end up using Lua for  
> any actual work projects, I18N is mandatory and I can't afford to run  
> into any walls when it comes time to make things work in Japanese or  
> Thai or Arabic. (Not like last time...See below for the problems I've  
> had with JavaScript regexps.)
even in PHP it's not a matter of the language, only of the libs.

> wrote:
> >I've done an awful lot of work with international character sets,  
> >and I personally consider any encoding of Unicode other than UTF-8  
> >to be obsolete. Looking at most other modern (i.e. not held back by  
> >backwards compatibility, e.g. Windows) users of Unicode, their  
> >authors appear to feel similarly.
> Yes, at least as an external representation. Internally, it can be  
> convenient to use a wide representation for speed of indexing into  
> strings, but a good string library should hide that from the client  
> as an implementation detail. (Disclaimer: I'm mostly knowledgeable  
> only about Java's and Mac OS X's string libraries.)
IMHO this is a misconception.
The 16-bit wide char is faster only if you really confine it
to UCS-2, i.e. only encoding the Basic Multilingual Plane
of the first 64K characters. Beyond that there is not only
Etruscan, but also yet more Chinese symbols.
To do the real thing, you'd have to use UTF-16, checking
for surrogate pairs. Most implementations don't do that
and thus are nothing but broken hacks.
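The surrogate check those implementations skip is small but essential;
a sketch of the standard UTF-16 pair combination (per the Unicode
encoding forms, shown in Python for illustration):

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a code point beyond the BMP."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # each surrogate contributes 10 bits above the 0x10000 base
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E (musical G clef) is stored in UTF-16 as D834 DD1E
print(hex(combine_surrogates(0xD834, 0xDD1E)))   # 0x1d11e
```

Any code that indexes 16-bit units directly without this step splits
such characters in half, which is exactly the "broken hack" above.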

> wrote:
> >Most apps can just treat strings as opaque byte streams.
> I agree, mostly; the one area I've run into problems has been with  
> regexp/pattern libraries. Any pattern that relies on "alphanumeric  
> characters" or "word boundaries" assumes a fair bit of Unicode  
> knowledge behind the scenes. I ran into this problem when  
> implementing a live search field in Safari RSS, since while JS  
> strings are Unicode-savvy, many implementations' regexp  
> implementations aren't, so character classes like "\w" or "\b" only  
> work on ascii alphanumerics.
see b) above; we have proper character classes on UTF-8.
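For comparison, the difference between a Unicode-aware \w and the
ascii-only behaviour described for those JS engines can be sketched
in Python, where str patterns are Unicode-aware by default:

```python
import re

# Unicode-aware engine: \w matches accented letters, words stay whole
print(re.findall(r"\w+", "naïve café"))            # ['naïve', 'café']

# ascii-only behaviour (like the broken engines): words get split
print(re.findall(r"\w+", "naïve café", re.ASCII))  # ['na', 've', 'caf']
```

The same split-word failure is what a live search field hits when its
regexp engine lags behind the string type.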

That leaves collation as the big issue.
Again, I'd strongly suggest not relying on libc/setlocale for anything
but the most feeble uses.
In "de_DE" we have two collations, and I'd bet for each of them
there exists some libc which will pick it as the default.
One pro of using libc is that you will be sorting bug-compatible
with your local "sort" utility, but just drop any plan to exchange
"sorted" data remotely.
"There is no collation but the one you are using."

If you really need the full Unicode Collation Algorithm,
use ICU, and especially ICU's equivalent of strxfrm.
Otherwise you might want to consider resorting to simpler and faster
means like single-level sorting (ignoring case and accents)
and/or means supporting other features, like decoding the
transformed representation back to the original or an equivalent.
A sample implementation is given in our malete database
and will be released as a lua utility with selene.
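A minimal sketch of such a single-level sort key (this is my own
illustration, not the malete implementation): fold case, then strip
accents via NFD decomposition.

```python
import unicodedata

def sort_key(s: str) -> str:
    """Single-level collation key: ignore case and combining accents."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c))

words = ["Zebra", "Äpfel", "apfel", "zebra"]
print(sorted(words, key=sort_key))
# groups Äpfel with apfel and Zebra with zebra
```

It is nowhere near the UCA, but for search-and-browse uses a key like
this sorts "Äpfel" next to "apfel" without dragging in ICU.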