> The null character ('\0' in C) is represented in Unicode as a single,
> zero byte.

I believe it's a null word, not a null byte. In 16-bit Unicode every
character is 16 bits wide, including the terminating character. Also,
many 16-bit Unicode characters have 0x00 as one of the two bytes in the
word. The ASCII range is an example, since it maps directly into Unicode:
0x002E (Unicode) -> 0x2E (ASCII) == '.'
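
To make the point concrete, here is a minimal C sketch. It assumes the text is stored as an array of uint16_t code units ending in a 16-bit zero; the helper names utf16_len and byte_len are made up for illustration and are not part of Lua.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length in 16-bit code units, counted up to the 16-bit zero terminator. */
size_t utf16_len(const uint16_t *s) {
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* What a byte-oriented strlen reports for the same buffer: it stops at
   the first zero BYTE, which for ASCII-range characters is half of the
   very first code unit. */
size_t byte_len(const uint16_t *s) {
    return strlen((const char *)s);
}
```

For the two-character string { 0x0048, 0x002E, 0x0000 }, utf16_len gives 2, while byte_len gives 1 on a little-endian machine and 0 on a big-endian one.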

> It does string-order comparison:  "hi" <= "hello". Yes, this one breaks
> an external Unicode system. Suggestions?

What would be really cool is if it never depended on 8-bit character
widths, by carrying an explicit string length everywhere. It seems you
guys are pretty close to that now. Any place that uses the str* functions
would have to switch to the mem* functions that take a length. The hard
one to replace is the strcoll call in luaV_strcomp. Ugh! I'm familiar
with normalization in Unicode, and strcoll is essentially the same thing
for 8-bit strings: apply normalization, then compare the results with
strcmp. To use it you really have to know the width of the character, so
that you can call the wide, multibyte, or ASCII version, which is not
very elegant.
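
As a rough sketch of the "mem functions that take a length" idea, a comparison along these lines never looks for a terminator. The name buf_compare is hypothetical, and this is plain byte order, not locale collation, so it is not a drop-in replacement for strcoll.

```c
#include <stddef.h>
#include <string.h>

/* Compare two buffers by explicit length. Embedded zero bytes are treated
   like any other byte. Returns <0, 0, or >0 in the manner of memcmp. */
int buf_compare(const char *a, size_t la, const char *b, size_t lb) {
    size_t n = la < lb ? la : lb;
    int c = memcmp(a, b, n);          /* compare the common prefix */
    if (c != 0) return c;
    return (la > lb) - (la < lb);     /* equal prefix: shorter sorts first */
}
```

With this, buf_compare("a\0b", 3, "a\0c", 3) is negative, where strcmp would stop at the zero byte and call the two strings equal.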

One solution would be to 

a) make the source as independent of character width as possible, so
that you end up with just a few places where a call like "strcoll"
is used.

b) allow the user to define the function(s) used in these places, 
so for example I can set via a config file "luastrcoll" to "wstrcoll",
"mbstrcoll", "strcoll", "utf16strcoll", or some other routine. 

Then it's up to me to make sure I'm passing compatible strings into
the library. I could use any string format I wanted, as long as I
provided a version of "luastrcoll" that works with my string encoding
format.
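
Here is one way a) and b) could look in C, as a sketch only. The name luastrcoll comes from the proposal above; the collate_fn signature, set_luastrcoll, str_le, and the two sample routines are assumptions for illustration, not anything in the Lua source. It uses a function pointer set at run time rather than a config file, and the UTF-16 routine compares raw code units, which is not real Unicode collation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Every collation routine gets raw buffers plus their lengths in bytes. */
typedef int (*collate_fn)(const void *a, size_t la, const void *b, size_t lb);

/* Default: narrow strings through strcoll (assumes no embedded zeros). */
int collate_narrow(const void *a, size_t la, const void *b, size_t lb) {
    (void)la; (void)lb;
    return strcoll((const char *)a, (const char *)b);
}

/* User-supplied example: compare 16-bit code units in numeric order. */
int collate_utf16(const void *a, size_t la, const void *b, size_t lb) {
    const uint16_t *x = (const uint16_t *)a;
    const uint16_t *y = (const uint16_t *)b;
    size_t na = la / 2, nb = lb / 2;
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (x[i] != y[i]) return x[i] < y[i] ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}

/* The single hook the core would call wherever it needs string order. */
static collate_fn luastrcoll = collate_narrow;

void set_luastrcoll(collate_fn f) { luastrcoll = f; }

/* Stand-in for the core's "l <= r" test on strings. */
int str_le(const void *a, size_t la, const void *b, size_t lb) {
    return luastrcoll(a, la, b, lb) <= 0;
}
```

The core only ever calls str_le, so switching encodings is one call to set_luastrcoll, and keeping the strings compatible with the installed routine stays the caller's job.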

As for the aux libs, I'd say leave it to the users who work with Unicode,
UTF-16, or whatever to port them for you. :)

Regards,
Jim