Mark Hamburg wrote:
I haven't pounded on it extensively, but I've wired my simple Lua
environment (built in Cocoa on MacOS X) to work with UTF8 encoded strings
for input and output. I expect this to be fine so long as I:

* Don't want to disassemble strings into characters

I definitely want to do that. I need to compare parts of strings to other strings, pull out bits of strings and stick strings together.

* Use regular expressions that use things other than low-ASCII for matches
* Perform comparisons on strings other than for equality

And possibly this as well.
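For illustration only, here is a minimal sketch of what "disassembling" a UTF-8 string into characters involves at the byte level. It is written in Python rather than Lua, and the function name utf8_chars is hypothetical. It assumes the input is already valid UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The same scan could be done over the bytes of a Lua string.

```python
def utf8_chars(data):
    """Split valid UTF-8 bytes into a list with one bytes object per code point.

    Hypothetical helper for illustration; it does not validate its input.
    """
    chars = []
    start = 0
    for i in range(1, len(data) + 1):
        # A character ends where the data ends or where the next byte is
        # not a continuation byte (continuation bytes are 0x80..0xBF).
        if i == len(data) or (data[i] & 0xC0) != 0x80:
            chars.append(data[start:i])
            start = i
    return chars


# "naive" with a diaeresis on the i, a space, and a euro sign.
sample = "na\u00efve \u20ac".encode("utf-8")
pieces = utf8_chars(sample)

# Sorting by raw bytes gives the same order as sorting by code point.
words = ["z", "\u00e9", "a", "\u20ac"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_point = sorted(words)
```

Sticking the pieces back together is plain byte concatenation. Ordering comparisons on raw UTF-8 bytes follow code point order, which is not the same thing as language-aware collation.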


What this relies on is that:

* Lua fully supports essentially any 8-bit character set but really only
cares about those in the 7-bit ASCII set from a parsing standpoint

* UTF-8 does all of its encoding using combinations of high 8-bit values --
i.e., the bytes of a multibyte character can never be mistaken for ASCII
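The property in the last point can be checked directly. The following is a small sketch in Python (not Lua), with a hypothetical helper name, that confirms every byte of a multibyte UTF-8 character is 0x80 or above. That is why a byte-oriented search for an ASCII character cannot match in the middle of a multibyte character.

```python
def multibyte_bytes_are_high(text):
    """Return True if every byte of every non-ASCII character is >= 0x80.

    Hypothetical helper for illustration only.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# The euro sign (U+20AC) takes three bytes, all above the ASCII range.
euro = "\u20ac".encode("utf-8")

# A byte search for an ASCII slash only finds the real slash.
path = "caf\u00e9/men\u00fc".encode("utf-8")
slash_positions = [i for i, b in enumerate(path) if b == ord("/")]
```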

But two identical utf-8 characters can have different encodings, right? So two strings can contain the same characters but different byte sequences and hence may not be equal.
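A sketch of where this concern does and does not apply, in Python for illustration. A given code point has only one valid UTF-8 encoding, since overlong forms are not permitted. However, the same visible character can be spelled with different code point sequences, for example a precomposed "e with acute" versus "e" followed by a combining accent. Comparing after Unicode normalization is one common way to treat such strings as equal. The helper name same_text is hypothetical, and normalization support is assumed to come from a library rather than from Lua itself.

```python
import unicodedata


def same_text(a, b):
    """Compare two UTF-8 byte strings after NFC normalization.

    Hypothetical helper for illustration only.
    """
    text_a = unicodedata.normalize("NFC", a.decode("utf-8"))
    text_b = unicodedata.normalize("NFC", b.decode("utf-8"))
    return text_a == text_b


composed = "\u00e9".encode("utf-8")      # one code point: U+00E9
decomposed = "e\u0301".encode("utf-8")   # "e" + U+0301 COMBINING ACUTE ACCENT

bytes_equal = composed == decomposed            # byte-wise comparison
text_equal = same_text(composed, decomposed)    # normalized comparison
```

With byte-wise equality the two spellings differ, while the normalized comparison treats them as the same text.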

I don't need full utf-8 support, like comparisons for every character and string, but I do need some level of support that allows the use of utf-8, even if the underlying system can't fully support it. Maybe that didn't make sense? What I mean is that strings are allowed to be in utf-8 and are handled by functions which support utf-8, even if only partially. If you need more than those functions currently implement, you just add it to the function, recompile and test, and you don't need to modify anything else.


Chris