- Subject: RE: lua for unicode
- From: james@...
- Date: Sat, 30 Nov 2002 16:05:05 -0600
> Using UTF-8 in Lua is attractive because it is already 8 bit clean.
> Certainly the string lib needs to be updated to deal with a
> character being
> between 1 and 4 bytes, as the UTF-8 spec defines, and there are some
> collation issues that are very complex. (snip)
> Love, Light and Peace,
> - Peter Loveday
> Director of Development, eyeon Software
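
To make the quoted point about 1 to 4 byte characters concrete, here is a
minimal sketch of a character count over UTF-8. The name utf8_length is
made up for illustration (it is not part of Lua's string lib), and the code
assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Every code point has exactly one lead byte; continuation bytes
 * have the bit pattern 10xxxxxx, so we count everything else.
 * Assumption: the input is valid UTF-8 (no validation is done). */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

strlen on the same data returns the byte count instead, which is why a
string lib that assumes one byte per character gives different answers.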
I've been working on this in a library I've written, trying to
come up with a good way to support 8-bit, Unicode, UTF8, UTF16,
or even UTF32 strings. The solution I'm trying is the same as
Java's, where all strings are stored internally as UTF16. With
this, it's easy to hook up the right system calls on a
per-platform basis, which is needed
since wide character versions of calls like fopen are not

The tough part is figuring out what the user enters, since on
most platforms a library has to deal with a mix of encoding
types. For example, across the various Windows platforms, input
character data can be 8-bit codepage-mapped, ASCII, or Unicode.
I still haven't solved this problem. There's no solid way to
detect the incoming string type on the fly, so it looks like the
library will have to handle it in one of these ways:
1) Force the user to adhere to UTF16 on all input for
2) Try to detect the string width on init and assume the user
is doing the same.
3) Hard code string width via a compile option.
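
Option 3 could look roughly like the sketch below. MYLIB_WIDE, mylib_char,
MYLIB_T and mylib_strlen are all hypothetical names invented for this
example. Note that wchar_t is 16 bits on Windows but usually 32 bits
elsewhere, so this fixes the character width, not the encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* MYLIB_WIDE is a hypothetical build flag, e.g. cc -DMYLIB_WIDE ...
 * It selects the character type for the whole library at compile time. */
#ifdef MYLIB_WIDE
typedef wchar_t mylib_char;
#define MYLIB_T(s) L##s   /* MYLIB_T("x") becomes L"x" */
#else
typedef char mylib_char;
#define MYLIB_T(s) s      /* MYLIB_T("x") stays "x" */
#endif

/* Width-independent length: counts characters of type mylib_char
 * up to (not including) the zero terminator. */
size_t mylib_strlen(const mylib_char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        ++n;
    return n;
}
```

Callers would write string constants as MYLIB_T("lua") so the same source
builds either way.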

I'm probably going to use #1. In most cases, at least on
Windows, it's fairly easy to convert from char* to wchar_t*
and to use macros like L"string" when defining string constants.
I can't support non-Unicode systems though, like Win98.
(oh well :) It's probably time we all started moving toward
16-bit code units anyway. And since Unicode maps into
UTF16, and ASCII maps directly into UTF16, it all works pretty
well.
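
As a rough sketch of what "maps into UTF16" means in practice: ASCII bytes
widen one-for-one into 16-bit units, while code points above U+FFFF need
two units (a surrogate pair). The function names below are made up for
illustration and are not from any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen a NUL-terminated 7-bit ASCII string into UTF-16 code units.
 * Returns the number of units written (excluding the terminator), or
 * (size_t)-1 if a non-ASCII byte is seen or dst is too small. */
size_t ascii_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; src[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)src[i];
        if (c > 0x7F || i + 1 >= dst_cap)
            return (size_t)-1;
        dst[i] = c;             /* same numeric value, just wider */
    }
    if (dst_cap == 0)
        return (size_t)-1;
    dst[i] = 0;
    return i;
}

/* Encode one Unicode code point as UTF-16.
 * Returns the number of units written (1 or 2), or 0 if cp is not
 * a valid scalar value (a surrogate, or above U+10FFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

The second function is the reason code that treats one 16-bit unit as one
character still has a variable-width case to handle.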

The string and io Lua libs have the same problem, since they
call things like fopen, fgets, and strlen, and accept char*
pointers. They will require a little work to run on Unicode
systems. Overall though, that's probably not a big deal, since
people can port them to their particular platform fairly
easily.

As for the core Lua lib, I believe it's mostly independent of
string width, although a quick search across the source turns up
a number of calls to things like strlen in routines like
lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and
luaV_strcomp. (Maybe these should be memcmp's instead??)
Hopefully, if you stay away from the ASCII API calls like
lua_pushstring and lua_dostring, everything works on a system
that uses Unicode or UTF16?
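
The strlen concern can be shown without Lua at all. This is only a sketch:
buf_equal and hi_utf16le are names invented for the example. UTF16 text
containing ASCII characters has zero bytes in it, so strlen stops early,
while a length-counted compare with memcmp sees the whole buffer. On the
Lua side, lua_pushlstring takes an explicit length for this reason.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: each ASCII character is followed by a
 * zero byte, which strlen treats as the end of the string. */
const char hi_utf16le[4] = { 'H', 0, 'i', 0 };

/* Compare two length-counted byte buffers. Unlike strcmp, this does
 * not stop at embedded zero bytes. */
int buf_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Here strlen(hi_utf16le) reports 1 even though the buffer holds 4 bytes, so
any routine that derives the length with strlen would truncate the string.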