
> Using UTF-8 in Lua is attractive because it is already 8 bit clean.
> Certainly the string lib needs to be updated to deal with a 
> character being
> between 1 and 4 bytes, as the UTF-8 spec defines, and there are some
> collation issues that are very complex.  (snip)
> Love, Light and Peace,
> - Peter Loveday
> Director of Development, eyeon Software

I've been working on this in a library I've written. I've 
been trying to come up with a good way to support 8-bit, 
Unicode or UTF8, UTF16, or even UTF32. The solution I'm
trying is the same as Java's, where internally all strings 
are stored in UTF16. With this, it's easy to hook up the 
right system calls on a per-platform basis, which is needed
since wide character versions of calls like fopen are not

The tough part is figuring out what the user enters, since
on most platforms a library has to deal with a mix of encoding 
types. For example, across the various Windows platforms, I 
can have 8 bit codepage mapped, ASCII, or Unicode for input 
character data. I still haven't solved this problem. There's
no solid way to detect the incoming string type on the fly, 
so it looks like the library will have to handle it one of
three ways:

1) Force the user to adhere to UTF16 on all input for 
the library.
2) Try to detect the string width on init and assume the user
is doing the same.
3) Hard code string width via a compile option.
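To illustrate why on-the-fly detection is shaky, here is a minimal sketch of the one cheap heuristic available: looking for a byte order mark at the start of the input. The names guess_encoding and guessed_encoding are made up for this example and do not come from any library. Input without a BOM (plain ASCII, a codepage, BOM-less UTF-8) simply comes back as unknown, which is exactly the problem.

```c
#include <stddef.h>

/* Hypothetical result type; the names are illustrative only. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM: could be ASCII, a codepage, or BOM-less UTF-8 */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE
} guessed_encoding;

/* Guess an encoding from a leading byte order mark, if one is present.
 * Note the ambiguity: FF FE followed by 00 00 is also the UTF-32LE BOM,
 * so even a positive answer here is only a guess. */
guessed_encoding guess_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

Anything that returns ENC_UNKNOWN would still need one of the three policies above.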

I'm probably going to use #1. In most cases, at least on
Windows, it's fairly easy to convert from char* to wchar_t*
and use macros like L"string" when defining string constants. 
I can't support non-Unicode systems though, like Win98. 
(oh well :) It's probably time we all started moving toward
16-bit code units anyway. And since Unicode maps into 
UTF16, and ASCII maps directly into UTF16, it all works pretty
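As a rough sketch of the "ASCII maps directly into UTF16" point: each 7-bit ASCII byte becomes one 16-bit code unit with the same value. The function name ascii_to_utf16 is hypothetical, and I'm using uint16_t rather than wchar_t here because the width of wchar_t differs between platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen a NUL-terminated 7-bit ASCII string into UTF-16 code units.
 * Returns the number of units written, not counting the terminator,
 * or (size_t)-1 if a byte is not ASCII or the output buffer is too small. */
size_t ascii_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    size_t i;

    if (dst_cap == 0)
        return (size_t)-1;

    for (i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        /* Reject non-ASCII bytes, and keep one slot for the terminator. */
        if (c > 0x7F || i + 1 >= dst_cap)
            return (size_t)-1;
        dst[i] = (uint16_t)c;
    }
    dst[i] = 0;
    return i;
}
```

Bytes above 0x7F are rejected on purpose, since their meaning depends on the codepage and they can't be widened blindly.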

On the string and io Lua libs - these have the same problem 
since they make calls to things like fopen, fgets, strlen, etc., 
and accept char* pointers. They will require a little work to 
get them to run on systems which are Unicode. Overall though, 
probably not a big deal since people can port them to their 
particular platform fairly easily. 

As for the core Lua lib, it's mostly string width independent
I believe, although a quick search across the source produces a 
number of calls to things like strlen in routines like 
lua_pushstring, lua_dostring, luaO_chunkid, luaS_new, and 
luaV_strcomp. (Maybe these should be memcmp's instead??) 
Hopefully if you stay away from some of the ASCII api calls like
lua_pushstring and lua_dostring, everything works on a system that 
uses Unicode or UTF16?
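A small sketch of why the strlen calls matter: in UTF-16 every ASCII character carries a zero byte, so strlen stops after the first character. Passing an explicit byte count sidesteps that, which is what lua_pushlstring allows on the API side. The helper bytes_equal below is a made-up name for illustration, not something from the Lua source.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: each ASCII character is followed by a zero byte. */
static const char hi_utf16le[] = { 'H', 0x00, 'i', 0x00 };

/* Compare two byte buffers by explicit length rather than by NUL terminator,
 * so embedded zero bytes are treated as ordinary data. */
int bytes_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Here strlen(hi_utf16le) reports 1 while sizeof hi_utf16le is 4, so any code path that recovers the length with strlen would silently truncate the string.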