Hello,

I have just uploaded the first preliminary version of a new binary Lua
module for handling Unicode to LuaForge. It depends on the
International Components for Unicode (ICU) - see
<http://icu-project.org/> if you are not familiar with it.

The project files page is here:
<http://luaforge.net/frs/?group_id=460>

(In order to use the DLL, you will need the ICU 4.0 Win32 binaries
from <http://icu-project.org/download/4.0.html>)

At this early stage, all that I am trying to do is provide equivalent
functionality to that provided by the standard 'string' library. In
the future, I hope to expand it to include more of ICU's own features,
but for now I think focusing on this goal is sensible enough. Even
that is still some way off, though: I haven't yet completed equivalents
for any of the matching functions (string.match, string.find,
string.gmatch, string.gsub) or for string.format. Hence, this project
is very much alpha-stage.

All of the functionality is currently in two submodules, 'icu.utf8'
and 'icu.ustring'. The 'icu' module itself is otherwise empty.

icu.utf8 provides functions which operate on Lua byte-strings on the
assumption that they contain UTF-8-encoded text:
	icu.utf8.len
	icu.utf8.rep
	icu.utf8.sub
	icu.utf8.reverse
	icu.utf8.upper
	icu.utf8.lower
	icu.utf8.codepoint	(NOTE: equivalent to string.byte)
	icu.utf8.char
All functions take the same parameters as their 'string' library
counterparts, except upper and lower, which both take an optional
second parameter to specify the locale. Also included is icu.utf8.bom,
which is simply a Lua string containing the UTF-8 byte order mark
(\xEF\xBB\xBF).
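
As a rough illustration of why code-point-aware functions matter, the
sketch below compares Lua's byte length with a pure-Lua code point
count. The helper utf8_len is a hypothetical stand-in written for this
example, not the module's implementation, and it assumes valid UTF-8
input. The module name passed to require is also an assumption; that
part is simply skipped if the module is not installed.

```lua
-- Count code points in a UTF-8 byte-string by counting every byte that
-- is NOT a continuation byte. Continuation bytes have the bit pattern
-- 10xxxxxx, i.e. values 128..191. Assumes the input is valid UTF-8.
local function utf8_len(s)
  local count = 0
  for i = 1, #s do
    local b = string.byte(s, i)
    if b < 128 or b >= 192 then
      count = count + 1
    end
  end
  return count
end

-- "héllo" written with decimal escapes: the é is the two bytes C3 A9.
local s = "h\195\169llo"
print(#s)            -- 6 (bytes)
print(utf8_len(s))   -- 5 (code points)

-- ASSUMPTION: the submodule can be loaded under the name "icu.utf8".
local ok, icu_utf8 = pcall(require, "icu.utf8")
if ok and type(icu_utf8) == "table" and icu_utf8.len then
  print(icu_utf8.len(s))
else
  print("icu.utf8 not available; skipping the comparison")
end
```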

icu.ustring is a more heavyweight approach, providing a userdata-based
replacement for byte-strings - internally, a ustring is an array of
16-bit ICU UChars. To create a ustring, you will need to call
icu.ustring.decode(), which takes a byte-string as its first
parameter, and an optional second parameter to specify the encoding,
e.g. "windows-1252". The default encoding is UTF-8. The inverse
function icu.ustring.encode() takes a ustring as its first parameter
and an optional encoding, and returns the encoded byte-string. Calling
tostring() on a ustring encodes it with the default encoding.
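
The decode/encode pair described above could be combined into a small
transcoding helper, sketched below. The function name transcode and the
module name passed to require are assumptions made for this example;
only decode(bytes, encoding) and encode(ustring, encoding) come from
the description above. The helper takes the module as a parameter so
that it can be exercised without the binary installed.

```lua
-- Convert a byte-string from one encoding to another by going through
-- a ustring. `ustring` is expected to be the icu.ustring module table
-- (or anything with compatible decode/encode functions). Passing nil
-- for an encoding leaves the choice to the module's default.
local function transcode(ustring, bytes, from_enc, to_enc)
  local u = ustring.decode(bytes, from_enc)
  return ustring.encode(u, to_enc)
end

-- ASSUMPTION: the submodule can be loaded under the name "icu.ustring".
local ok, ustring = pcall(require, "icu.ustring")
if ok and type(ustring) == "table" and ustring.decode then
  -- "café" in windows-1252: the é is the single byte 233 (0xE9).
  local utf8_bytes = transcode(ustring, "caf\233", "windows-1252")
  print(#utf8_bytes)
else
  print("icu.ustring not available; skipping")
end
```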

All ustrings get interned into a string pool when they are created, so
two ustrings with the same value (code unit-wise) will already be
identical when tested for equality. (This is not the same thing as
Unicode equality - there is no proper collation support yet.)
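
The gap between code-unit equality and Unicode equality can be seen
even with plain Lua byte-strings, which are likewise compared byte for
byte. The sketch below is an analogy using ordinary strings rather than
ustrings, so it runs without the module; the variable names are made up
for illustration.

```lua
-- Two ways of writing the same visible character "é" in UTF-8:
local precomposed = "\195\169"    -- U+00E9 (one code point)
local decomposed  = "e\204\129"   -- U+0065 followed by U+0301 (combining acute)

-- A unit-by-unit comparison treats them as different strings, even
-- though they are canonically equivalent in Unicode terms.
print(precomposed == decomposed)   -- false
print(#precomposed, #decomposed)   -- 2 and 3 bytes respectively
```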

Once you have created a ustring, you can call on it the same subset of
string-library functions that icu.utf8 provides:
	icu.ustring.len
	icu.ustring.rep
	icu.ustring.sub
	icu.ustring.reverse
	icu.ustring.upper
	icu.ustring.lower
	icu.ustring.codepoint
	icu.ustring.char
ustrings can also be concatenated with each other, and the length
operator # can be used instead of icu.ustring.len. In addition, you
can use any of the ustring functions as methods on a ustring instance.
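
A short sketch of the method and operator syntax described above
follows. The helper doubled_upper is hypothetical, and the module name
passed to require is an assumption; the use of .., # and method calls
follows the description in this post.

```lua
-- Concatenate a ustring with itself, then return the length of the
-- result and its upper-cased form, using method-call syntax.
local function doubled_upper(u)
  local twice = u .. u               -- ustrings can be concatenated
  return twice:len(), twice:upper()  -- functions used as methods
end

-- ASSUMPTION: the submodule can be loaded under the name "icu.ustring".
local ok, ustring = pcall(require, "icu.ustring")
if ok and type(ustring) == "table" and ustring.decode then
  local u = ustring.decode("abc")
  local n, up = doubled_upper(u)
  print(n, #(u .. u))    -- # can be used in place of icu.ustring.len
  print(tostring(up))    -- tostring() encodes with the default encoding
else
  print("icu.ustring not available; skipping")
end
```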

I would greatly appreciate any feedback about this. I would love to
hear from anyone who has suggestions about where it should ideally go
next (well, after matching and formatting are added) or who finds any
problems with it, including subtle differences between how these
functions behave and how their standard string module counterparts do.
The binaries are Win32-centric, but I don't believe the actual source
code is. If anyone would like to look into making sure it works on
other platforms, I would be grateful for that.


Thanks for reading,

-Duncan