That's great, many thanks!

On Mon, May 4, 2009 at 11:46 PM, Duncan Cross <duncan.cross@gmail.com> wrote:
> Hello,
>
> I have just uploaded the first preliminary version of a new binary Lua
> module for handling Unicode to LuaForge. It depends on the
> International Components for Unicode (ICU) - see
> <http://icu-project.org/> if you are not familiar with it.
>
> The project files page is here:
> <http://luaforge.net/frs/?group_id=460>
>
> (In order to use the DLL, you will need the ICU 4.0 Win32 binaries
> from <http://icu-project.org/download/4.0.html>)
>
> At this early stage, all that I am trying to do is provide equivalent
> functionality to that provided by the standard 'string' library. In
> the future, I hope to expand it to include more of ICU's own features,
> but for now I think focusing on this goal is sensible enough. Even that
> is still some way off, though: I have not yet completed equivalents for
> any of the matching functions (string.match, string.find, string.gmatch,
> string.gsub) or for formatting (string.format). Hence, this project is
> very much alpha-stage.
>
> All of the functionality is currently in two submodules, 'icu.utf8'
> and 'icu.ustring'. The 'icu' module itself is otherwise empty.
>
> icu.utf8 provides functions that operate on Lua byte-strings on the
> assumption that they contain UTF-8-encoded text:
>        icu.utf8.len
>        icu.utf8.rep
>        icu.utf8.sub
>        icu.utf8.reverse
>        icu.utf8.upper
>        icu.utf8.lower
>        icu.utf8.codepoint      (NOTE: equivalent to string.byte)
>        icu.utf8.char
> All functions take the same parameters as their 'string' library
> counterparts, except upper and lower, which both take an optional
> second parameter specifying the locale. Also included is icu.utf8.bom,
> which is simply a Lua string containing the UTF-8 byte order mark
> (\xEF\xBB\xBF).
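To make the byte versus code point distinction concrete, here is a minimal pure-Lua sketch. It is not how icu.utf8 is implemented: `utf8_len` and `strip_bom` are hypothetical helpers, the input is assumed to be well-formed UTF-8, and decimal escapes are used because Lua 5.1 has no \x escapes.

```lua
-- Hypothetical helpers for illustration; not part of the icu module.

-- Same three bytes as \xEF\xBB\xBF, written with decimal escapes.
local BOM = "\239\187\191"

-- Count code points by counting every byte that is not a UTF-8
-- continuation byte (0x80-0xBF). Assumes well-formed input.
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 128 or b >= 192 then
      n = n + 1
    end
  end
  return n
end

-- Remove a leading BOM if one is present.
local function strip_bom(s)
  if s:sub(1, 3) == BOM then
    return s:sub(4)
  end
  return s
end

local text = strip_bom(BOM .. "caf\195\169")  -- "café" encoded as UTF-8
print(#text, utf8_len(text))                  -- 5 bytes, 4 code points
```

The standard `#` operator and string.len report bytes, which is why a separate len function is needed for UTF-8 text.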
>
> icu.ustring is a more heavyweight approach, providing a userdata-based
> replacement for byte-strings - internally, a ustring is an array of
> 16-bit ICU UChars. To create a ustring, you will need to call
> icu.ustring.decode(), which takes a byte-string as its first
> parameter, and an optional second parameter to specify the encoding,
> e.g. "windows-1252". The default encoding is UTF-8. The equivalent
> function icu.ustring.encode() takes a ustring as its first parameter
> and an optional encoding, and returns the encoded byte-string. Calling
> tostring() on a ustring encodes it with the default encoding.
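A rough usage sketch of the decode/encode round trip described above. The function names and parameters come from the announcement; how the submodule is loaded is an assumption (here, that require("icu.ustring") returns the submodule table), and `roundtrip` is a hypothetical helper. The code skips itself if the module is not installed.

```lua
-- Assumption: require("icu.ustring") returns the submodule table.
local ok, ustring = pcall(require, "icu.ustring")

-- Hypothetical helper: decode a byte-string, then re-encode it.
-- Returns the re-encoded bytes and the default (UTF-8) encoding.
local function roundtrip(us, bytes, encoding)
  local u = us.decode(bytes, encoding)        -- byte-string -> ustring
  return us.encode(u, encoding), tostring(u)  -- ustring -> byte-strings
end

if ok and type(ustring) == "table" then
  local latin = "caf\233"                     -- "café" in windows-1252
  local same, as_utf8 = roundtrip(ustring, latin, "windows-1252")
  print(same == latin, #as_utf8)
else
  print("icu.ustring not available; skipping")
end
```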
>
> All ustrings get interned into a string pool when they are created, so
> two ustrings with the same value (code unit-wise) will already be
> identical when tested for equality. (This is not the same thing as
> Unicode equality - there is no proper collation support yet.)
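A small pure-Lua sketch of the interning idea, and of why code-unit equality is not Unicode equality. This is an illustration with plain tables standing in for userdata, not the module's actual pool; `intern` is a hypothetical name.

```lua
-- Illustration only: a Lua-table stand-in for an interning pool.
-- Weak values let entries be collected once nothing else refers to them.
local pool = setmetatable({}, { __mode = "v" })

-- Return the one shared object for a given sequence of code units.
local function intern(units)
  local key = table.concat(units, ",")
  local obj = pool[key]
  if obj == nil then
    obj = { units = units }
    pool[key] = obj
  end
  return obj
end

-- "café" with a precomposed é (U+00E9), created twice.
local precomposed1 = intern({ 0x63, 0x61, 0x66, 0xE9 })
local precomposed2 = intern({ 0x63, 0x61, 0x66, 0xE9 })
-- "café" as e followed by U+0301 (combining acute accent).
local decomposed = intern({ 0x63, 0x61, 0x66, 0x65, 0x301 })

print(precomposed1 == precomposed2)  -- true: same code units, same object
print(precomposed1 == decomposed)    -- false: same text, different code units
```

The second comparison is the case the announcement warns about: both sequences display as the same text, but they only compare equal after normalization or collation.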
>
> Once you have created a ustring, you can use on it the same subset of
> string library functions that icu.utf8 provides:
>        icu.ustring.len
>        icu.ustring.rep
>        icu.ustring.sub
>        icu.ustring.reverse
>        icu.ustring.upper
>        icu.ustring.lower
>        icu.ustring.codepoint
>        icu.ustring.char
> ustrings can also be concatenated with each other, and the length
> operator # can be used instead of icu.ustring.len. In addition, you
> can use any of the ustring functions as methods on a ustring instance.
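A short usage sketch of the operators and method syntax described above. As before, the way the submodule is loaded is an assumption, `demo` is a hypothetical name, and the code skips itself when the module is not installed.

```lua
-- Assumption: require("icu.ustring") returns the submodule table.
local ok, ustring = pcall(require, "icu.ustring")

local function demo(us)
  local hello = us.decode("hello")
  local world = us.decode(" world")
  local both = hello .. world        -- concatenating two ustrings
  print(#both == us.len(both))       -- # and len should agree
  print(tostring(both:upper()))      -- library function called as a method
  print(both:sub(1, 5) == hello)     -- interning should make these identical
end

if ok and type(ustring) == "table" then
  demo(ustring)
else
  print("icu.ustring not available; skipping")
end
```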
>
> I would greatly appreciate any feedback about this, and would love to
> hear from anyone who has suggestions about where it should ideally go
> next (well, after matching and formatting are added in) or finds any
> problems with it, including subtle differences between how these
> functions work and the standard string module's. The binaries are
> Win32-centric but I don't believe the actual source code is - if
> anyone would like to look into making sure it works on other platforms
> I would be grateful for that.
>
>
> Thanks for reading,
>
> -Duncan
>



-- 
Bertrand Mansion
Mamasam