David Kolf wrote:
> I wonder how compact you can store the character classes for the 65k
> codepoints in the BMP and the lowercase/uppercase pairs (for
> string.lower, string.upper).

There's already a UTF-8 version of the string library called slnunicode.
IIRC, it uses the tables from Tcl, which are about 13k.  The whole library
is about 32k.

With it and the Unicode escape sequences you could write, e.g.,

    unicode.utf8.gsub(s, "[\uf000-\uffff]", "?")

With Lua 5.1 that would be

    unicode.utf8.gsub(s, "[\239\128\128-\239\191\191]", "?")

Now, which one is cleaner?
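
For reference, here is a rough sketch of how the decimal escapes in the
Lua 5.1 variant can be derived from the code points.  The helper names
(utf8_bytes, decimal_escapes) are made up for illustration and are not part
of slnunicode or of Lua itself; the sketch only covers the BMP and does not
reject surrogates.

```lua
-- Encode a BMP code point (U+0000..U+FFFF) as a list of UTF-8 byte values.
-- Hypothetical helper; uses only arithmetic so it also runs on Lua 5.1.
local function utf8_bytes(cp)
  if cp < 0x80 then
    return { cp }                                   -- 1 byte: 0xxxxxxx
  elseif cp < 0x800 then
    return { 0xC0 + math.floor(cp / 0x40),          -- 2 bytes: 110xxxxx
             0x80 + cp % 0x40 }                     --          10xxxxxx
  else
    return { 0xE0 + math.floor(cp / 0x1000),        -- 3 bytes: 1110xxxx
             0x80 + math.floor(cp / 0x40) % 0x40,   --          10xxxxxx
             0x80 + cp % 0x40 }                     --          10xxxxxx
  end
end

-- Render the bytes as Lua decimal escapes, e.g. for pasting into a pattern.
local function decimal_escapes(cp)
  local parts = {}
  for i, b in ipairs(utf8_bytes(cp)) do
    parts[i] = string.format("\\%d", b)
  end
  return table.concat(parts)
end

print(decimal_escapes(0xF000))  --> \239\128\128
print(decimal_escapes(0xFFFF))  --> \239\191\191
```

Those are the two endpoints of the range in the 5.1 pattern above.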

> Maybe that can be compressed far enough to be included in official Lua
> (5.3?). That would be great.

I think that's not really necessary.  You need both versions anyway: the
simple byte-oriented variant to parse and match arbitrary byte sequences
(incl. binary data) and the UTF-8 version for Unicode character strings.
An external library would be good enough.  But you want the escape sequences
to make the external library a pleasure to use (see above).
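
To illustrate why both views are needed, here is a small sketch in plain
Lua (no external library).  The function name utf8_len is made up for the
example, and it assumes the input is well-formed UTF-8; it simply counts
every byte that is not a continuation byte.

```lua
-- "ä" (U+00E4) is the two bytes 195 164 in UTF-8, followed by ASCII "b".
local s = "\195\164b"

-- Byte-oriented view: what the standard string library sees.
local nbytes = #s

-- Character-oriented view (hypothetical helper): count the bytes that are
-- not UTF-8 continuation bytes (128..191).  Assumes well-formed UTF-8.
local function utf8_len(str)
  local _, count = string.gsub(str, "[^\128-\191]", "")
  return count
end

print(nbytes)        --> 3
print(utf8_len(s))   --> 2
```

For binary data only the byte count makes sense; for text you usually want
the character count.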

Ciao, ET.