hi


On 29.06.2011 at 23:06, Edgar Toernig <froese@gmx.de> wrote:
> There's already a UTF-8 version of the string library called slnunicode.
> Iirc, it uses the tables from Tcl which are about 13k.  The whole library
> is about 32k.
BTW, that covers only the BMP, but I guess there is not much need
for a fairy-tale character class or uppercased unicorns.
A gsub('<pile of poo>+', '<rainbow>') should work, though.

>    unicode.utf8.gsub(s, "[\uf000-\uffff]", "?")

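The magic u function is basically itself a utf8.gsub with the pattern
%\(%x+) and utf8.char as the replacement (plus a hex-to-number
conversion of the captured digits).
Note that it replaces \xxxx, not \uxxxx.
You have to write \\ (or use [[ ]]) to get a \ in a Lua string, so it
would be u"\\f000-\\ffff".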
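
To make the idea concrete, here is a minimal sketch of such a u function in plain Lua. It is not slnunicode's code: utf8_char is a hypothetical stand-in for unicode.utf8.char, written out here so the example runs without the library, and plain string.gsub is enough because the pattern only matches ASCII characters.

```lua
-- Hypothetical stand-in for utf8.char: encode one code point as UTF-8.
local function utf8_char(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- u: replace every \xxxx (a backslash followed by hex digits) with the
-- UTF-8 encoding of that code point. In the pattern, %\ is a literal
-- backslash and (%x+) captures the hex digits.
local function u(s)
  return (s:gsub("%\\(%x+)", function(hex)
    return utf8_char(tonumber(hex, 16))
  end))
end

-- The backslashes are doubled so that they survive the Lua string literal.
local class = u("[\\f000-\\ffff]")
```

One caveat of this sketch: %x+ is greedy, so any hex digit directly after the code value is swallowed too (u"\\e9cole" would read the code value as e9c).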

Anyway, as I am struggling to port slnunicode to the latest 5.2,
I think I may add such a function in C and, in particular,
apply it to all patterns automatically,
since you should escape a literal \ as %\ in patterns anyway.
Thoughts?
I guess that, besides patterns, there is not much need to specify
characters by code values.


And BTW, Unicode is not 100% locale independent, even in the basic
features as provided by string.
upper('i') has to yield a dotted capital I (İ, U+0130) in the Turkish locale.
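
As a rough illustration of why upper cannot be locale independent, here is a sketch of an upper function that takes a language tag. The function name and the "tr" tag are assumptions made up for this example, not slnunicode's API, and it only handles ASCII letters plus the two Turkish special cases on UTF-8 input.

```lua
-- Hypothetical locale-aware upper for UTF-8 strings.
-- "\196\176" is U+0130 (İ, capital I with dot above),
-- "\196\177" is U+0131 (ı, small dotless i).
local function upper(s, lang)
  if lang == "tr" then
    s = s:gsub("\196\177", "I")   -- dotless ı -> plain I
    s = s:gsub("i", "\196\176")   -- dotted i  -> dotted İ
  end
  -- Uppercase the remaining ASCII letters only, so that the result does
  -- not depend on the C library's locale.
  return (s:gsub("[a-z]", string.upper))
end

local default = upper("istanbul")        -- "ISTANBUL"
local turkish = upper("istanbul", "tr")  -- "İSTANBUL" as UTF-8 bytes
```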

Collation is highly locale dependent, but that's another story.
I'll probably just expose libc's strcoll/strxfrm.
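
For what it's worth, stock Lua already compares strings with strcoll when you use <, so the collate locale changes sort order even without extra bindings. A small sketch follows; sorted_in is a hypothetical helper, and whether any locale other than "C" is available depends on the system.

```lua
-- Return a sorted copy of list under the given collate locale,
-- or nil if that locale is not available on this system.
local function sorted_in(locale, list)
  local previous = os.setlocale(nil, "collate")  -- query current setting
  if not os.setlocale(locale, "collate") then
    return nil
  end
  local copy = {}
  for i, v in ipairs(list) do copy[i] = v end
  table.sort(copy)  -- default comparison is <, which uses strcoll
  os.setlocale(previous, "collate")              -- restore
  return copy
end

local words = { "zebra", "Apple", "apple", "Zebra" }

-- The "C" locale always exists and sorts by byte value.
local c_order = sorted_in("C", words)

-- The locale name below is an assumption; it yields nil if not installed.
local de_order = sorted_in("de_DE.UTF-8", words)
```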


best
Klaus