- Subject: Re: Plea for the support of unicode escape sequences
- From: Klaus Ripke <paul-lua@...>
- Date: Sun, 3 Jul 2011 14:33:19 +0200
hi
On 29.06.2011 at 23:06, Edgar Toernig <froese@gmx.de> wrote:
> There's already a UTF-8 version of the string library called slnunicode.
> Iirc, it uses the tables from Tcl which are about 13k. The whole library
> is about 32k.
BTW, that covers only the BMP (code points up to U+FFFF), but I guess there
is not much need for a fairy-tale character class or uppercased unicorns.
A gsub('<pile of poo>+', '<rainbow>') should still work, though.
> unicode.utf8.gsub(s, "[\uf000-\uffff]", "?")
The magic u function is basically itself a utf8.gsub(s, '%\\(%x+)', ...)
whose replacement turns each hex code into a character via utf8.char.
Note that it replaces \xxxx, not \uxxxx.
You have to write \\ (or use [[ ]]) to get a literal \, so it would be
u"[\\f000-\\ffff]".
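To make that concrete, here is a minimal sketch of such a u function in plain Lua. It is an illustration only, not the slnunicode code: utf8_char is a hand-rolled stand-in for unicode.utf8.char so that the snippet runs without the library, and the name u and its exact behaviour are assumptions.

```lua
-- Stand-in for unicode.utf8.char: encode one code point as UTF-8.
-- No validation (surrogates, values above 0x10FFFF) is attempted.
local function utf8_char(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Hypothetical u(): replace every backslash followed by hex digits
-- with the UTF-8 encoding of that code point.
local function u(s)
  return (s:gsub("%\\(%x+)", function(hex)
    return utf8_char(tonumber(hex, 16))
  end))
end

-- The range from the quoted example, written with doubled backslashes.
local pattern = u("[\\f000-\\ffff]")
print(#pattern)  -- 9: two brackets, one hyphen, two 3-byte characters
```

Because %x+ is greedy, a hex digit directly after an escape gets swallowed into the code value ("\\f000a" is read as U+F000A), so a real version would probably want a fixed width or a delimiter.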
Anyway, as I am struggling to port slnunicode to the latest 5.2,
I think I may add such a function in C and, in particular,
apply it to all patterns automatically.
Since you should escape a literal \ as %\ in patterns anyway,
a bare \xxxx would not clash with anything.
Thoughts?
I guess besides patterns there is not much need to specify characters
by code values.
And BTW, Unicode is not 100% locale-independent, even in the basic
features that string provides.
upper('i') has to yield İ (capital I with dot above, U+0130) in the Turkish locale.
Collation is highly locale-dependent, but that's another story.
I'll probably just expose libc's strcoll/strxfrm.
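For what it's worth, stock Lua already compares strings through strcoll, so the locale dependence can be seen without any library. A small sketch follows; the locale name "en_US.UTF-8" is only an assumption and may not be installed on a given system.

```lua
-- Lua's < on strings goes through strcoll, so sort order follows LC_COLLATE.
os.setlocale("C", "collate")            -- plain byte order
local words = { "banana", "Cherry", "apple" }
table.sort(words)
print(table.concat(words, " "))         -- Cherry apple banana

-- Assumed locale name; os.setlocale returns nil if it is not available.
if os.setlocale("en_US.UTF-8", "collate") then
  table.sort(words)
  print(table.concat(words, " "))       -- order depends on the platform's tables
end
```

strxfrm has no counterpart in the standard library, so that part would still have to be exposed from C.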
best
Klaus