- Subject: Re: The Lua utf8 library (Was: Issues: Character 160 ...)
- From: Dirk Laurie <dirk.laurie@...>
- Date: Wed, 11 Jul 2018 16:12:14 +0200
2018-07-11 15:10 GMT+02:00 Hisham <firstname.lastname@example.org>:
> On 11 July 2018 at 03:43, Dirk Laurie <email@example.com> wrote:
>> There is an obvious analogy between codons and characters, already
>> exploited in the names of the functions utf8.char and utf8.len. The
>> analogy defines what the (presently non-existent) functions utf8.find,
>> utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
>> utf8.gmatch should mean.
> A few years ago I went through the exercise of reworking the core
> pattern matching function of the Lua string library (which powers
> string.match, string.gsub, string.gmatch) to work on UTF-8 codepoints
> instead of bytes. My goal was to see if it was a small enough addition
> to have a shot at being asked for inclusion in the library. I believe
> I did get it working, if my memory serves me right. Patch follows.
> In the end, what turned me off about the idea was that the predefined
> character classes such as %a and %d would be either unavailable or
> misleading/incompatible — they couldn't be Unicode-based because we're
> not supporting Unicode (just UTF-8), but they aren't ASCII-based
> either, because the ones in string.match are affected by setlocale.
> Ultimately, the problem is: you would expect
> utf8.match("name: Hélène123", "name: %a*%d+") to work, but that
> doesn't seem feasible to do without adding Unicode knowledge.
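
To make the difficulty concrete, here is what the existing byte-oriented string.match does with that example. This is only a sketch: it assumes Lua 5.3 or later and the default "C" locale, and the variable names are mine.

```lua
-- Lua 5.3+; the stock interpreter starts in the "C" locale unless
-- os.setlocale is called, so %a covers ASCII letters only.
local subject = "name: H\u{E9}l\u{E8}ne123"  -- "name: Hélène123" as UTF-8

-- %a* stops in front of the first byte of "é"; %d+ cannot match there.
local ascii_only = string.match(subject, "name: %a*%d+")

-- Crude workaround: count every byte >= 0x80 as a letter. This accepts any
-- non-ASCII codepoint, letter or not, so it is not real Unicode support.
local bytes_as_letters = string.match(subject, "name: [%a\x80-\xff]*%d+")

print(ascii_only, bytes_as_letters)
```

The first match fails and the second succeeds only because it gives up on knowing what a letter is, which is exactly the gap a Unicode-aware %a would have to fill.
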
I would not like to tamper with existing classes, but one could introduce
definable character classes, almost like having a metatable for patterns.
Suppose we had this function:
lc is a character class not currently defined, e.g. "%y";
test(str) is a function that returns the matching substring if str
starts with a substring of that class, and nil otherwise.
Then the user can add whatever Unicode knowledge is needed.
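
A minimal sketch of how this could look, assuming Lua 5.3 or later. Everything named here is hypothetical: defclass, span and the classes table are illustration only, not part of any existing library, and the "%y" test knows just three accented letters.

```lua
-- Hypothetical registry: class name (e.g. "%y") -> user-supplied test function.
local classes = {}

-- Hypothetical registration function; name and signature are assumptions.
local function defclass(lc, test)
  assert(#lc == 2 and lc:sub(1, 1) == "%", "class must look like '%y'")
  classes[lc] = test
end

-- Consume the longest run of substrings accepted by class lc, starting at
-- byte position init. Returns the run (possibly empty) and the next position.
local function span(s, lc, init)
  local test = assert(classes[lc], "undefined class " .. lc)
  local start = init or 1
  local pos = start
  while pos <= #s do
    local piece = test(s:sub(pos))
    if not piece or piece == "" then break end
    pos = pos + #piece
  end
  return s:sub(start, pos - 1), pos
end

-- Example class "%y": ASCII letters plus a few accented Latin letters.
-- A real test would consult whatever Unicode data the user cares about.
local accented = {
  ["\u{E9}"] = true,  -- é
  ["\u{E8}"] = true,  -- è
  ["\u{C9}"] = true,  -- É
}
defclass("%y", function(str)
  -- utf8.charpattern matches exactly one UTF-8 byte sequence.
  local ch = str:match("^" .. utf8.charpattern)
  if ch and (ch:match("^%a$") or accented[ch]) then return ch end
  return nil
end)

local run, nextpos = span("H\u{E9}l\u{E8}ne123", "%y")
print(run, nextpos)
```

The point of the design is that the matcher never needs to know what "%y" means; it only calls test and advances by the length of whatever comes back, so a class may match more than one codepoint at a time.
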