2018-07-11 15:10 GMT+02:00 Hisham <h@hisham.hm>:
> On 11 July 2018 at 03:43, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>> There is an obvious analogy between codons and characters, already
>> exploited in the names of the functions utf8.char and utf8.len. The
>> analogy defines what the (presently non-existent) functions utf8.find,
>> utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
>> utf8.gmatch should mean.
>
> A few years ago I went through the exercise of reworking the core
> pattern matching function of the Lua string library (which powers
> string.match, string.gsub, string.gmatch) to work on UTF-8 codepoints
> instead of bytes. My goal was to see if it was a small enough addition
> to have a shot at being asked for inclusion in the library. I believe
> I did get it working, if my memory serves me right. Patch follows
> attached.
>
> In the end, what turned me off about the idea was that the predefined
> character classes such as %a and %d would be either unavailable or
> misleading/incompatible — they couldn't be Unicode-based because we're
> not supporting Unicode (just UTF-8), but they aren't ASCII-based
> either, because the ones in string.match are affected by setlocale.
>
> Ultimately, the problem is: you would expect
> utf8.match("name: Hélène123", "name: %a*%d+") to work, but that
> doesn't seem feasible to do without adding Unicode knowledge.

I would not like to tamper with existing classes, but one could introduce
definable character classes, almost like having a metatable for patterns.

Suppose we had this function:

string.class(lc, test)
  lc is a character class not currently defined, e.g. "%y"
  test(str) is a function that returns the matching substring if str
     starts with a substring of that class, otherwise nil

Then the user can add whatever Unicode knowledge is needed
without clutter.
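
To make the idea concrete, here is a rough sketch in plain Lua 5.3. Nothing
in it exists in the standard library: define_class stands in for the
proposed string.class, and match_prefix is a hypothetical toy matcher
(anchored at the start, literals and user classes only, greedy "*" and "+",
no backtracking) that merely shows where the test function would be called.
The "%y" test is a crude placeholder for real Unicode knowledge: it accepts
any ASCII letter or any multi-byte UTF-8 sequence.

```lua
-- Hypothetical sketch; not part of the Lua standard library.
local classes = {}            -- maps a class name such as "%y" to its test

-- Stand-in for the proposed string.class(lc, test).
local function define_class(lc, test)
  assert(type(lc) == "string" and lc:match("^%%%a$"), "expected e.g. '%y'")
  assert(type(test) == "function", "test must be a function")
  classes[lc] = test
end

-- Toy matcher: returns the prefix of s matched by pattern, or nil.
local function match_prefix(s, pattern)
  local pos, i = 1, 1         -- byte positions in s and in pattern
  while i <= #pattern do
    local item = pattern:sub(i, i)
    if item == "%" then item = pattern:sub(i, i + 1) end
    i = i + #item
    local quant = pattern:sub(i, i)
    if quant == "*" or quant == "+" then i = i + 1 else quant = "" end

    local test = classes[item]
    if item:sub(1, 1) == "%" and not test then
      error("undefined class " .. item)
    end
    local count = 0
    while true do
      local rest = s:sub(pos)
      local piece
      if test then
        piece = test(rest)                  -- ask the user-defined class
      elseif rest:sub(1, 1) == item then
        piece = item                        -- single-byte literal
      end
      if not piece or piece == "" then break end
      pos = pos + #piece
      count = count + 1
      if quant == "" then break end
    end
    if count == 0 and quant ~= "*" then return nil end
  end
  return s:sub(1, pos - 1)
end

-- Placeholder "letter" class: one ASCII letter or one multi-byte sequence.
define_class("%y", function(str)
  local ch = str:match("^" .. utf8.charpattern)
  if ch and (#ch > 1 or ch:match("^%a$")) then return ch end
  return nil
end)

print(match_prefix("name: Hélène123", "name: %y*"))  --> name: Hélène
```

A real implementation would live in the C pattern matcher and would have to
decide how such classes interact with sets like [%y_] and with
backtracking; the sketch sidesteps both.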