On 11 July 2018 at 03:43, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> There is an obvious analogy between codons and characters, already
> exploited in the names of the functions utf8.char and utf8.len. The
> analogy defines what the (presently non-existent) functions utf8.find,
> utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
> utf8.gmatch should mean.

A few years ago I went through the exercise of reworking the core
pattern matching function of the Lua string library (which powers
string.match, string.gsub, string.gmatch) to work on UTF-8 codepoints
instead of bytes. My goal was to see whether the addition was small
enough to have a shot at being proposed for inclusion in the library.
If memory serves, I did get it working. The patch is attached.
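
For readers unfamiliar with the byte/codepoint distinction, here is a
minimal sketch using only stock Lua 5.3 (or later). It is not the attached
patch; it only contrasts the byte-oriented "." with the standard
utf8.charpattern, and the variable names are made up for illustration.

```lua
-- Sketch only: stock Lua 5.3 or later, not the attached patch.
-- "Hélène" written with \u escapes so the file encoding does not matter.
local s = "H\u{E9}l\u{E8}ne"

-- Byte-oriented: "." matches a single byte, so this captures only the
-- first byte of the two-byte UTF-8 encoding of "é".
local byte_match = s:match("H(.)")

-- Codepoint-oriented: utf8.charpattern matches one whole UTF-8 sequence.
local cp_match = s:match("H(" .. utf8.charpattern .. ")")

print(#s, utf8.len(s))         -- byte length vs. codepoint count: 8 and 6
print(#byte_match, #cp_match)  -- 1 and 2
```

A codepoint-based matcher would, in effect, make "." behave like
utf8.charpattern does here.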

In the end, what turned me off the idea was that the predefined
character classes such as %a and %d would be either unavailable or
misleading/incompatible. They couldn't be Unicode-based, because we're
not supporting Unicode (just the UTF-8 encoding), but they aren't
ASCII-based either, because the ones in string.match are affected by
setlocale.

Ultimately, the problem is this: you would expect
utf8.match("name: Hélène123", "name: %a*%d+") to work (subject first,
then pattern, as in string.match), but that doesn't seem feasible
without adding Unicode knowledge.
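
The byte-based half of this can be checked with stock Lua 5.3 or later.
The sketch below uses the existing string.match (utf8.match does not
exist) and assumes the "C" locale, which it sets explicitly; under another
locale %a may behave differently, which is exactly the setlocale
dependence mentioned above.

```lua
-- Sketch only: stock Lua 5.3 or later; utf8.match does not exist.
os.setlocale("C")  -- pin the locale so that %a means ASCII letters only

local subject = "name: H\u{E9}l\u{E8}ne123"  -- "name: Hélène123"
local pattern = "name: %a*%d+"

-- %a* stops at the first byte of "é", and %d+ then finds no digit there,
-- so the byte-based matcher reports no match.
local result = string.match(subject, pattern)
print(result)        --> nil

-- An ASCII-only name matches as expected.
local ascii_result = string.match("name: Helene123", pattern)
print(ascii_result)  --> name: Helene123
```

Decoding codepoints alone would not change the first result: deciding
that "é" is a letter requires Unicode character property data, not just
knowledge of the UTF-8 encoding.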

-- Hisham

Attachment: lua-5.3.0-work2-utf8patterns.patch
Description: Binary data