On 11 July 2018 at 03:43, Dirk Laurie <> wrote:
> There is an obvious analogy between codons and characters, already
> exploited in the names of the functions utf8.char and utf8.len. The
> analogy defines what the (presently non-existent) functions utf8.find,
> utf8.sub, utf8.match, utf8.reverse, utf8.rep, utf8.gsub and
> utf8.gmatch should mean.

A few years ago I went through the exercise of reworking the core
pattern-matching function of the Lua string library (which powers
string.match, string.gsub and string.gmatch) to work on UTF-8
codepoints instead of bytes. My goal was to see whether the change was
small enough to have a shot at being proposed for inclusion in the
library. If memory serves, I did get it working. The patch is attached.
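
For illustration only, here is a minimal sketch of the byte/codepoint
distinction using just the stock Lua 5.3 library (it is not taken from
the patch, and the variable names are made up for the example):

```lua
-- "é" is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9.
local s = "\u{E9}t\u{E9}"   -- "été", written with escapes to avoid
                            -- depending on the source file encoding

-- Byte-based matching: "." consumes a single byte, so it cuts the
-- first character in half.
local first_byte = string.match(s, ".")
print(#first_byte)                           -- 1

-- utf8.charpattern matches exactly one UTF-8 byte sequence
-- (assuming the subject is valid UTF-8).
local first_char = string.match(s, utf8.charpattern)
print(#first_char, first_char == "\u{E9}")   -- 2   true

-- Length in bytes versus length in codepoints.
print(#s, utf8.len(s))                       -- 5   3
```

A codepoint-aware matcher would, roughly speaking, make "." and the
quantifiers behave like the utf8.charpattern case above.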

In the end, what turned me off the idea was that the predefined
character classes such as %a and %d would be either unavailable or
misleading/incompatible. They couldn't be Unicode-based, because we're
not supporting Unicode (just the UTF-8 encoding), but they aren't
ASCII-based either, because the ones in string.match are affected by
setlocale.

Ultimately, the problem is this: you would expect
utf8.match("name: Hélène123", "name: %a*%d+") to work, but that doesn't
seem feasible without adding Unicode knowledge.
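
As a rough demonstration of why, here is what the existing byte-based
string.match does with that example. This is a sketch under the
assumption that the "C" locale is in effect (set explicitly below), so
that %a covers ASCII letters only; under other locales the result may
differ:

```lua
-- Pin the locale so that %a is ASCII-only and the result is reproducible.
os.setlocale("C")

local subject = "name: H\u{E9}l\u{E8}ne123"   -- "name: Hélène123"
local pattern = "name: %a*%d+"

-- %a* stops at the first byte of "é" (0xC3), which is not a letter in
-- the "C" locale, and %d+ then finds no digit, so the match fails.
print(string.match(subject, pattern))            -- nil

-- The same pattern succeeds on an ASCII-only name.
print(string.match("name: Helene123", pattern))  -- name: Helene123
```

Matching per codepoint would let "é" be consumed as one unit, but
deciding that it is a letter still needs Unicode character data.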

-- Hisham

Attachment: lua-5.3.0-work2-utf8patterns.patch
Description: Binary data