Recent threads here on lua-l and discussion on Twitter about the
necessity of including UTF-8 support in core Lua (as opposed to a
library) got me thinking about how hard it would be to get proper
UTF-8 support in Lua patterns.

The idea is to avoid things like this:

Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio
> print( ("páscoa"):match("[é]") )
Ã
> print( ("páscoa"):match("[^é]*$") )
¡scoa
> print( ("época"):match("[á-ú].") )
é
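
The results above follow from the pattern engine working on bytes rather than characters. A minimal sketch, assuming a stock (unpatched) Lua 5.1 or later; the `hex` helper and the variable names are made up for this illustration:

```lua
-- Bytes are written as decimal escapes so the example does not depend on
-- the encoding of this source file.
local e_acute = "\195\169"              -- "é" is C3 A9 in UTF-8
local a_acute = "\195\161"              -- "á" is C3 A1
local u_acute = "\195\186"              -- "ú" is C3 BA
local pascoa  = "p" .. a_acute .. "scoa"
local epoca   = e_acute .. "poca"

-- Hypothetical helper: renders a string as space-separated hex bytes.
local function hex(s)
  local out = {}
  for i = 1, #s do out[#out + 1] = string.format("%02X", s:byte(i)) end
  return table.concat(out, " ")
end

-- "[é]" is really the byte set {C3, A9}, so it hits the lead byte of "á".
local m1 = pascoa:match("[" .. e_acute .. "]")
print(hex(m1))   --> C3

-- "[^é]" excludes only C3 and A9, so the match starts at A1, the second
-- byte of "á", and the result begins in the middle of a character.
local m2 = pascoa:match("[^" .. e_acute .. "]*$")
print(hex(m2))   --> A1 73 63 6F 61

-- "[á-ú]" parses as C3, the byte range A1-C3, and BA; "." then takes a
-- single byte, so only the two bytes of "é" are returned.
local m3 = epoca:match("[" .. a_acute .. "-" .. u_acute .. "].")
print(hex(m3))   --> C3 A9
```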

To get these things to work we need more than the utf8.charpatt that
Lua 5.3 provides (utf8.charpatt can only match one character; we can't
even use "*" with it).
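
To illustrate the gap, here is a rough sketch of what emulating "[^é]*$" takes when all we have is a one-character pattern. The `charpatt` string below is a hand-written stand-in for utf8.charpatt (an assumption, not copied from Lua's source), and `chars` and `suffix_without` are hypothetical helpers:

```lua
-- Stand-in for utf8.charpatt: one lead byte followed by any continuation
-- bytes. NUL is left out so the pattern also works on Lua 5.1.
local charpatt = "[\1-\127\194-\244][\128-\191]*"

-- Splits a string into a table of UTF-8 characters.
local function chars(s)
  local t = {}
  for ch in s:gmatch(charpatt) do t[#t + 1] = ch end
  return t
end

-- Emulates s:match("[^X]*$") for a single UTF-8 character X by walking
-- backwards over whole characters instead of bytes.
local function suffix_without(s, x)
  local t = chars(s)
  local first = #t + 1
  while first > 1 and t[first - 1] ~= x do first = first - 1 end
  return table.concat(t, "", first)
end

local e_acute = "\195\169"                -- "é"
local pascoa  = "p\195\161scoa"           -- "páscoa"
print(suffix_without(pascoa, e_acute) == pascoa)      --> true
print(suffix_without(e_acute .. "poca", e_acute))     --> poca
```

Every quantifier or set would need a hand-written loop like this one, which is the motivation for teaching the pattern engine itself about UTF-8.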

So I went to see how little one would have to change in the pattern
engine to get it to work.

I whipped up this proof-of-concept patch:

http://hisham.hm/tmp/lua-5.3.0-work2-utf8patterns.patch
or https://gist.github.com/hishamhm/10814558

I didn't want to rewrite the pattern engine; I wanted to keep the code
as similar to the original as possible. Another goal was to have it
still work in both modes, while keeping the code small.

I didn't bother making the patch feature-complete because it was more
of an exploratory proof-of-concept (and the Lua team doesn't take
patches anyway), which means it is barely tested and there are some
major caveats:

* while I wrote the code to make the engine work in both modes, it is
now unconditionally set to UTF-8 (i.e., I didn't make changes in the
string.* Lua functions to add an extra argument to set UTF-8 mode).
* the current code works for UTF-8 characters up to 4 bytes long; IIRC
UTF-8 sequences could be up to 6 bytes long in the original definition
(RFC 3629 limits valid sequences to 4 bytes); this is easily
expandable by changing the variables used to store characters to
64-bit integers, or by rewriting the comparison code in a few places.
* I didn't consider or measure performance. Whatever the results are,
they could certainly be improved if I allowed myself to stray further
from the original code.
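
To make the 4-byte caveat concrete, here is a sketch of the general idea of holding one UTF-8 character in a single integer so that set ranges can be compared numerically. This is written in Lua for illustration and is not code from the patch (which modifies the C pattern engine); `seqlen`, `pack_char` and `in_range` are hypothetical names, and non-empty input strings are assumed:

```lua
-- Length of a UTF-8 sequence, judged from its lead byte (1 to 4 bytes).
local function seqlen(lead)
  if lead < 0x80 then return 1
  elseif lead >= 0xC2 and lead <= 0xDF then return 2
  elseif lead >= 0xE0 and lead <= 0xEF then return 3
  elseif lead >= 0xF0 and lead <= 0xF4 then return 4
  end
  return 1  -- invalid lead byte: fall back to treating it as one byte
end

-- Packs the bytes of the character starting at position i into one
-- number. Four bytes fit in 32 bits; longer sequences would need a wider
-- integer, which is the limitation described above.
local function pack_char(s, i)
  local n = seqlen(s:byte(i))
  local v = 0
  for k = i, math.min(i + n - 1, #s) do
    v = v * 256 + s:byte(k)
  end
  return v, n
end

-- Range test for a set such as "[á-ú]": packed values sort in the same
-- order as code points, so a plain numeric comparison is enough.
local function in_range(ch, lo, hi)
  local c = pack_char(ch, 1)
  return pack_char(lo, 1) <= c and c <= pack_char(hi, 1)
end

print(in_range("\195\169", "\195\161", "\195\186"))  --> true  (é in á-ú)
print(in_range("p", "\195\161", "\195\186"))         --> false
```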

A design limitation that's not really an implementation caveat is that
"%" classes (%a, etc.) do not work in UTF-8 mode because these map to
Unicode concepts which are beyond the scope of Lua 5.3.

Still, it was nice to see patterns producing UTF-8-correct results:

Lua 5.3.0 (work2)  Copyright (C) 1994-2014 Lua.org, PUC-Rio
>  print( ("páscoa"):match("[é]") )
nil
>  print( ("páscoa"):match("[^é]*$") )
páscoa
> ("época"):match("[á-ú].")
ép

Again, this was barely tested so don't use it for real work; the goal
was to stir conversation about the value and feasibility of having
UTF-8 patterns in Lua 5.3.

-- Hisham