- Subject: UTF-8 patterns in Lua 5.3
- From: Hisham <h@...>
- Date: Wed, 16 Apr 2014 03:09:23 -0300
Recent threads here on lua-l and discussion on Twitter about the
necessity of including UTF-8 support in core Lua (as opposed to a
library) got me thinking about how hard it would be to get proper
UTF-8 support in Lua patterns.
The idea is to avoid things like this:
Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio
> print( ("páscoa"):match("[é]") )
Ã
> print( ("páscoa"):match("[^é]*$") )
¡scoa
> print( ("época"):match("[á-ú].") )
é
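For anyone who hasn't looked at the bytes, here is a minimal sketch of
why this happens. It assumes the strings are UTF-8 encoded; decimal
escapes are used so the file encoding doesn't matter, and hexbytes is
just a helper name made up for illustration.

```lua
-- "á" is U+00E1 and "é" is U+00E9; in UTF-8 both take two bytes.
local a_acute = "\195\161"   -- bytes C3 A1
local e_acute = "\195\169"   -- bytes C3 A9

-- Illustrative helper: hex dump of a string's bytes.
local function hexbytes(s)
  return (s:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end

print(hexbytes(a_acute))  -- C3 A1
print(hexbytes(e_acute))  -- C3 A9

-- The stock engine reads "[é]" as the byte set {C3, A9}, so it matches
-- the lone lead byte C3 of "á" in "páscoa".
local subject = "p" .. a_acute .. "scoa"
local hit = subject:match("[" .. e_acute .. "]")
print(#hit, hit:byte())   -- 1   195
```

The single byte C3 on its own is what the terminal rendered as "Ã" above.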
To get these things to work we need more than the utf8.charpatt that
Lua 5.3 provides (utf8.charpatt can only match one character; we can't
even use "*" with it).
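To illustrate, this is roughly what one has to write by hand today to
emulate "[^é]*$". It is only a sketch: the pattern is written out
literally and is assumed to match what utf8.charpatt holds (it is the
value the Lua 5.3 manual documents for the single-character pattern),
it needs Lua 5.2 or later for the "\x" escapes, and suffix_without is
a hypothetical helper name.

```lua
-- Assumed equivalent of the Lua 5.3 single-character pattern
-- (requires Lua 5.2+ for "\x" escapes and "\0" inside patterns).
local charpatt = "[\0-\x7F\xC2-\xF4][\x80-\xBF]*"

-- Hypothetical helper: emulate "[^X]*$" for one UTF-8 character X by
-- walking the subject character by character.
local function suffix_without(subject, ch)
  local start = 1                          -- byte index where the suffix begins
  for pos, c in subject:gmatch("()(" .. charpatt .. ")") do
    if c == ch then start = pos + #c end   -- restart just after each X
  end
  return subject:sub(start)
end

local e_acute = "\xC3\xA9"             -- "é"
local pascoa  = "p\xC3\xA1scoa"        -- "páscoa"
print(suffix_without(pascoa, e_acute) == pascoa)          -- true
print(suffix_without("caf" .. e_acute .. "s", e_acute))   -- s
```

A loop like this is needed for every quantifier or set, which is what
makes the one-character pattern awkward as a substitute.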
So I went to see how little one would have to change in the pattern
engine to get it to work.
I whipped up this proof-of-concept patch:
http://hisham.hm/tmp/lua-5.3.0-work2-utf8patterns.patch
or https://gist.github.com/hishamhm/10814558
I didn't want to rewrite the pattern engine; I wanted to keep the code
as similar to the original as possible. Another goal was to have it
still work in both modes, while keeping the code small.
I didn't bother to make the patch feature-complete because it was more
of an exploratory proof-of-concept (and the Lua team doesn't take patches
anyway), which means it was barely tested and there are some major
caveats:
* while I wrote the code to make the engine work in both modes, it is
now unconditionally set to UTF-8 (i.e., I didn't make changes in the
string.* Lua functions to add an extra argument to set UTF-8 mode).
* the current code works for UTF-8 characters up to 4 bytes long; IIRC
UTF-8 sequences were originally defined up to 6 bytes long (RFC 3629
later capped them at 4). This is easily expandable by changing the
variables used to store characters to 64-bit integers, or by rewriting
the comparison code in a few places.
* I didn't make considerations or measurements wrt performance.
Whatever the results are, they could certainly be improved had I
allowed myself to stray further from the original code.
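On the second caveat, here is a sketch of why the width of the variable
matters. It assumes a character is held as its raw UTF-8 bytes packed
most-significant-first into one integer; this is an illustration, not
necessarily what the patch does, and pack_char is a made-up name.
Packed this way, numeric comparison agrees with code point order, so a
range like "[á-ú]" comes down to two comparisons.

```lua
-- Hypothetical helper: pack the bytes of ONE UTF-8 character into a number,
-- most significant byte first. Up to 4 bytes fits in 32 bits; a 6-byte
-- sequence would need 48 bits, hence a wider integer type in C.
local function pack_char(ch)
  local n = 0
  for i = 1, #ch do
    n = n * 256 + ch:byte(i)
  end
  return n
end

local lo = pack_char("\195\161")   -- "á", bytes C3 A1
local hi = pack_char("\195\186")   -- "ú", bytes C3 BA
local e  = pack_char("\195\169")   -- "é", bytes C3 A9
local p  = pack_char("p")          -- byte 70

-- Range test for "[á-ú]": plain numeric comparison on the packed values.
print(lo <= e and e <= hi)   -- true
print(lo <= p and p <= hi)   -- false
```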
A design limitation that's not really an implementation caveat is that
"%" classes (%a, etc.) do not work in UTF-8 mode because these map to
Unicode concepts which are beyond the scope of Lua 5.3.
Still, it was nice to see patterns producing UTF-8-correct results:
Lua 5.3.0 (work2) Copyright (C) 1994-2014 Lua.org, PUC-Rio
> print( ("páscoa"):match("[é]") )
nil
> print( ("páscoa"):match("[^é]*$") )
páscoa
> ("época"):match("[á-ú].")
ép
Again, this was barely tested so don't use it for real work; the goal
was to stir conversation about the value and feasibility of having
UTF-8 patterns in Lua 5.3.
-- Hisham