On Wed, Apr 16, 2014 at 2:09 AM, Hisham <h@hisham.hm> wrote:
>
> Recent threads here on lua-l and discussion on Twitter about the
> necessity of including UTF-8 support into core Lua (as opposed to a
> library) got me thinking about how hard would it be to get proper
> UTF-8 support in Lua patterns.
>
> The idea is to avoid things like this:
>
> Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio
> > print( ("páscoa"):match("[é]") )
> Ã
> > print( ("páscoa"):match("[^é]*$") )
> ¡scoa
> > print( ("época"):match("[á-ú].") )
> é
>
> To get these things to work we need more than utf8.charpatt that Lua
> 5.3 provides; (utf8.charpatt can only match one character, we can't
> even use "*" with it).

Once you go down that road, you also need to add Unicode normalization
functions. Accented letters will not match properly unless both the
string and the pattern use the same normalization form.

For example, ("é"):match("é") will fail if the first "é" is code point
U+00E9 and the second one is the combination of code points U+0065 and
U+0301. This kind of problem may arise when matching text from a file
made on another computer with a different keyboard mapping.
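
A minimal sketch of that mismatch, assuming both strings are UTF-8
encoded. The variable names are only illustrative, and the bytes are
written as decimal escapes so the snippet does not depend on the source
file's encoding and runs on Lua 5.1 and later:

```lua
-- U+00E9 (precomposed "é") in UTF-8 is the two bytes C3 A9.
local precomposed = "\195\169"
-- U+0065 ("e") followed by U+0301 (combining acute accent) is 65 CC 81.
local decomposed = "e\204\129"

-- Both display as "é", but the byte lengths already differ.
print(#precomposed, #decomposed)              -- 2   3

-- Lua compares and searches strings byte by byte, so neither form
-- is found inside the other (plain find, no pattern magic).
print(precomposed == decomposed)              -- false
print(precomposed:find(decomposed, 1, true))  -- nil
print(decomposed:find(precomposed, 1, true))  -- nil
```

The two strings render identically but share no byte sequence, so no
match can succeed until both are brought to the same normalization form
(NFC or NFD).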

Normalization forms are described in Unicode Standard Annex #15:
http://unicode.org/reports/tr15/

Keith