Re: UTF-8 patterns in Lua 5.3

lua-l archive

Subject: Re: UTF-8 patterns in Lua 5.3
From: Keith Matthews <keith.l.matthews@...>
Date: Wed, 16 Apr 2014 17:11:14 -0400

On Wed, Apr 16, 2014 at 2:09 AM, Hisham <h@hisham.hm> wrote:
>
> Recent threads here on lua-l and discussion on Twitter about the
> necessity of including UTF-8 support into core Lua (as opposed to a
> library) got me thinking about how hard would it be to get proper
> UTF-8 support in Lua patterns.
>
> The idea is to avoid things like this:
>
> Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio
> > print( ("páscoa"):match("[é]") )
> Ã
> > print( ("páscoa"):match("[^é]*$") )
> ¡scoa
> > print( ("época"):match("[á-ú].") )
> é
>
> To get these things to work we need more than utf8.charpatt that Lua
> 5.3 provides; (utf8.charpatt can only match one character, we can't
> even use "*" with it).

Once you go down that road, you also need to add Unicode normalization
functions. Accented letters will not match properly unless both the
string and the pattern use the same normalization form.

For example, ("é"):match("é") will fail if the first "é" is the
precomposed code point U+00E9 and the second one is the sequence U+0065
U+0301 ("e" followed by a combining acute accent). The two look the same
on screen but are different byte sequences. This kind of problem may
arise when matching text from a file made on another computer with a
different keyboard mapping or input method.
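
Here is a minimal sketch of that mismatch. It writes the UTF-8 bytes with
decimal escapes so it does not depend on the encoding of the source file
or the terminal. The helper compose_e_acute is hypothetical and handles
only this one pair; it is not part of Lua or of any library, and real
normalization needs the full tables from the Unicode data files.

```lua
-- Precomposed "é": U+00E9, encoded in UTF-8 as the bytes C3 A9.
local precomposed = "\195\169"
-- Decomposed "é": U+0065 followed by U+0301 (combining acute accent),
-- encoded in UTF-8 as the bytes 65 CC 81.
local decomposed = "e\204\129"

print(#precomposed, #decomposed)               --> 2   3
print(precomposed == decomposed)               --> false
print(precomposed:find(decomposed, 1, true))   --> nil
print(decomposed:find(precomposed, 1, true))   --> nil

-- Hypothetical one-entry "normalizer": rewrites the decomposed sequence
-- into the precomposed code point. An assumption for illustration only.
local function compose_e_acute(s)
  return (s:gsub("e\204\129", "\195\169"))
end

print(compose_e_acute(decomposed) == precomposed)  --> true
```

Once both sides go through the same normalization step, plain comparison
and pattern matching see the same bytes.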

Normalization forms are described in Unicode Standard Annex #15:
http://unicode.org/reports/tr15/

Keith

