On 16 April 2014 18:11, Keith Matthews <keith.l.matthews@gmail.com> wrote:
> On Wed, Apr 16, 2014 at 2:09 AM, Hisham <h@hisham.hm> wrote:
>>
>> Recent threads here on lua-l and discussion on Twitter about the
>> necessity of including UTF-8 support in core Lua (as opposed to a
>> library) got me thinking about how hard it would be to get proper
>> UTF-8 support in Lua patterns.
>>
>> The idea is to avoid things like this:
>>
>> Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>> > print( ("páscoa"):match("[é]") )
>> Ã
>> > print( ("páscoa"):match("[^é]*$") )
>> ¡scoa
>> > print( ("época"):match("[á-ú].") )
>> é
>>
>> To get these things to work we need more than the utf8.charpatt that
>> Lua 5.3 provides (utf8.charpatt can only match one character; we
>> can't even use "*" with it).
>
> Once you go down that road, you also need to add Unicode normalization
> functions. Accented letters will not match properly unless both the
> string and the pattern use the same normalization form.
>
> For example, ("é"):match("é") will fail if the first "é" is code point
> U+00E9 and the second one is the combination of code points U+0065 and
> U+0301. This kind of problem may arise when matching text from a file
> made on another computer with a different keyboard mapping.
>
> Normalization forms are described in the Unicode Standard Annex 15:
> http://unicode.org/reports/tr15/

Well, since Lua 5.3 is poised to include UTF-8 support and not full
Unicode support, that's out of scope right from the start. The suggestion
here has a well-defined target: to optionally extend the notion of a
character in a pattern from a byte to a UTF-8-encoded code point. I think
this would be in line with the current UTF-8 support in the core (with
\u{} notation and all) and would make Lua patterns more useful:

Lua 5.3.0 (work2)  Copyright (C) 1994-2014 Lua.org, PUC-Rio
> text="Because I said I'd use it as a test case: Привет, мир!!!"
> for word in text:gmatch("[\u{400}-\u{4ff}]+") do print(word) end
Привет
мир
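
For comparison, roughly the same result can be had in stock Lua 5.3
without the patch, by walking the string with utf8.codes and collecting
runs of code points that fall in a range. This is only a sketch of what
"character = code point" means here, not how the patch works; the helper
name codepoint_runs is made up for illustration, and the input is assumed
to be valid UTF-8.

```lua
-- Sketch for unpatched Lua 5.3+: collect maximal runs of code points in
-- the inclusive range [lo, hi]. `codepoint_runs` is a hypothetical helper.
local function codepoint_runs(s, lo, hi)
  local runs, start = {}, nil
  for pos, cp in utf8.codes(s) do
    if cp >= lo and cp <= hi then
      start = start or pos               -- byte offset where the run begins
    elseif start then
      runs[#runs + 1] = s:sub(start, pos - 1)
      start = nil
    end
  end
  if start then runs[#runs + 1] = s:sub(start) end  -- run reaching the end
  return runs
end

local text = "Because I said I'd use it as a test case: Привет, мир!!!"
for _, word in ipairs(codepoint_runs(text, 0x400, 0x4FF)) do
  print(word)
end
```

This handles a single range only; the point of extending patterns is that
sets, negation and repetition would come along for free.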

While I'm at it, I made the patch a bit shorter, since we only have to
deal with up to 4 bytes at a time:

Updated at
http://hisham.hm/tmp/lua-5.3.0-work2-utf8patterns.patch
and https://gist.github.com/hishamhm/10814558
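
The "up to 4 bytes" remark presumably refers to the maximum length of one
UTF-8-encoded code point. As a rough illustration only (this is not taken
from the patch, and decode_utf8 is a made-up name), decoding one sequence
in Lua 5.3 could look like the following. It assumes well-formed input and
does no validation of continuation bytes or overlong forms.

```lua
-- Sketch, not from the patch: decode the UTF-8 sequence that starts at
-- byte index i. Returns the code point and the sequence length in bytes.
-- Assumes well-formed input; `decode_utf8` is a hypothetical name.
local function decode_utf8(s, i)
  local b1 = s:byte(i)
  if b1 < 0x80 then return b1, 1 end                 -- 0xxxxxxx (ASCII)
  local len, cp
  if b1 >= 0xF0 then len, cp = 4, b1 & 0x07          -- 11110xxx
  elseif b1 >= 0xE0 then len, cp = 3, b1 & 0x0F      -- 1110xxxx
  else len, cp = 2, b1 & 0x1F end                    -- 110xxxxx
  for k = 1, len - 1 do
    cp = (cp << 6) | (s:byte(i + k) & 0x3F)          -- 10xxxxxx payload
  end
  return cp, len
end

print(decode_utf8("\u{E9}", 1))   --> 233     2
```

The built-in utf8.codepoint does the same job with validation; the sketch
is only meant to show why four bytes is the most that has to be examined
per character.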

-- Hisham