- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Keith Matthews <keith.l.matthews@...>
- Date: Thu, 17 Apr 2014 20:55:36 -0400
On Wed, Apr 16, 2014 at 9:09 PM, Hisham <h@hisham.hm> wrote:
> On 16 April 2014 18:11, Keith Matthews <keith.l.matthews@gmail.com> wrote:
>> On Wed, Apr 16, 2014 at 2:09 AM, Hisham <h@hisham.hm> wrote:
>>>
>>> Recent threads here on lua-l and discussion on Twitter about the
>>> necessity of including UTF-8 support into core Lua (as opposed to a
>>> library) got me thinking about how hard it would be to get proper
>>> UTF-8 support in Lua patterns.
>>>
>>> The idea is to avoid things like this:
>>>
>>> Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio
>>> > print( ("páscoa"):match("[é]") )
>>> Ã
>>> > print( ("páscoa"):match("[^é]*$") )
>>> ¡scoa
>>> > print( ("época"):match("[á-ú].") )
>>> é
>>>
>>> To get these things to work we need more than the utf8.charpatt that
>>> Lua 5.3 provides (utf8.charpatt can only match one character; we
>>> can't even use "*" with it).
>>
>> Once you go down that road, you also need to add Unicode normalization
>> functions. Accented letters will not match properly unless both the
>> string and the pattern use the same normalization form.
>>
>> For example, ("é"):match("é") will fail if the first "é" is code point
>> U+00E9 and the second one is the combination of code points U+0065 and
>> U+0301. This kind of problem may arise when matching text from a file
>> made on another computer with a different keyboard mapping.
>>
>> Normalization forms are described in the Unicode Standard Annex 15:
>> http://unicode.org/reports/tr15/
>
> Well, since Lua 5.3 is poised to include UTF-8 support and not
> Unicode, that's out of scope right from the start. The suggestion here
> has a well-defined target: to optionally extend the notion of
> character in a pattern from a byte to a UTF-8 codepoint. I think this
> would be in line with current UTF-8 support in the core (with \u{}
> notation and all) and would make Lua patterns more useful:

It is indeed a well-defined target. Don't get me wrong, the computer
scientist side of me would love to see pattern matching with support
for UTF-8 encoded Unicode code points in Lua. It would be useful for
low-level manipulation of UTF-8 data. However, my software engineer
side thinks that it's not worth opening that can of worms. For
example, in your original message you said that you wanted to avoid
this:

(1) > print( ("páscoa"):match("[^é]*$") )
¡scoa

I applied your patch to Lua 5.3 work 2, and copied your example:

(2) > print( ("páscoa"):match("[^é]*$") )
páscoa

Great! The patch works. I carefully typed the same example by hand:

(3) > print( ("páscoa"):match("[^é]*$") )
scoa

See the difference? It looks the same, but I used a combining acute
accent (U+0301) for á and é. So even with your patch, you don't
avoid this problem: it has only been pushed to another abstraction layer.
Example (1) fails because string.match works at the byte level while
"é" is composed of two bytes, and example (3) fails because the
patched string.match works at the code point level while "é" is
composed of two code points.
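
To make the two levels concrete, here is a minimal sketch that runs on
an unpatched interpreter. It assumes Lua 5.3's \u{...} string escapes
and utf8.len; the variable names are purely illustrative, and it only
inspects the strings rather than reproducing the patched string.match.

```lua
-- Two spellings of "é" that render identically (names are illustrative).
local precomposed = "\u{E9}"    -- U+00E9, one code point, two bytes
local decomposed  = "e\u{301}"  -- U+0065 + U+0301 (combining acute accent)

-- Byte level: what the stock string functions see.
print(#precomposed, #decomposed)                    --> 2	3

-- Code point level: what a code-point-aware matcher would see.
print(utf8.len(precomposed), utf8.len(decomposed))  --> 1	2

-- The spellings are different byte strings, so plain comparison fails.
print(precomposed == decomposed)                    --> false

-- Example (1) at the byte level: "[^é]" excludes the bytes 0xC3 and 0xA9
-- individually, so the match restarts after the 0xC3 lead byte of "á"
-- and returns a stray continuation byte followed by "scoa".
local leftover = ("p\u{E1}scoa"):match("[^\u{E9}]*$")
print(leftover == "\xA1scoa")                       --> true
```

As far as I know, stock Lua has no normalization function, so bringing
both spellings to the same form before matching would need an external
library.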

In a way, this is even worse: a programmer will likely type the code
and write the unit tests with the same text editor, using precomposed
characters, and everything will appear to work. The bug will only crop
up much later when the pattern is used on real-world data that
contains combining characters.

Keith