- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Hisham <h@...>
- Date: Fri, 18 Apr 2014 20:50:42 -0300

On 17 April 2014 21:55, Keith Matthews <keith.l.matthews@gmail.com> wrote:
> On Wed, Apr 16, 2014 at 9:09 PM, Hisham <h@hisham.hm> wrote:
>> On 16 April 2014 18:11, Keith Matthews <keith.l.matthews@gmail.com> wrote:
>>> On Wed, Apr 16, 2014 at 2:09 AM, Hisham <h@hisham.hm> wrote:
>>>>
>>>> Recent threads here on lua-l and discussion on Twitter about the
>>>> necessity of including UTF-8 support in core Lua (as opposed to a
>>>> library) got me thinking about how hard it would be to get proper
>>>> UTF-8 support in Lua patterns.
>>>>
>>>> The idea is to avoid things like this:
>>>>
>>>> Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio
>>>> > print( ("páscoa"):match("[é]") )
>>>> Ã
>>>> > print( ("páscoa"):match("[^é]*$") )
>>>> ¡scoa
>>>> > print( ("época"):match("[á-ú].") )
>>>> é
>>>>
>>>> To get these things to work we need more than the utf8.charpatt that
>>>> Lua 5.3 provides (utf8.charpatt can only match one character; we
>>>> can't even use "*" with it).
>>>
>>> Once you go down that road, you also need to add Unicode normalization
>>> functions. Accented letters will not match properly unless both the
>>> string and the pattern use the same normalization form.
>>>
>>> For example, ("é"):match("é") will fail if the first "é" is code point
>>> U+00E9 and the second one is the combination of code points U+0065 and
>>> U+0301. This kind of problem may arise when matching text from a file
>>> made on another computer with a different keyboard mapping.
>>>
>>> Normalization forms are described in the Unicode Standard Annex 15:
>>> http://unicode.org/reports/tr15/
>>
>> Well, since Lua 5.3 is poised to include UTF-8 support and not
>> Unicode, that's out of scope right from the start. The suggestion here
>> has a well-defined target: to optionally extend the notion of
>> character in a pattern from a byte to a UTF-8 codepoint. I think this
>> would be in line with current UTF-8 support in the core (with \u{}
>> notation and all) and would make Lua patterns more useful:
>
> It is indeed a well defined target. Don't get me wrong, the computer
> scientist side of me would love to see pattern matching with support
> for UTF-8 encoded Unicode code points in Lua. It would be useful for
> low-level manipulation of UTF-8 data. However, my software engineer
> side thinks that it's not worth opening that can of worms.

In that case, if one starts from the assumption that people will
mistake UTF-8 for Unicode and mishandle Unicode by using UTF-8, what's
the point of having any UTF-8 support in core Lua at all?

I started from the assumption that UTF-8 support in Lua meant
low-level (as in codepoint-level) UTF-8 manipulation and nothing else.
I was just trying to assess how complete this UTF-8 support could be.
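
To make "codepoint-level" concrete, here is a minimal sketch of what can
be done without any pattern changes. It assumes a Lua 5.3 build whose
utf8 library exposes utf8.codes, utf8.len and utf8.charpattern (the
utf8.charpatt mentioned above is named utf8.charpattern in the final
5.3 release). find_codepoint is a hypothetical helper written for this
example, not part of Lua, and it is only a stand-in for a one-character
set such as "[é]", not a replacement for real pattern support.

```lua
-- Assumes Lua 5.3 or later (utf8 library and "\u{XXX}" escapes).

local word = "p\u{E1}scoa"      -- "páscoa", with á as the single code point U+00E1

-- Byte length versus code point count.
print(#word, utf8.len(word))    --> 7       6

-- Split the string into whole UTF-8 characters rather than bytes.
local chars = {}
for ch in word:gmatch(utf8.charpattern) do
  chars[#chars + 1] = ch
end
print(table.concat(chars, "|")) --> p|á|s|c|o|a

-- Hypothetical helper: byte position of the first occurrence of the
-- code point `wanted` in `s`, or nil if it does not occur.
local function find_codepoint(s, wanted)
  for pos, cp in utf8.codes(s) do
    if cp == wanted then
      return pos
    end
  end
  return nil
end

-- The byte-oriented ("páscoa"):match("[é]") matches a stray lead byte;
-- comparing whole code points does not.
print(find_codepoint(word, 0xE9))            --> nil
print(find_codepoint("\u{E9}poca", 0xE9))    --> 1

-- Code points are still not normalized text: precomposed and decomposed
-- "é" are different sequences, as Keith pointed out.
print("\u{E9}" == "e\u{301}")                --> false
```

This stays entirely at the level of code points: it tells a lead byte
from a full character, and nothing more. Normalization remains out of
scope, as the last line shows.
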
-- Hisham