- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Dirk Laurie <dirk.laurie@...>
- Date: Thu, 17 Apr 2014 07:32:50 +0200
2014-04-17 3:09 GMT+02:00 Hisham <h@hisham.hm>:
> On 16 April 2014 18:11, Keith Matthews <keith.l.matthews@gmail.com> wrote:
>>> To get these things to work we need more than utf8.charpatt that Lua
>>> 5.3 provides; (utf8.charpatt can only match one character, we can't
>>> even use "*" with it).
>>
>> Once you go down that road, you also need to add Unicode normalization
>> functions. Accented letters will not match properly unless both the
>> string and the pattern use the same normalization form.
>>
>> For example, ("é"):match("é") will fail if the first "é" is code point
>> U+00E9 and the second one is the combination of code points U+0065 and
>> U+0301. This kind of problem may arise when matching text from a file
>> made on another computer with a different keyboard mapping.
>>
>> Normalization forms are described in the Unicode Standard Annex 15:
>> http://unicode.org/reports/tr15/
>
> Well, since Lua 5.3 is poised to include UTF-8 support and not
> Unicode, that's out of scope right from the start. The suggestion here
> has a well-defined target: to optionally extend the notion of
> character in a pattern from a byte to a UTF-8 codepoint. I think this
> would be in line with current UTF-8 support in the core (with \u{}
> notation and all) and would make Lua patterns more useful:
For finding words in text, I have been using a mapping from accented
Latin characters into ASCII. The mapping is confined to characters
that combine a single letter from the Latin alphabet with diacritics,
and it does not matter whether one's compose key or one's dead key
added the diacritics.
For example, both representations of é would become e, and the lookup
would find Hélène even if you spelt it Hèléne. Obviously, if a more exact
match is needed, it is easy to add a second round that simply does
a direct equality comparison among the survivors of the first.
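A rough sketch of the mapping and the two-round lookup described above, in plain Lua. Everything here is an assumption made for illustration: the names `to_ascii`, `latinize` and `lookup` are hypothetical, the table covers only a handful of letters, and combining marks are recognised only in the range U+0300 to U+036F. Byte escapes are used instead of literal accented characters so that the file's own encoding cannot change the result.

```lua
-- Hypothetical table: precomposed letters (as UTF-8 bytes) -> ASCII base.
-- A usable table would need many more entries.
local to_ascii = {
  ["\195\160"] = "a", ["\195\161"] = "a", ["\195\162"] = "a", ["\195\164"] = "a", -- U+00E0 U+00E1 U+00E2 U+00E4
  ["\195\168"] = "e", ["\195\169"] = "e", ["\195\170"] = "e", ["\195\171"] = "e", -- U+00E8..U+00EB
}

-- Fold accented Latin letters to ASCII, whichever way they were encoded.
local function latinize(s)
  -- The pattern matches one multibyte UTF-8 sequence; ASCII passes through.
  return (s:gsub("[\194-\244][\128-\191]*", function(ch)
    local mapped = to_ascii[ch]
    if mapped then return mapped end
    -- Combining marks U+0300..U+036F are encoded as 0xCC 0x80 .. 0xCD 0xAF.
    local b1, b2 = ch:byte(1, 2)
    if #ch == 2 and (b1 == 0xCC or (b1 == 0xCD and b2 <= 0xAF)) then
      return ""            -- drop the mark, keep the base letter before it
    end
    return ch              -- anything else is left untouched
  end))
end

-- Round one keeps every word whose folded form equals the folded query;
-- round two keeps the survivors that are byte-for-byte equal to the query.
local function lookup(words, query)
  local key, loose, exact = latinize(query), {}, {}
  for _, w in ipairs(words) do
    if latinize(w) == key then
      loose[#loose + 1] = w
      if w == query then exact[#exact + 1] = w end
    end
  end
  return loose, exact
end

local precomposed = "H\195\169l\195\168ne"    -- é and è as single code points
local decomposed  = "He\204\129le\204\128ne"  -- e + U+0301, e + U+0300
local misspelt    = "H\195\168l\195\169ne"    -- accents swapped
local loose, exact = lookup({ precomposed, decomposed, "Helen" }, misspelt)
print(#loose, #exact)
```

The two spellings of the name differ as byte strings, which is why a plain comparison fails, yet both fold to the same ASCII key. The second round does not reconcile precomposed and decomposed forms; that would still need real normalization.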
This is not so easy to achieve in "pure" Lua. My poor attempt to code it
myself has been described on this list as bogus, wrong, etc. I wish
someone with more knowledge than me had done it: someone who not only
can throw around words like normalization and glyph, but knows exactly
what they mean.
The task is clearly not a candidate for Lua 5.3's utf8 library, and is
perhaps out of scope even for Hisham's. Then again, maybe not: the feature
is needed only for 'find', and it could be seen as a special case of
no-magic searching, triggered by supplying a particular value for the
fourth parameter, say "latinize".
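To make the suggestion concrete, here is a minimal sketch of a wrapper around string.find. It is not an existing API: the wrapper `find`, the helper `fold` and the mode value "latinize" are all hypothetical, and `fold` handles only accented e plus combining marks in U+0300 to U+036F, just enough to run the example.

```lua
-- Hypothetical, deliberately tiny fold: accented e and combining marks only.
local function fold(s)
  s = s:gsub("\195[\168-\171]", "e")   -- U+00E8..U+00EB -> e
  s = s:gsub("\204[\128-\191]", "")    -- drop U+0300..U+033F
  s = s:gsub("\205[\128-\175]", "")    -- drop U+0340..U+036F
  return s
end

-- Behaves like string.find, except that the fourth argument may also be
-- the string "latinize", which requests a plain search on folded text.
local function find(s, pattern, init, mode)
  if mode == "latinize" then
    -- Caveat: init and the returned positions index the folded copy,
    -- which can be shorter than the original string.
    return fold(s):find(fold(pattern), init, true)
  end
  return s:find(pattern, init, mode)
end

print(find("Bonjour H\195\169l\195\168ne", "Helene", 1, "latinize"))
print(find("a.b", ".", 1, true))
```

Mapping the positions back onto the original bytes is the part this sketch leaves out, and it is probably the main thing a real implementation would have to solve.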