- Subject: Re: unicode char ranges
- From: Dirk Laurie <dirk.laurie@...>
- Date: Wed, 5 Dec 2012 18:15:03 +0200
2012/12/5 spir <denis.spir@gmail.com>:
> I took it for granted that programmers would "naturally" expect that eg "â"
> would sort outside the [a-z] range, as in latin-1 or with Unicode
> _precomposed_ codes. Maybe because I have myself eaten too much latin-1. If
> I understand Dirk's method and proposal, he takes the opposite for granted,
> namely that composite letters (like "â") should sort with (or just after)
> their corresponding simple letter (here "a").
There are two reasons for doing it my way:
1. People whose mother tongue is English think of accents as optional
extras that only foreigners use. If you Google for the most famous
dissident leader in Communist Poland, on the first two pages of hits
only the Wikipedia/Wikiquote sites put in the accents on the
`l` and `e`. (On page 3 finally a Polish site is hit.)
2. The Lua string library is byte-oriented. I can search for five-letter
proper names with "%u%l%l%l%l" and find "Emile" but not "Émile".
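
A minimal sketch of point 2, assuming the text is UTF-8 and Lua is running in its default "C" locale (the variable names are only illustrative). "É" is encoded as the two bytes C3 89, neither of which %u or %l recognises as a letter, so the accented name is six bytes long and the five-class pattern cannot match it:

```lua
-- Assumption: UTF-8 input, default "C" locale.
-- Decimal escapes are used so the bytes are explicit in any Lua version.
local pattern  = "%u%l%l%l%l"
local plain    = "Emile"
local accented = "\195\137mile"   -- "Émile": É (U+00C9) is the bytes C3 89

print(#plain, #accented)          --> 5   6
print(plain:find(pattern))        --> 1   5
print(accented:find(pattern))     --> nil
```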
> * Illustrate the consequence with examples such as "â" in decomposed
> and precomposed forms.
If the whole point of your proposed library is to treat e.g. â (U+00E2)
and â (U+0061 U+0302) as denoting the same character, then I do not have
anything to contribute. They look different in all three
fonts currently on my screen (the second form looks worse in
three different ways: accent too small, accent too high, accent
too low).
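
To make the two forms concrete, here is a small sketch assuming UTF-8 encoding (the variable names are illustrative only). To the byte-oriented string library they are simply different strings, and only the decomposed form begins with a byte inside the [a-z] range:

```lua
-- Assumption: both forms are encoded as UTF-8.
local precomposed = "\195\162"    -- U+00E2: bytes C3 A2
local decomposed  = "a\204\130"   -- U+0061 U+0302: bytes 61 CC 82

print(#precomposed, #decomposed)      --> 2   3
print(precomposed == decomposed)      --> false
print(decomposed:find("^[a-z]"))      --> 1   1
print(precomposed:find("^[a-z]"))     --> nil
```

Treating them as the same character would therefore need a normalisation step before any comparison or range test; nothing in the standard string library does that.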
> Thus, I guess overloading char ranges with systematic gsub+mapping
> may multiply parsing times.
For any serious parsing work, Lua patterns are clumsy and Roberto's
lpeg module really becomes essential.