- Subject: Re: unicode char ranges
- From: Jay Carlson <nop@...>
- Date: Wed, 5 Dec 2012 13:46:19 -0500
On Dec 5, 2012, at 11:15 AM, Dirk Laurie wrote:
> 2012/12/5 spir <firstname.lastname@example.org>:
>> I took it for granted that programmers would "naturally" expect that eg "â"
>> would sort outside the [a-z] range, as in latin-1 or with Unicode
>> _precomposed_ codes. Maybe because I have myself eaten too much latin-1. If
>> I understand Dirk's method and proposal, he takes the opposite for granted,
>> namely that composite letters (like "â") should sort with (or just after)
>> their corresponding simple letter (here "a").
Sorting strings *for humans* is locale-dependent. Swedish alphabetical order ends "XYZÅÄÖ". German dictionary order sorts "Ä" as if it were "A", unless you're ordering personal names (phone-book order), in which case it sorts as "AE". Sorry. When I learned Spanish, "ll" was treated as a separate letter that sorted after "l". Hey, at least it's not Welsh.
If you're sorting for, say, binary trees, all that matters is that the ordering is consistent and that "identical" inputs compare equal.
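To make the difference concrete, here is a minimal sketch in Python (used purely for illustration). The alphabet table and the `swedish_key` helper are hypothetical and hand-written; real collation data would come from something like CLDR/ICU, not from a string literal.

```python
# Hypothetical, hand-written alphabet: a-z followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Sort key ranking each letter by its position in the Swedish alphabet."""
    # Letters missing from the table sort after everything else, by codepoint.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK) + ord(ch)) for ch in word.lower()]


words = ["\u00f6", "z", "\u00e4", "\u00e5", "a"]  # ö, z, ä, å, a

# Raw codepoint order: a, z, ä (U+00E4), å (U+00E5), ö (U+00F6).
by_codepoint = sorted(words)

# Swedish order: a, z, å, ä, ö.
by_swedish = sorted(words, key=swedish_key)
```

Either ordering would do for a binary tree, since both are consistent total orders; only the second one is what a Swedish reader expects to see.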
> There are two reasons for doing it my way:
> 1. People whose mother tongue is English think of accents as optional
> extras that only foreigners use. If you Google for the most famous
> dissident leader in Communist Poland, on the first two pages of hits
> only the Wikipedia/Wikiquote sites put in the accents on the
> `l` and `e`. (On page 3 finally a Polish site is hit.)
I think this is reasonable behavior for search, although I'd bet there's some information retrieval literature on how to best weight results.
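As a rough sketch of what accent-insensitive matching might look like (Python for illustration; `fold_accents`, `matches` and the `EXTRA_FOLDS` table are hypothetical names, not anything from an existing library): decompose to NFD, drop the combining marks, and patch up the letters that have no decomposition. The Polish "ł" is one of those, so it needs an explicit table entry.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit mapping.
# This two-entry table is illustrative only, not a complete list.
EXTRA_FOLDS = str.maketrans({"\u0142": "l", "\u0141": "L"})  # ł, Ł


def fold_accents(text):
    """Strip combining marks after NFD decomposition, then apply EXTRA_FOLDS."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def matches(query, candidate):
    """Compare two strings while ignoring accents and case."""
    return fold_accents(query).casefold() == fold_accents(candidate).casefold()


folded = fold_accents("Wa\u0142\u0119sa")  # "Wałęsa" with ogonek and stroke
```

This only answers "do these match at all"; ranking exact-accent hits above folded ones is a separate question.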
>> * Illustrate the consequence with examples such as "â" in decomposed
>> and precomposed forms.
> If the whole point of your proposed library is to treat e.g. â (U+00E2)
> and â (U+0061 U+0302) as denoting the same character, then I do not have
> anything to contribute.
If you're using Unicode, they denote the same character. Completing the syllogism.... :-)
Life is usually easier in NFC. In fact, Perl 6 was proposing to map combining sequences that have no single precomposed character onto internal-only, dynamically allocated codepoints past the end of plane 16 (beyond U+10FFFF), and this has some things to recommend it.
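A small Python sketch of the equivalence, using only the standard `unicodedata` module (the variable names are just for illustration). It also shows the leftover case: a sequence with no precomposed form stays two codepoints even in NFC.

```python
import unicodedata

precomposed = "\u00e2"   # â as a single codepoint
decomposed = "a\u0302"   # a followed by COMBINING CIRCUMFLEX ACCENT

# Compared codepoint by codepoint, the two strings differ.
same_raw = precomposed == decomposed

# After normalizing both to NFC, they are identical.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
same_nfc = nfc_a == nfc_b

# q + circumflex has no precomposed codepoint, so NFC leaves it as two.
q_hat = unicodedata.normalize("NFC", "q\u0302")
q_hat_length = len(q_hat)
```

Here `same_raw` is false, `same_nfc` is true, and `q_hat_length` is 2.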
> They look visibly different in all three
> fonts currently on my screen (the second form looks worse in
> three different ways: accent too small, accent too high, accent
> too low).
Here's a nickel. Get yourself a real operating system (or perhaps just a real MUA).
>> Thus, I guess overloading char ranges with systematic gsub+mapping
>> may multiply parsing times.
> For any serious parsing work, Lua patterns are clumsy and Roberto's
> lpeg module really becomes essential.
It's not that hard to fix Lua patterns to treat UTF-8-encoded codepoints as single "characters"--although the original problem, [\u1000-\u20ff], is harder. But you rarely want to do just some range; usually you mean something like "find me all the Cyrillic" or "replace all runs of punctuation and whitespace with a single space".
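For illustration, here is a sketch of those three operations in Python rather than Lua, working from raw UTF-8 bytes for the first two so the codepoint splitting is visible. All function names are hypothetical. The byte pattern assumes well-formed UTF-8 and does not reject malformed input, and "Cyrillic" is approximated by the Unicode character name, which is an assumption rather than a proper script lookup.

```python
import re
import unicodedata

# One UTF-8 encoded codepoint: a lead byte plus any continuation bytes.
# Assumes valid UTF-8; malformed sequences are not detected.
UTF8_CHAR = re.compile(rb"[\x00-\x7F\xC2-\xF4][\x80-\xBF]*")


def codepoints(raw):
    """Yield integer codepoints from a UTF-8 byte string."""
    for match in UTF8_CHAR.finditer(raw):
        yield ord(match.group().decode("utf-8"))


def find_in_range(raw, lo, hi):
    """Return the characters whose codepoints fall within lo..hi inclusive."""
    return [chr(cp) for cp in codepoints(raw) if lo <= cp <= hi]


def is_cyrillic(ch):
    """Approximate script test based on the Unicode character name."""
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def collapse_separators(text):
    """Replace each run of punctuation and whitespace with a single space."""
    out = []
    in_run = False
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            if not in_run:
                out.append(" ")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


sample = "x\u1000y\u20acz".encode("utf-8")
in_range = find_in_range(sample, 0x1000, 0x20FF)
cyrillic = [ch for ch in "abc \u0433\u0434\u0435" if is_cyrillic(ch)]
collapsed = collapse_separators("Hello,  world!? ok")
```

The range test is the only one that cares about codepoint values; the other two are really questions about character properties, which is why a bare range rarely expresses what you mean.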