Re: unicode char ranges

lua-l archive


Subject: Re: unicode char ranges
From: Dirk Laurie <dirk.laurie@...>
Date: Wed, 5 Dec 2012 18:15:03 +0200

2012/12/5 spir <denis.spir@gmail.com>:
> I took it for granted that programmers would "naturally" expect that eg "â"
> would sort outside the [a-z] range, as in latin-1 or with Unicode
> _precomposed_ codes. Maybe because I have myself eaten too much latin-1. If
> I understand Dirk's method and proposal, he takes the opposite for granted,
> namely that composite letters (like "â") should sort with (or just after)
> their corresponding simple letter (here "a").

There are two reasons for doing it my way:

1. People whose mother tongue is English think of accents as optional
   extras that only foreigners use. If you Google for the most famous
   dissident leader in Communist Poland, on the first two pages of hits
   only the Wikipedia/Wikiquote sites put the accents on the
   `l` and `e`. (A Polish site finally turns up on page 3.)
2. The Lua string library is byte-oriented. I can search for five-letter
   proper names with "%u%l%l%l%l" and find "Emile" but not "Émile".
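
A minimal sketch of point 2, assuming UTF-8 input and the default "C"
locale (the character classes %u and %l follow the C locale, so other
locales may behave differently). The variable names are illustrative only:

```lua
-- "Émile" spelled with decimal byte escapes, so the example does not
-- depend on the encoding of this file. É is U+00C9, UTF-8 bytes 195 137.
local plain    = "Emile"
local accented = "\195\137mile"

-- One upper-case letter followed by four lower-case letters.
local pattern = "%u%l%l%l%l"

-- The string library counts bytes, not characters.
print(#plain, #accented)

-- The plain spelling matches at bytes 1..5.
print(plain:find(pattern))

-- In the "C" locale neither byte of É counts as %u, and no other byte
-- in the string is upper case, so there is no match.
print(accented:find(pattern))
```

The accented name is six bytes long, and the bytes that encode É fall
outside both %u and %l, which is why the search misses it.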

> * Illustrate the consequence with examples such as "â" in decomposed
> and precomposed forms.

If the whole point of your proposed library is to treat e.g. â (U+00E2)
and â (U+0061 U+0302) as denoting the same character, then I do not have
anything to contribute.  They look different in all three fonts
currently on my screen (the second form looks worse in three different
ways: accent too small, accent too high, accent too low).
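
To make the two forms concrete at the byte level, here is a rough sketch,
assuming UTF-8. The `compose` table and the `normalize` function are
hypothetical, hold a single entry, and stand in for a real normalization
table; they are not part of any existing library:

```lua
-- Precomposed â: U+00E2, UTF-8 bytes 195 162.
local precomposed = "\195\162"
-- Decomposed â: "a" (U+0061) followed by combining circumflex U+0302,
-- UTF-8 bytes 204 130.
local decomposed = "a\204\130"

-- Lua compares strings byte by byte, so the two forms are different
-- strings of different lengths.
print(precomposed == decomposed)
print(#precomposed, #decomposed)

-- Hypothetical one-entry mapping from decomposed to precomposed form.
local compose = { ["a\204\130"] = "\195\162" }

-- Match an ASCII letter followed by a two-byte sequence whose lead byte
-- is 204 or 205 (this covers the combining marks U+0300..U+036F) and
-- look the match up in the table. Matches without a table entry are
-- left unchanged by gsub.
local function normalize(s)
  return (s:gsub("[A-Za-z][\204\205][\128-\191]", compose))
end

print(normalize(decomposed) == precomposed)
```

Every string would need a pass like this before it is compared or
matched, which is the extra cost the gsub+mapping remark quoted next
refers to.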

> Thus, I guess overloading char ranges with systematic gsub+mapping
> may multiply parsing times.

For any serious parsing work, Lua patterns are clumsy and Roberto's
lpeg module really becomes essential.

Follow-Ups:
- Re: unicode char ranges, Luiz Henrique de Figueiredo
- Re: unicode char ranges, Coda Highland
- Re: unicode char ranges, Jay Carlson

References:
- unicode char ranges, spir
- Re: unicode char ranges, spir
