- Subject: Re: unicode char ranges
- From: spir <denis.spir@...>
- Date: Wed, 05 Dec 2012 11:45:20 +0100
Right, things are clearer now, I guess; the discussion helped me better realise that
there is no basic or natural order, even just for programmers. For the concrete
check of whether a character is in range, I will stick with code-ordering, meaning
lexicographic order (as in dictionaries) of multi-byte and/or multi-code
characters, based on numeric code values. However, the problem remains that the
code-order differs depending on the actual coding form, decomposed or
precomposed [1]. (This is similar to, but distinct from, the issue that whether
a given char or string matches _literally_ depends on the coding form currently
in use.)
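To make this concrete, here is a minimal Lua illustration (assuming a UTF-8
source; the byte escapes spell out the two encodings of "â"):

    -- "â" precomposed: single code U+00E2, UTF-8 bytes C3 A2
    local pre = "\195\162"
    -- "â" decomposed: "a" + combining circumflex U+0302, UTF-8 bytes 61 CC 82
    local dec = "a\204\130"

    -- Lua string comparison is lexicographic (byte-wise in the C locale):
    print("a" <= pre and pre <= "z")  --> false: 0xC3 sorts past "z" (0x7A)
    print("a" <= dec and dec <= "z")  --> true:  sorts just after plain "a"

So the very same character falls inside or outside [a-z] depending only on its
coding form.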
I took it for granted that programmers would "naturally" expect that e.g. "â"
sorts outside the [a-z] range, as it does in latin-1 or with Unicode _precomposed_
codes. Maybe because I have myself eaten too much latin-1. If I understand
Dirk's method and proposal correctly, he takes the opposite for granted, namely
that composite letters (like "â") should sort with (or just after) their
corresponding simple letter (here "a"). If I have misunderstood, I'm sorry.
In any case, maybe it's better to consider this view first because, I guess, it
is probably closer to end-user expectation and to common ordering in various
languages. (There are languages where composite letters sort completely apart,
though; if you know concrete examples, thank you.)
I'm considering the following design:
* Make it clear that the lib deals with character ranges "stupidly", according
to code values in sequence.
* Illustrate the consequences with examples such as "â" in decomposed and
precomposed forms.
* Process the source text as is, meaning that (1) if precomposed, composite
characters sort apart; (2) if decomposed, composite characters sort just after
their base.
* Leave the door open to an extension offering 3 modes (sketched below):
0. base mode: the source is left as is, unprocessed (or possibly just
decoded to a unicode sequence)
1. the source is preprocessed toward decomposed form
2. the source is preprocessed toward precomposed form
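A rough sketch of how such a mode switch could look; the nfd/nfc normalisers
below are identity stubs standing in for real ones (stock Lua has no Unicode
normalisation, so they would have to come from an external library or from
user-supplied tables):

    -- Placeholder normalisers: stubs only, NOT real implementations.
    local function nfd (s) return s end  -- would return the decomposed form
    local function nfc (s) return s end  -- would return the precomposed form

    local modes = {
      [0] = function (s) return s end,   -- base mode: source left as is
      [1] = nfd,                         -- toward decomposed form
      [2] = nfc,                         -- toward precomposed form
    }

    local function prepare (source, mode)
      return modes[mode or 0](source)
    end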
What do you think?
A solution à la Dirk may not be very practical for a general-purpose lib.
Actually, to be really useful, it would in many cases probably require users to
define their own map table, wouldn't it? And it is very costly: char ranges are,
together with plain literals, the base matching patterns of all string parsing
(ultimately, each char in the source is matched either as [part of] a literal or
as valid for a range); also, due to alternatives and other choices in grammars,
typically composed in layers of syntactic structures, each bit of source is
usually matched numerous times before matching succeeds. Thus, I guess,
overloading char ranges with a systematic gsub+mapping pass may multiply parsing
times.
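For concreteness, such a user-supplied mapping pass could look roughly like this
(a sketch of the idea, not Dirk's actual code; the map table and its single
entry are invented):

    -- Hypothetical user map: fold precomposed "â" (U+00E2) onto its
    -- decomposed form so that it sorts just after "a".
    local map = { ["\195\162"] = "a\204\130" }

    -- Rewrite every multi-byte UTF-8 sequence listed in the table;
    -- gsub with a table leaves sequences without an entry unchanged.
    local function premap (source)
      return (source:gsub("[\194-\244][\128-\191]+", map))
    end

Such a pass would have to run on every piece of text before matching, which is
where the cost multiplies.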
Otherwise, char-range matching is simple enough. To allow for any unicode source
(including pattern literals), a range is presently defined just as:
digit = Range {"0","9"}
Thus (unlike in regex-like formats) each range-border character can be arbitrarily
compound (multi-byte, multi-code) and, when composite, in arbitrary form
(decomposed or precomposed). Matching is then just checking whether, byte after
byte or code after code (if the source is decoded), the snippet in source is in
range, lexicographically.
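A minimal sketch of such a check, assuming UTF-8 byte strings and the easy case
where both border characters have the same byte length (the Range shape below is
my guess, not the lib's actual definition):

    -- Minimal range object: just stores the two border characters.
    local function Range (t)
      return { lo = t[1], hi = t[2] }
    end

    -- Fast path: both borders have the same byte length, so take that
    -- many bytes from the source and compare lexicographically.
    local function match_range (r, source, pos)
      local n = #r.lo
      if #r.hi ~= n then return nil end  -- general case: see below
      local snippet = source:sub(pos, pos + n - 1)
      if r.lo <= snippet and snippet <= r.hi then
        return pos + n  -- position right after the match
      end
    end

    local digit = Range {"0","9"}
    print(match_range(digit, "42", 1))  --> 2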
(However, without full normalisation I cannot simply compare whether the next
"character" is properly >= and <=, since I have no idea how many bytes or codes
said character covers. I need to check step by step until a given byte or code
says "no!" or "yes". A specialisation is easy, though, for the probably very
common, but not general, case where the border chars have the same length.)
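That step-by-step check could look like the sketch below (again over UTF-8
bytes; the function name and return convention are mine). It walks the source
one byte at a time, tracking whether the snippet read so far is already strictly
above the lower border and strictly below the upper one; a single byte can
settle the question, because once the snippet is strictly inside both borders no
continuation byte can move it back out:

    -- Decide, byte by byte, whether the snippet starting at `pos` lies
    -- lexicographically between borders `lo` and `hi`, without knowing
    -- in advance how many bytes the "character" covers.
    local function in_range (source, pos, lo, hi)
      local above, below = false, false  -- strictly past lo / under hi yet?
      local i = 1
      while true do
        local b = source:byte(pos + i - 1)
        local bl, bh = lo:byte(i), hi:byte(i)
        if b == nil then
          -- source exhausted: in range iff the lower border was covered
          return above or bl == nil
        end
        if not above then
          if bl == nil or b > bl then above = true  -- past lo for good
          elseif b < bl then return false           -- below lo: "no!"
          end
        end
        if not below then
          if bh == nil then return false            -- ran past hi: "no!"
          elseif b < bh then below = true           -- under hi for good
          elseif b > bh then return false           -- above hi: "no!"
          end
        end
        if above and below then return true end     -- this byte said "yes"
        i = i + 1
      end
    end

Where the matched character actually ends, i.e. how many bytes to consume, is a
separate concern, to be settled from the coding form itself.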
Seems sensible?
denis
[1] Or mixed (!): as in a-with-tilde-and-dot-below coded as a-with-tilde +
dot-below. One problem with precomposed forms is that they also introduce
half-composed forms...