
Right, things are clearer, I guess; the discussion helped me better realise that there is no basic or natural order, even just for programmers. For the concrete check of whether a character is in range, I will stick with code ordering, meaning lexicographical order (as in dictionaries) of multi-byte and/or multi-code characters based on their numeric code values. However, the problem remains that code order differs depending on the actual coding form, decomposed or precomposed [1]. (This is similar to, but distinct from, the issue that a given char or string may _literally_ match or not depending on the coding form presently in use.)
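
To make the difference concrete, here is a small illustration of mine (UTF-8 byte values; note that Lua compares strings byte-wise under the default C locale):

    -- two codings of the same abstract character "â"
    local pre = "\195\162"             -- precomposed: U+00E2 (UTF-8: C3 A2)
    local dec = "a\204\130"            -- decomposed: "a" + U+0302 (UTF-8: CC 82)
    print(pre >= "a" and pre <= "z")   --> false: lead byte 0xC3 sorts past "z" (0x7A)
    print(dec >= "a" and dec <= "z")   --> true: the leading "a" keeps it in range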

I took it for granted that programmers would "naturally" expect that e.g. "â" sorts outside the [a-z] range, as in latin-1 or with Unicode _precomposed_ codes. Maybe because I have myself eaten too much latin-1. If I understand Dirk's method and proposal, he takes the opposite for granted, namely that composite letters (like "â") should sort with (or just after) their corresponding simple letter (here "a"). If I misunderstood, I'm sorry. In any case, it may be better to consider this view first because --I guess-- it is probably closer to end-user expectation and to common ordering in various languages (though there are languages where composite letters sort entirely apart; if you know concrete examples, thank you).

I'm considering the following design:
* Make it clear that the lib deals with character ranges "stupidly", according to code values in sequence.
* Illustrate the consequence with examples such as "â" in decomposed and precomposed forms.
* Process the source text as is, meaning that (1) if precomposed, composite characters sort apart; (2) if decomposed, composite characters sort just after their base.
* Leave the door open to an extension proposing 3 modes (sketched below):
  0. base mode: the source text is left as is, unprocessed (or possibly just decoded to a Unicode code sequence)
  1. the source is preprocessed toward decomposed form
  2. the source is preprocessed toward precomposed form
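
Purely as a sketch of the shape this could take (the option name "form" and its values are invented here, nothing is settled):

    digit   = Range {"0","9"}                          -- mode 0: text as is
    letter  = Range {"a","z", form = "decomposed"}     -- mode 1: preprocess toward decomposed
    letter2 = Range {"a","z", form = "precomposed"}    -- mode 2: preprocess toward precomposed
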
What do you think?

A solution à la Dirk may not be very practical for a general-purpose lib. Actually, to be really useful, it would probably require users in many cases to define their own map table, wouldn't it? And it is very costly, since char ranges are, together with plain literals, the basic matching patterns of all string parsing (ultimately, each char in the source is matched either as [part of] a literal or as valid for a range); also, due to alternatives and other choices in a grammar, typically composed in layers of syntactic structures, each bit of source is usually matched numerous times before matching succeeds. Thus, I guess overloading char ranges with a systematic gsub+mapping step may multiply parsing times. (A rough sketch of what I mean follows.)
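
For instance, such a pre-mapping could look roughly like this (my own guess at the shape, with a toy map table; not Dirk's actual code):

    -- fold precomposed letters onto their decomposed forms before range tests
    local map = { ["\195\162"] = "a\204\130" }   -- "â" precomposed -> decomposed
    local function fold(s)
      -- match a UTF-8 lead byte plus its continuation bytes, then look it up;
      -- unmapped sequences are kept as is by gsub's table-lookup behaviour
      return (s:gsub("[\194-\244][\128-\191]*", map))
    end

Done once over the whole subject this is cheap; done inside every range test, it gets multiplied by the number of match attempts.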

Otherwise, char-range matching is simple enough. To allow for any Unicode source (including pattern literals), a range is presently defined just as:
    digit = Range {"0","9"}
Thus (unlike in regex-like formats) each range-border character can be arbitrarily compound (multi-byte, multi-code) and, when composite, in either form (decomposed or precomposed). Matching is then just checking whether, byte after byte or code after code (if the source is decoded), the snippet in the source is in range, lexicographically. (However, without full normalisation I cannot simply compare whether the next "character" is properly >= and <=, since I have no idea how many bytes or codes said character covers. I need to check step by step until one given byte or code tells me "no!" or "yes". A specialisation is easy, though, for the probably very common, but not general, case where the border chars have the same length; see the sketch below.)
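
A minimal sketch of that specialisation (the names are mine; again, Lua compares strings byte-wise under the default C locale):

    -- in-range test specialised for range borders of equal byte length
    local function matchRange(source, pos, lo, hi)
      assert(#lo == #hi, "specialised for same-length borders")
      local snippet = source:sub(pos, pos + #lo - 1)
      if #snippet == #lo and snippet >= lo and snippet <= hi then
        return pos + #lo           -- position just past the matched snippet
      end
      return nil                   -- not in range
    end

For example, matchRange("0x2A", 3, "0", "9") returns 4, matching the "2".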
Seems sensible?

denis

[1] Or mixed (!): as in a-with-tilde-and-dot-below coded as a-with-tilde + dot-below. One problem with precomposed forms is that they also introduce half-composed forms...