- Subject: Re: unicode char ranges
- From: spir <denis.spir@...>
- Date: Wed, 05 Dec 2012 11:45:20 +0100
Right, things are clearer now, I guess; the discussion helped me better realise that
there is no basic or natural order, even just for programmers. For the concrete
check of whether a character is in range, I will stick with code-ordering, meaning
lexicographic order (as in dictionaries) of multi-byte and/or multi-code
characters, based on numeric code values. However, the problem remains that the
code-order differs depending on the actual coding form, decomposed or
precomposed [1]. (This is similar to, but distinct from, the issue that whether
a given char or string matches _literally_ depends on the coding form currently
in use.)
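To make this concrete, here is a minimal Lua illustration (assuming a UTF-8
source; the byte escapes spell out the two encodings of "â"):

    -- "â" precomposed: single code U+00E2, UTF-8 bytes C3 A2
    local pre = "\195\162"
    -- "â" decomposed: "a" + combining circumflex U+0302, UTF-8 bytes 61 CC 82
    local dec = "a\204\130"

    -- Lua string comparison is lexicographic (byte-wise in the C locale):
    print("a" <= pre and pre <= "z")  --> false: 0xC3 sorts past "z" (0x7A)
    print("a" <= dec and dec <= "z")  --> true:  sorts just after plain "a"

So the very same character falls inside or outside [a-z] depending only on its
coding form.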
I took it for granted that programmers would "naturally" expect that e.g. "â"
sorts outside the [a-z] range, as it does in latin-1 or with Unicode _precomposed_
codes. Maybe because I have myself eaten too much latin-1. If I understand
Dirk's method and proposal correctly, he takes the opposite for granted, namely
that composite letters (like "â") should sort with (or just after) their
corresponding simple letter (here "a"). If I have misunderstood, I'm sorry.
In any case, maybe it's better to consider this view first because, I guess, it
is probably closer to end-user expectation and to common ordering in various
languages. (There are languages where composite letters sort completely apart,
though; if you know concrete examples, thank you.)
I'm considering the following design:
* Make it clear that the lib deals with character ranges "stupidly", according
to code values in sequence.
* Illustrate the consequences with examples such as "â" in decomposed and
precomposed forms.
* Process the source text as is, meaning that (1) if precomposed, composite
characters sort apart; (2) if decomposed, composite characters sort just after
their base.
* Leave the door open to an extension offering 3 modes (sketched below):
0. base mode: the source is left as is, unprocessed (or possibly just
decoded to a unicode sequence)
1. the source is preprocessed toward decomposed form
2. the source is preprocessed toward precomposed form
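A rough sketch of how such a mode switch could look; the nfd/nfc normalisers
below are identity stubs standing in for real ones (stock Lua has no Unicode
normalisation, so they would have to come from an external library or from
user-supplied tables):

    -- Placeholder normalisers: stubs only, NOT real implementations.
    local function nfd (s) return s end  -- would return the decomposed form
    local function nfc (s) return s end  -- would return the precomposed form

    local modes = {
      [0] = function (s) return s end,   -- base mode: source left as is
      [1] = nfd,                         -- toward decomposed form
      [2] = nfc,                         -- toward precomposed form
    }

    local function prepare (source, mode)
      return modes[mode or 0](source)
    end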
What do you think?
A solution à la Dirk may not be very practical for a general-purpose lib.
Actually, to be really useful, it would in many cases probably require users to
define their own map table, wouldn't it? And it is very costly: char ranges are,
together with plain literals, the base matching patterns of all string parsing
(ultimately, each char in the source is matched either as [part of] a literal or
as valid for a range); also, due to alternatives and other choices in grammars,
typically composed in layers of syntactic structures, each bit of source is
usually matched numerous times before matching succeeds. Thus, I guess,
overloading char ranges with a systematic gsub+mapping pass may multiply parsing
times.
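For concreteness, such a user-supplied mapping pass could look roughly like this
(a sketch of the idea, not Dirk's actual code; the map table and its single
entry are invented):

    -- Hypothetical user map: fold precomposed "â" (U+00E2) onto its
    -- decomposed form so that it sorts just after "a".
    local map = { ["\195\162"] = "a\204\130" }

    -- Rewrite every multi-byte UTF-8 sequence listed in the table;
    -- gsub with a table leaves sequences without an entry unchanged.
    local function premap (source)
      return (source:gsub("[\194-\244][\128-\191]+", map))
    end

Such a pass would have to run on every piece of text before matching, which is
where the cost multiplies.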
Otherwise, char-range matching is simple enough. To allow for any unicode source
(including pattern literals), a range is presently defined just as:
digit = Range {"0","9"}
Thus (unlike in regex-like formats) each range-border character can be arbitrarily
compound (multi-byte, multi-code) and, when composite, in arbitrary form
(decomposed or precomposed). Matching is then just checking whether, byte after
byte or code after code (if the source is decoded), the snippet in source is in
range, lexicographically.
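A minimal sketch of such a check, assuming UTF-8 byte strings and the easy case
where both border characters have the same byte length (the Range shape below is
my guess, not the lib's actual definition):

    -- Minimal range object: just stores the two border characters.
    local function Range (t)
      return { lo = t[1], hi = t[2] }
    end

    -- Fast path: both borders have the same byte length, so take that
    -- many bytes from the source and compare lexicographically.
    local function match_range (r, source, pos)
      local n = #r.lo
      if #r.hi ~= n then return nil end  -- general case: see below
      local snippet = source:sub(pos, pos + n - 1)
      if r.lo <= snippet and snippet <= r.hi then
        return pos + n  -- position right after the match
      end
    end

    local digit = Range {"0","9"}
    print(match_range(digit, "42", 1))  --> 2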
(However, without full normalisation I cannot simply compare whether the next
"character" is properly >= and <=, since I have no idea how many bytes or codes
said character covers. I need to check step by step until a given byte or code
says "no!" or "yes". A specialisation is easy, though, for the probably very
common, but not general, case where the border chars have the same length.)
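That step-by-step check could look like the sketch below (again over UTF-8
bytes; the function name and return convention are mine). It walks the source
one byte at a time, tracking whether the snippet read so far is already strictly
above the lower border and strictly below the upper one; a single byte can
settle the question, because once the snippet is strictly inside both borders no
continuation byte can move it back out:

    -- Decide, byte by byte, whether the snippet starting at `pos` lies
    -- lexicographically between borders `lo` and `hi`, without knowing
    -- in advance how many bytes the "character" covers.
    local function in_range (source, pos, lo, hi)
      local above, below = false, false  -- strictly past lo / under hi yet?
      local i = 1
      while true do
        local b = source:byte(pos + i - 1)
        local bl, bh = lo:byte(i), hi:byte(i)
        if b == nil then
          -- source exhausted: in range iff the lower border was covered
          return above or bl == nil
        end
        if not above then
          if bl == nil or b > bl then above = true  -- past lo for good
          elseif b < bl then return false           -- below lo: "no!"
          end
        end
        if not below then
          if bh == nil then return false            -- ran past hi: "no!"
          elseif b < bh then below = true           -- under hi for good
          elseif b > bh then return false           -- above hi: "no!"
          end
        end
        if above and below then return true end     -- this byte said "yes"
        i = i + 1
      end
    end

Where the matched character actually ends, i.e. how many bytes to consume, is a
separate concern, to be settled from the coding form itself.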
Seems sensible?
denis
[1] Or mixed (!): as in a-with-tilde-and-dot-below coded as a-with-tilde +
dot-below. One problem with precomposed forms is that they also introduce
half-composed forms...