- Subject: unicode char ranges
- From: spir <denis.spir@...>
- Date: Tue, 04 Dec 2012 10:28:30 +0100
Hello,
I'm running into an unexpected issue in the design of a text matching lib,
which among other weird points is supposed to work on utf8-encoded Unicode
source --possibly other encodings, if it turns out useful to decode the source
anyway. I was ready (like everyone else, it seems) to pretend to ignore the
issue of precomposed vs decomposed chars, which would simply mean no match
when the Unicode coding (not en-coding) differs between source and grammar:
eg "coração" would match only if the source and the grammar are both
decomposed or both precomposed. Same point about char sets, meaning sets of
_individual_ chars.
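To make the coding issue concrete, here is a minimal sketch in plain Lua (raw
byte escapes only, no Unicode library; purely illustrative, not part of the lib):

  -- The same abstract character "ã" in two Unicode codings, as raw UTF-8 bytes:
  local precomposed = "\195\163"    -- U+00E3 LATIN SMALL LETTER A WITH TILDE (C3 A3)
  local decomposed  = "a\204\131"   -- U+0061 'a' + U+0303 COMBINING TILDE (61 CC 83)

  print(precomposed == decomposed)  --> false: plain comparison cannot match them
  print(#precomposed, #decomposed)  --> 2   3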
But this appears impossible with char ranges, unless I'm missing a point. In
lexicographical order according to unicodes (Unicode code points) [1], a
decomposed "ã" would fall between "a" and "b", since its first code is a
base 'a' --which, for people unfamiliar with Unicode, is the same code as the
one for the full, simple character "a". Thus, for example, [a-z] would match
all decomposed latin-based lowercase letters, simple and composite (including
ones that are not in use in any language, like 'm' with a tilde or 'i' with
dots both below and above).
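A sketch of what goes wrong, as a naive range test on the first code of the
input (pure byte-level Lua, since the base 'a' of the decomposed form is a
single ASCII byte; the helper name is just for the example):

  -- Naive [a-z] test on the *first* code of the input. With decomposed input
  -- the first code is the base letter, so "a" + combining tilde passes, and
  -- the tilde is left dangling in the stream.
  local function matches_a_to_z(s)
    local first = s:byte(1)    -- first byte; equals the code point for ASCII
    return first ~= nil
       and first >= string.byte("a") and first <= string.byte("z")
  end

  print(matches_a_to_z("a\204\131"))  --> true  (decomposed "ã": base 'a' is in range)
  print(matches_a_to_z("\195\163"))   --> false (precomposed "ã", U+00E3, outside a-z)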
What do you think? What is the best solution, if any? I was thinking of
restricting char ranges in grammars to characters of the same (byte or code)
length, but it is in fact frequent to match wide ranges, eg all non-ascii
[\x80-\x10FFFF] or all of the BMP (Basic Multilingual Plane) [\x0-\xFFFF].
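For such wide ranges the test has to be made on decoded code points anyway; a
rough sketch (well-formed UTF-8 assumed, error handling omitted, helper names
just for the example):

  -- Decode the first UTF-8 code point, then compare it against a code range.
  local function first_codepoint(s)
    local b1 = s:byte(1)
    if b1 < 0x80 then
      return b1, 1
    elseif b1 < 0xE0 then                       -- 2-byte sequence
      return (b1 - 0xC0) * 0x40 + (s:byte(2) - 0x80), 2
    elseif b1 < 0xF0 then                       -- 3-byte sequence
      return (b1 - 0xE0) * 0x1000 + (s:byte(2) - 0x80) * 0x40 +
             (s:byte(3) - 0x80), 3
    else                                        -- 4-byte sequence
      return (b1 - 0xF0) * 0x40000 + (s:byte(2) - 0x80) * 0x1000 +
             (s:byte(3) - 0x80) * 0x40 + (s:byte(4) - 0x80), 4
    end
  end

  local function in_range(s, lo, hi)
    local code = first_codepoint(s)
    return code >= lo and code <= hi
  end

  print(in_range("\195\163", 0x80, 0x10FFFF))   --> true  (U+00E3 is non-ascii)
  print(in_range("a", 0x80, 0x10FFFF))          --> false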
At first sight, a proper solution involves full normalisation. For the record,
decomposed normalisation (NFD, the right way in my view) can be made rather
efficient if proper structures [2] are used for the Unicode data the process
needs [3] [4]. But it is anyway a bigger dev task than what I imagined
investing in the light library I have in mind, especially the preparation of
that data, and it means somewhat heavy & complicated stuff in the code proper
(I mean, typically, the kind that is difficult to maintain by anyone other
than the author).
Also, NFD normalisation does not solve the issue of characters in ranges,
since it yields decomposed character codings. Precomposed normalisation (NFC)
is more complicated and costly, since it requires decomposing first (or an
even more complicated algorithm, with no realised implementation I know of,
not even in the Unicode doc), and I have no idea of its actual cost.
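Just to show the shape of the data involved, here is a toy decomposition step
over a hand-written fragment of the mapping (the real table, generated from
the Unicode data files, has thousands of entries plus the algorithmic Hangul
part; the few mappings below are correct but obviously not a usable subset):

  -- Toy canonical decomposition: map a precomposed code to its decomposition,
  -- recursively (real data may need several passes or pre-flattened mappings).
  local decomp = {                 -- code point -> list of code points
    [0x00E1] = {0x0061, 0x0301},   -- á  ->  a + combining acute
    [0x00E3] = {0x0061, 0x0303},   -- ã  ->  a + combining tilde
    [0x00E7] = {0x0063, 0x0327},   -- ç  ->  c + combining cedilla
  }

  local function decompose(codes)
    local out = {}
    for _, c in ipairs(codes) do
      local d = decomp[c]
      if d then
        for _, dc in ipairs(decompose(d)) do out[#out + 1] = dc end
      else
        out[#out + 1] = c
      end
    end
    return out
  end

  -- "coração" with precomposed 'ç' and 'ã':
  local codes = {0x63, 0x6F, 0x72, 0x61, 0xE7, 0xE3, 0x6F}
  print(table.concat(decompose(codes), " "))
  --> 99 111 114 97 99 807 97 771 111

Note the table keyed directly by (sparse) code points; that is exactly the
kind of structure meant in [2].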
I would be pleased to avoid all that. The only way out I can see is to warn
users that the lib will probably fail on any decomposed input; but that is in
a sense a stupid attitude, because users (I mean programmers) have no way to
know, even less to guarantee, that their source data is fully composed. What
are we supposed to do? Note that pretending to ignore the issue of composed /
decomposed coding forms is not the same as not working at all for decomposed
sources (!): in the first case everything would work fine provided both the
source and the grammar are decomposed; but I don't know how to achieve this,
precisely, with char ranges. And we cannot live without char ranges, can we?
Hum. I may ask on the Unicode mailing list. [5]
Denis
[1] Thus also in utf8 form, since utf8 is designed so that comparison of utf8
strings yields the same result as comparison of the plain codes.
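A quick check of that property (Lua compares strings via the current locale;
in the default "C" locale this is plain byte order):

  local z     = "z"          -- U+007A
  local a_til = "\195\163"   -- U+00E3 (ã)
  local e_mac = "\196\147"   -- U+0113 (ē)
  print(z < a_til, a_til < e_mac)  --> true  true (same order as 0x7A < 0xE3 < 0x113)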
[2] In C routines for Lua, I imagine using C-side Lua tables. They would be
very appropriate since the data consists of arrays and sparse arrays (with
big holes, esp. for the Hangul data), and combinations of sets and ranges. I
don't know, however, how easy it is to reuse the implementation (I imagine it
is somewhat complicated). Or maybe doing it in C is not a big gain and
Lua-side Lua tables would do the job.
[3] Unlike what is sometimes stated, Hangul is no issue algorithmically; the
point is rather the size of the data tables required just for Hangul, and the
job of constructing them from the raw Unicode data available online.
A more basic issue is that Unicode's decomposed normalisations (NFD & NFKD)
do *not* state code order (!!!) (only the base code is _nearly_ always in 1st
position). We have to sort the (combining) codes inside characters according
to a custom ordering scheme, and thus cannot safely reuse source texts
normalised by other software if they don't use the same scheme, or simply if
we don't know which one they use...
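As a sketch of that reordering step, here is one candidate scheme, namely
Unicode's own canonical combining classes, with the classes of just two marks
hard-coded (the real table again comes from the Unicode data; function and
table names are just for the example):

  -- Reorder combining marks (class > 0) after each base code by combining
  -- class, with a stable pass so equal-class marks keep their relative order.
  local ccc = {                 -- code point -> canonical combining class
    [0x0323] = 220,             -- combining dot below
    [0x0303] = 230,             -- combining tilde
  }

  local function reorder(codes)
    local done = false
    while not done do
      done = true
      for i = 1, #codes - 1 do
        local a, b = ccc[codes[i]] or 0, ccc[codes[i + 1]] or 0
        if b ~= 0 and a > b then
          codes[i], codes[i + 1] = codes[i + 1], codes[i]
          done = false
        end
      end
    end
    return codes
  end

  -- 'a' + tilde + dot-below arrives in "wrong" order; reordering puts the
  -- dot below (class 220) before the tilde (class 230):
  print(table.concat(reorder({0x61, 0x0303, 0x0323}), " "))
  --> 97 803 771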
[4] In D, I managed to get NFD normalisation to run at about 2-3 times the
cost of decoding alone. ICU does better in C/C++, also using clever structures
(which I did not really understand). But D has a great advantage, namely that
substrings (actually all array slices) do not copy anything; instead they are
just (p, len) structs pointing into the original char array (just like the
original string/array structure in fact, except that the start pointer is not
on the 1st item).
Lua strings are very costly initially (due not only to allocation but to
hashing for interning), but once the strings exist comparison is blitz-fast.
Normalisation, however, does not involve comparison of substrings, only of
bytes initially and later of codes.
[5] I don't expect much good from such an attempt; experience shows they are
experts at dodging questions, esp. by playing with words such as "character"
and "grapheme", using them in senses only Unicode "officials" know of ;-).
(They are probably very good players of combat games.)