Hello,

I'm running into an unexpected issue in the design of a text-matching lib which, among other oddities, is supposed to work with utf8-encoded Unicode source --possibly other encodings, if it turns out useful to decode the source anyway. I was ready (like everyone else, it seems) to pretend to ignore the issue of precomposed vs decomposed characters; a match would then simply fail whenever the Unicode coding (not en-coding) differs between source and grammar: eg "coração" would match only if source and grammar are both decomposed or both precomposed. The same goes for char sets, ie sets of _individual_ chars.
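
To make the issue concrete, a tiny illustration (assuming Lua 5.3+ and its standard utf8 library; none of this is code from the lib itself): the precomposed and decomposed spellings of "ã" are different code sequences, so a naive literal match fails.

  local precomposed = utf8.char(0x00E3)          -- "ã" as the single code U+00E3
  local decomposed  = utf8.char(0x0061, 0x0303)  -- "a" + COMBINING TILDE U+0303

  print(precomposed == decomposed)               --> false
  print(#precomposed, #decomposed)               --> 2  3  (bytes)
  -- plain (non-pattern) find: the precomposed needle is not found
  print(string.find(decomposed, precomposed, 1, true))  --> nil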

But this appears impossible with char ranges, unless I'm missing something. In lexicographical order by unicodes (Unicode code points) [1], a decomposed "ã" falls between "a" and "b", since its first code is the base 'a' --which, for people unfamiliar with Unicode, is the very same code as the one for the full, simple character "a". Thus, for example, [a-z] would match all decomposed latin-based lowercase letters, simple and composite (including ones not in use in any language, like 'm' with a tilde or 'i' with dots both below and above).
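
A rough sketch of what I mean (in_range is a made-up stand-in for whatever range test the lib would use):

  -- a code point range test only ever sees the base letter of a
  -- decomposed character, so decomposed "ã" lands inside [a-z]
  local function in_range(cp, lo, hi) return cp >= lo and cp <= hi end

  local decomposed = utf8.char(0x0061, 0x0303)   -- "a" + COMBINING TILDE
  local first = utf8.codepoint(decomposed, 1)    -- 0x61, ie the plain 'a'
  print(in_range(first, 0x61, 0x7A))             --> true: [a-z] "matches"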

What do you think? What is the best solution, if any? I was thinking of restricting char ranges in grammars to characters of the same (byte or code) length, but in fact it is frequent to match wide ranges, eg all non-ascii [\x80-\x10FFFF] or the whole BMP (Basic Multilingual Plane) [\x0-\xFFFF].
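
For what it's worth, the restriction itself would be trivial to check; say, with a made-up helper like this (byte-length variant), which indeed rejects exactly those wide ranges:

  local function same_utf8_len(lo, hi)
    -- both endpoints encode to the same number of utf8 bytes
    return #utf8.char(lo) == #utf8.char(hi)
  end
  print(same_utf8_len(0x61, 0x7A))       --> true   [a-z]
  print(same_utf8_len(0x80, 0x10FFFF))   --> false  all non-ascii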

At first sight, a proper solution involves full normalisation. For the record, decomposed normalisation (NFD, the right way in my view) can be made rather efficient if proper structures [2] are used for the Unicode data the process needs [3] [4]. But it is a bigger dev task than I intended to invest in the light library I have in mind, especially the preparation of that data, and it means rather heavy & complicated stuff in the code proper (typically the kind that is hard to maintain by anyone other than the author). Also, NFD normalisation does not by itself solve the issue of characters in ranges, since it yields decomposed character codings. Precomposed normalisation is more complicated and costly, since it requires decomposing first (or an even more complicated algorithm, with no realised implementation I know of, even in the Unicode doc), and I have no idea of its actual cost.
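
Just to show the shape of the thing, a very rough NFD sketch (the two tables are hand-filled stand-ins for the real Unicode data, which is precisely the heavy part; real NFD also decomposes recursively, which this does not):

  local decomp = { [0x00E3] = {0x0061, 0x0303} }   -- ã -> a + combining tilde
  local ccc    = { [0x0303] = 230 }                -- combining classes, cf [3]

  local function nfd(s)
    local out = {}
    for _, cp in utf8.codes(s) do
      for _, d in ipairs(decomp[cp] or {cp}) do out[#out+1] = d end
    end
    -- reorder adjacent combining codes (non-zero class) into
    -- non-decreasing class order, keeping the sort stable
    for i = 2, #out do
      local j = i
      while j > 1 and (ccc[out[j]] or 0) > 0
                and (ccc[out[j-1]] or 0) > (ccc[out[j]] or 0) do
        out[j], out[j-1] = out[j-1], out[j]
        j = j - 1
      end
    end
    return utf8.char(table.unpack(out))
  end

  print(nfd(utf8.char(0x00E3)) == utf8.char(0x0061, 0x0303))  --> true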

I would be pleased to avoid all that. The only way out I can see is to warn users that the lib will probably fail on any decomposed input; but that is in a sense a stupid attitude, because users (I mean programmers) have no way to know, let alone guarantee, that their source data is fully composed. What are we supposed to do? Note that pretending to ignore the issue of composed / decomposed coding forms is not the same as not working at all for decomposed sources (!): in the first case everything would work fine as long as both the source and the grammar are decomposed; but I don't know how to achieve precisely this with char ranges. And we cannot live without char ranges, can we?

Hum. I may ask on the Unicode mailing list. [5]

Denis

[1] Thus also in utf8 form, since utf8 is designed so that byte-wise comparison of utf8 strings yields the same result as comparison of the plain code points.
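
A quick check of this (Lua's `<` on strings is plain byte-wise comparison under the C locale):

  print(utf8.char(0x00E9) < utf8.char(0x4E2D))   --> true, just as 0xE9 < 0x4E2D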

[2] In C routines for Lua, I imagine using C-side Lua tables. They would be very appropriate, since the data consists of arrays and sparse arrays (with big holes, esp. for the Hangul data) and of combinations of sets and ranges. I don't know, however, how easy it is to reuse the table implementation from C (I imagine it is somewhat complicated). Or maybe doing it in C is not a big gain and Lua-side Lua tables would do the job.
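
The shape I have in mind, roughly (the values below are placeholders, apart from the two sample decompositions and the Hangul block bounds):

  local decomp = {                   -- sparse: code point -> decomposition
    [0x00E0] = {0x0061, 0x0300},     -- à
    [0x00E3] = {0x0061, 0x0303},     -- ã
    -- ... big holes in between ...
  }
  local hangul = { first = 0xAC00, last = 0xD7A3 }   -- plus ranges, eg the Hangul block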

[3] Unlike what is sometimes stated, Hangul is no issue algorithmically; the point is rather the size of the data tables required just for Hangul, and the job of constructing them from the raw Unicode data available online. A more basic issue is that Unicode's decomposed normalisations (NFD & NFKD) do *not* state the code order (!!!) (only the base code is _nearly_ always in first position). We have to sort the combining codes inside characters according to a custom ordering scheme, and thus cannot safely reuse source texts normalised by other software if they don't use the same scheme, or simply if we don't know which one they used...

[4] In D, I managed to have NFD normalisation run at about 2-3 times the cost of decoding alone. ICU does better in C/C++, also using clever structures (which I did not really understand). But D has a great advantage: substrings (actually all array slices) don't copy anything; they are just (pointer, length) structs pointing into the original char array (just like the original string/array structure, in fact, except that the start pointer is not on the first item). Lua strings are very costly to create (due not only to allocation but to hashing for interning), but once they exist comparison is blitz-fast. Normalisation, however, does not involve comparing substrings, only bytes at first and codes later.

[5] I don't expect much good from such an attempt; experience shows they are experts at dodging, esp. by playing with words such as "character" and "grapheme", using them in senses only Unicode "officials" know of ;-). (They probably are very good players of combat games.)