- Subject: unicode char ranges
- From: spir <denis.spir@...>
- Date: Tue, 04 Dec 2012 10:28:30 +0100
Hello,
I'm running into an unexpected issue in the design of a text matching lib,
which among other weird points is supposed to work on utf8-encoded Unicode
source --possibly other encodings, if it turns out useful to decode the source
anyway. I was ready (like everyone else, it seems) to pretend to ignore the
issue of precomposed vs decomposed chars, which would simply mean no match
when the Unicode coding (not en-coding) differs between source and grammar:
eg "coração" would match only if the source and the grammar are both
decomposed or both precomposed. Same point about char sets, meaning sets of
_individual_ chars.
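To make the coding issue concrete, here is a minimal sketch in plain Lua (raw
byte escapes only, no Unicode library; purely illustrative, not part of the lib):

  -- The same abstract character "ã" in two Unicode codings, as raw UTF-8 bytes:
  local precomposed = "\195\163"    -- U+00E3 LATIN SMALL LETTER A WITH TILDE (C3 A3)
  local decomposed  = "a\204\131"   -- U+0061 'a' + U+0303 COMBINING TILDE (61 CC 83)

  print(precomposed == decomposed)  --> false: plain comparison cannot match them
  print(#precomposed, #decomposed)  --> 2   3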
But this appears impossible with char ranges, unless I'm missing a point. In
lexicographical order according to unicodes (Unicode code points) [1], a
decomposed "ã" would fall between "a" and "b", since its first code is a
base 'a' --which, for people unfamiliar with Unicode, is the same code as the
one for the full, simple character "a". Thus, for example, [a-z] would match
all decomposed latin-based lowercase letters, simple and composite (including
ones that are not in use in any language, like 'm' with a tilde or 'i' with
dots both below and above).
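A sketch of what goes wrong, as a naive range test on the first code of the
input (pure byte-level Lua, since the base 'a' of the decomposed form is a
single ASCII byte; the helper name is just for the example):

  -- Naive [a-z] test on the *first* code of the input. With decomposed input
  -- the first code is the base letter, so "a" + combining tilde passes, and
  -- the tilde is left dangling in the stream.
  local function matches_a_to_z(s)
    local first = s:byte(1)    -- first byte; equals the code point for ASCII
    return first ~= nil
       and first >= string.byte("a") and first <= string.byte("z")
  end

  print(matches_a_to_z("a\204\131"))  --> true  (decomposed "ã": base 'a' is in range)
  print(matches_a_to_z("\195\163"))   --> false (precomposed "ã", U+00E3, outside a-z)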
What do you think? What is the best solution, if any? I was thinking of
restricting char ranges in grammars to characters of the same (byte or code)
length, but it is in fact frequent to match wide ranges, eg all non-ascii
[\x80-\x10FFFF] or all of the BMP (Basic Multilingual Plane) [\x0-\xFFFF].
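For such wide ranges the test has to be made on decoded code points anyway; a
rough sketch (well-formed UTF-8 assumed, error handling omitted, helper names
just for the example):

  -- Decode the first UTF-8 code point, then compare it against a code range.
  local function first_codepoint(s)
    local b1 = s:byte(1)
    if b1 < 0x80 then
      return b1, 1
    elseif b1 < 0xE0 then                       -- 2-byte sequence
      return (b1 - 0xC0) * 0x40 + (s:byte(2) - 0x80), 2
    elseif b1 < 0xF0 then                       -- 3-byte sequence
      return (b1 - 0xE0) * 0x1000 + (s:byte(2) - 0x80) * 0x40 +
             (s:byte(3) - 0x80), 3
    else                                        -- 4-byte sequence
      return (b1 - 0xF0) * 0x40000 + (s:byte(2) - 0x80) * 0x1000 +
             (s:byte(3) - 0x80) * 0x40 + (s:byte(4) - 0x80), 4
    end
  end

  local function in_range(s, lo, hi)
    local code = first_codepoint(s)
    return code >= lo and code <= hi
  end

  print(in_range("\195\163", 0x80, 0x10FFFF))   --> true  (U+00E3 is non-ascii)
  print(in_range("a", 0x80, 0x10FFFF))          --> false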
At first sight, a proper solution involves full normalisation. For the record,
decomposed normalisation (NFD, the right way in my view) can be made rather
efficient if proper structures [2] are used for the Unicode data the process
needs [3] [4]. But it is anyway a bigger dev task than what I imagined
investing in the light library I have in mind, especially the preparation of
that data, and it means somewhat heavy & complicated stuff in the code proper
(I mean, typically, the kind that is difficult to maintain by anyone other
than the author).
Also, NFD normalisation does not solve the issue of characters in ranges,
since it yields decomposed character codings. Precomposed normalisation (NFC)
is more complicated and costly, since it requires decomposing first (or an
even more complicated algorithm, with no realised implementation I know of,
not even in the Unicode doc), and I have no idea of its actual cost.
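Just to show the shape of the data involved, here is a toy decomposition step
over a hand-written fragment of the mapping (the real table, generated from
the Unicode data files, has thousands of entries plus the algorithmic Hangul
part; the few mappings below are correct but obviously not a usable subset):

  -- Toy canonical decomposition: map a precomposed code to its decomposition,
  -- recursively (real data may need several passes or pre-flattened mappings).
  local decomp = {                 -- code point -> list of code points
    [0x00E1] = {0x0061, 0x0301},   -- á  ->  a + combining acute
    [0x00E3] = {0x0061, 0x0303},   -- ã  ->  a + combining tilde
    [0x00E7] = {0x0063, 0x0327},   -- ç  ->  c + combining cedilla
  }

  local function decompose(codes)
    local out = {}
    for _, c in ipairs(codes) do
      local d = decomp[c]
      if d then
        for _, dc in ipairs(decompose(d)) do out[#out + 1] = dc end
      else
        out[#out + 1] = c
      end
    end
    return out
  end

  -- "coração" with precomposed 'ç' and 'ã':
  local codes = {0x63, 0x6F, 0x72, 0x61, 0xE7, 0xE3, 0x6F}
  print(table.concat(decompose(codes), " "))
  --> 99 111 114 97 99 807 97 771 111

Note the table keyed directly by (sparse) code points; that is exactly the
kind of structure meant in [2].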
I would be pleased to avoid all that. The only way out I can see is to warn
users that the lib will probably fail on any decomposed input; but that is in
a sense a stupid attitude, because users (I mean programmers) have no way to
know, even less to guarantee, that their source data is fully composed. What
are we supposed to do? Note that pretending to ignore the issue of composed /
decomposed coding forms is not the same as not working at all for decomposed
sources (!): in the first case everything would work fine provided both the
source and the grammar are decomposed; but I don't know how to achieve this,
precisely, with char ranges. And we cannot live without char ranges, can we?
Hum. I may ask on the Unicode mailing list. [5]
Denis
[1] Thus also in utf8 form, since utf8 is designed so that comparison of utf8
strings yields the same result as comparison of the plain codes.
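A quick check of that property (Lua compares strings via the current locale;
in the default "C" locale this is plain byte order):

  local z     = "z"          -- U+007A
  local a_til = "\195\163"   -- U+00E3 (ã)
  local e_mac = "\196\147"   -- U+0113 (ē)
  print(z < a_til, a_til < e_mac)  --> true  true (same order as 0x7A < 0xE3 < 0x113)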
[2] In C routines for Lua, I imagine using C-side Lua tables. They would be
very appropriate since the data consists of arrays and sparse arrays (with
big holes, esp. for the Hangul data), and combinations of sets and ranges. I
don't know, however, how easy it is to reuse the implementation (I imagine it
is somewhat complicated). Or maybe doing it in C is not a big gain and
Lua-side Lua tables would do the job.
[3] Unlike what is sometimes stated, Hangul is no issue algorithmically; the
point is rather the size of the data tables required just for Hangul, and the
job of constructing them from the raw Unicode data available online.
A more basic issue is that Unicode's decomposed normalisations (NFD & NFKD)
do *not* state code order (!!!) (only the base code is _nearly_ always in 1st
position). We have to sort the (combining) codes inside characters according
to a custom ordering scheme, and thus cannot safely reuse source texts
normalised by other software if they don't use the same scheme, or simply if
we don't know which one they use...
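As a sketch of that reordering step, here is one candidate scheme, namely
Unicode's own canonical combining classes, with the classes of just two marks
hard-coded (the real table again comes from the Unicode data; function and
table names are just for the example):

  -- Reorder combining marks (class > 0) after each base code by combining
  -- class, with a stable pass so equal-class marks keep their relative order.
  local ccc = {                 -- code point -> canonical combining class
    [0x0323] = 220,             -- combining dot below
    [0x0303] = 230,             -- combining tilde
  }

  local function reorder(codes)
    local done = false
    while not done do
      done = true
      for i = 1, #codes - 1 do
        local a, b = ccc[codes[i]] or 0, ccc[codes[i + 1]] or 0
        if b ~= 0 and a > b then
          codes[i], codes[i + 1] = codes[i + 1], codes[i]
          done = false
        end
      end
    end
    return codes
  end

  -- 'a' + tilde + dot-below arrives in "wrong" order; reordering puts the
  -- dot below (class 220) before the tilde (class 230):
  print(table.concat(reorder({0x61, 0x0303, 0x0323}), " "))
  --> 97 803 771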
[4] In D, I managed to get NFD normalisation to run at about 2-3 times the
cost of decoding alone. ICU does better in C/C++, also using clever structures
(which I did not really understand). But D has a great advantage, namely that
substrings (actually all array slices) do not copy anything; instead they are
just (p, len) structs pointing into the original char array (just like the
original string/array structure in fact, except that the start pointer is not
on the 1st item).
Lua strings are very costly initially (due not only to allocation but to
hashing for interning), but once the strings exist comparison is blitz-fast.
Normalisation, however, does not involve comparison of substrings, only of
bytes initially and later of codes.
[5] I don't expect much good from such an attempt; experience shows they are
experts at dodging questions, esp. by playing with words such as "character"
and "grapheme", using them in senses only Unicode "officials" know of ;-).
(They are probably very good players of combat games.)