It's early in the morning of a new day, so I'm able to read long and
complicated posts and to write detailed answers. Step One is to break
down the OP's essay into items for easy reference.

> 1.1 design of a text matching lib,
> 1.2 supposed to work with utf8-encoded Unicode
> 1.3 possibly other encodings if it appears useful to decode the
>     source anyway
> 1.4 the issue of precomposed vs decomposed chars
> 1.5 which would simply not match if the Unicode coding
>     (not en-coding) is not the same in source and grammar

> 2.1 this appears impossible with [without?] char ranges
> 2.2 What is the best solution, if any?

> 3.1 it is in fact frequent to match wide ranges eg
> 3.1.1 all non-ascii [\x80-\x110000]
> 3.1.2 all BMP (Basic Multilingual Plane) [\x0-\x1000].

> 4.1 decomposed normalisation [... is] a dev task bigger anyway than
>     what I imagined
> 4.2 light library [...] in mind
> 4.3 difficult to maintain by anyone else as the author.

> 5.1 I would be pleased to avoid all that.
> 5.2 warn users that the lib probably will fail when dealing with any
>     decomposed input
> 5.3 users (I mean programmers) have no way to know, even less to
>     guarantee, their source data is fully composed.
> 5.4 we cannot live without char ranges, can we?

> 6.1 I may ask on the Unicode mailing list. I don't expect any good
>     from such an attempt, experience shows they are "esquive experts",
>     esp. by playing with words such as "character" and "grapheme",
>     using them in senses only Unicode "officials" know of ;-).
> 6.2 they probably are very good players of combat games.

Point 4.2 and the fact that the issue was raised on lua-l suggest that
the OP is looking for something achievable with the Lua string library.

Points 2.1 and 5.4 show that the OP is resigned to not achieving the full
generality demanded by 3.1.1, but may be prepared to live with 3.1.2.

Points 6.1 and 6.2 suggest that the OP is not looking for a solution
that can accommodate all the subtleties of the full current Unicode
standard as translated to UTF-8.

My strategy (which may not be deducible from a casual first look
at my implementation) is based on the following observations; a short
Lua sketch follows the list.

A. In a valid UTF-8 string, replacing a multibyte character by
   a one-byte ASCII character leaves the other multibyte characters
   invariant.  Repeating this process for every multibyte character
   in a particular class leaves a valid UTF-8 string in which the
   multibyte characters not in that class are still there and every
   character in the class has been translated to ASCII.

B. When a Lua pattern describes a range of Unicode characters whose
   UTF-8 encodings differ only in their last byte, matching it against
   a valid UTF-8 string cannot miss any such character and cannot
   accidentally match anything else.

C. Because gsub accepts a table as its replacement argument, the
   translation can be written out as a directly visible table.  If more
   than one block of 128 Unicode characters is to be supported, a
   separate table for each would be needed; this is not required for
   the files I work with.

D. It is therefore possible to define a function asciize(str) such that
   (i) Every supported multibyte character maps to an ASCII character.
  (ii) The presence of unsupported characters in `str` can be detected
       by the presence of non-ASCII characters in `asciize(str)`.
 (iii) For many Lua patterns (I would not care to define, against the
       scrutiny of `esquive experts`, exactly which), the success of
          `asciize(source):match(asciize(pattern))`
       is a necessary but not sufficient condition for the success of
          `utf8_match(source,pattern)`
       for some reasonably intuitive definition of `utf8_match`.
          `asciize(source:upper()):match(asciize(pattern:upper()))`
       is a weaker necessary condition that can sometimes be useful.

After that, post-processing is required. This is not something that
allows a one-size-fits-all approach.  Say the application is to scan
Internet articles on classical music for references to the composer
Gabriel Fauré. Do you wish to limit the search to contributions by
authors meticulous enough not to have typed Faure, Faurè or even Faurê?
You can, by rescanning the texts found for the exact name.  Either way,
the 99% of articles that mention only Bach, Beethoven, Brahms etc. can
safely be ignored.
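
As an illustration, here is a sketch of that two-stage scan, reusing
the asciize sketch above and taking the stricter of the two choices
(accept only the correctly spelled name); the name mentions_faure and
the articles table are hypothetical.

    local function mentions_faure(text)
      -- Stage 1: cheap necessary test on the ASCII-ized text; articles
      -- that never mention the name fail here.
      if not asciize(text):find("Faure") then return false end
      -- Stage 2: exact post-processing, accepting only the correct
      -- spelling ("Fauré" in UTF-8, searched as a plain string).
      return text:find("Faur\xC3\xA9", 1, true) ~= nil
    end

    for _, text in ipairs(articles) do
      if mentions_faure(text) then io.write(text, "\n") end
    end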

I think this approach is genuine, correct, reasonable, extensible and
useful. It is also lightweight (to achieve 3.1.1 requires only 16 tables)
and easily maintainable by someone other than the author.