On Sun, May 22, 2005 at 04:18:29PM -0500, Rici Lake wrote:
> 
> On 22-May-05, at 3:10 PM, Klaus Ripke wrote:
> 
> That's fine for German. It doesn't work so well for non-roman based 
> alphabets.
> 
> I'm not 100% convinced about composed versus decomposed either.
I didn't want to suggest that composed is the way to go,
but rather that in any given environment you usually will
have some preferred "normal" form, as produced by the most
common means for entering text. 

I don't think our little not-that-bulky lib is German-centric
in any way. I didn't implement special casing just because the
szlig (ß) uppercases to double-S. Whoever wants to add it may do so.
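
To illustrate why the szlig needs special casing rather than a plain one-to-one table lookup, here is a small sketch. It uses Python purely for illustration and is not how the lib works; the helper name simple_upper is made up for this example.

```python
def simple_upper(ch: str) -> str:
    """One-to-one uppercasing (hypothetical helper): keep the character
    unchanged when its full uppercase mapping is not a single code point."""
    up = ch.upper()
    return up if len(up) == 1 else ch

sharp_s = "\u00df"               # LATIN SMALL LETTER SHARP S (szlig)
full = sharp_s.upper()           # full case mapping, may change the length
simple = simple_upper(sharp_s)   # what a 1:1 mapping table can express

print(repr(full), repr(simple))
```

A one-to-one table has no slot for a result that is longer than the input, which is why this case has to be added as a separate layer by whoever needs it.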

> Romance languages, holding the data in decomposed normalization makes 
> certain operations (such as stemming) considerably easier.
Of course. For CDS/ISIS I work a lot with people in
Latin America, but they usually have their data in Latin-1
or the Latin-1 subset of Unicode, i.e. composed forms.
All the data has been entered in Latin-1 (or Cp850) for years
and nobody felt any need to run a decomposing normalization.
The n with tilde is a very distinctive character,
causing much trouble when decomposed.
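
A short sketch of the two sides of this argument, using Python's standard unicodedata module only as a convenient illustration (the function strip_marks is a name made up here): decomposition makes mark-stripping trivial, but it also splits the n with tilde into two code points that can be separated.

```python
import unicodedata

composed = "a\u00f1o"                                  # precomposed U+00F1
decomposed = unicodedata.normalize("NFD", composed)    # n + U+0303 COMBINING TILDE

def strip_marks(s: str) -> str:
    """Decompose, then drop combining marks (a crude aid for stemming)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

print(len(composed), len(decomposed), strip_marks(composed))
```

Stripping the marks is easy once the data is decomposed, yet for Spanish it turns the n with tilde into a plain n, which is a different letter and here yields a different word.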

Sooo ... although there are canonical normal forms defined,
the most appropriate NF is not only a complex issue, but also
dependent on locale (as some of the special casings are).
And no, setlocale() is not the answer.
It's just another level to be added where needed.

> I'm not sure that we are on the same wavelength about graphemes either. 
I guess we are on the same wavelength about keeping things as
simple and small as possible at the base?

With regard to graphemes my wavelength is, lazy as I am,
to implement the cheap-and-easy part of
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
which is the standard Grapheme_Extend property in the UCD,
and leave adding Other_Grapheme_Extend and Hangul syllables
to those who need them.
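
A minimal sketch of that cheap-and-easy part, in Python for illustration only. It assumes that testing for general categories Mn and Me is a good enough stand-in for Grapheme_Extend (which the UCD derives from Me + Mn + Other_Grapheme_Extend); the function names are invented for this example and are not the lib's API.

```python
import unicodedata

def is_extend(ch: str) -> bool:
    # Approximation: Mn and Me only; Other_Grapheme_Extend is left out.
    return unicodedata.category(ch) in ("Mn", "Me")

def clusters(s: str) -> list:
    """Split s into clusters: a code point plus any extenders following it."""
    out = []
    for ch in s:
        if out and is_extend(ch):
            out[-1] += ch      # attach the extender to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

print(clusters("an\u0303o"))
```

Decomposed Hangul jamo are not extenders under this test, so a syllable written as three jamo comes out as three clusters, which is exactly the part left to those who need it.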

We provide four "personalities": ascii, latin1, utf8 and grapheme,
simply because their properties can be derived from the same table.
Adding more such personalities is no problem;
it just means adding more tables.
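
To make the idea of personalities concrete, here is a hedged sketch of how the same bytes could be measured under each of the four. The function length and its dispatch on a name are assumptions made for this illustration, not the lib's interface, and Python's unicodedata stands in for the shared table.

```python
import unicodedata

def length(data: bytes, personality: str) -> int:
    """Count 'characters' in data as seen by a given personality (hypothetical)."""
    if personality in ("ascii", "latin1"):
        return len(data)                    # one byte is one character
    text = data.decode("utf-8")
    if personality == "utf8":
        return len(text)                    # count code points
    if personality == "grapheme":
        # Count code points that start a cluster; Mn/Me approximate Grapheme_Extend.
        return sum(1 for i, ch in enumerate(text)
                   if i == 0 or unicodedata.category(ch) not in ("Mn", "Me"))
    raise ValueError("unknown personality: " + personality)

data = "an\u0303o".encode("utf-8")          # a, n, combining tilde, o
print([length(data, p) for p in ("latin1", "utf8", "grapheme")])
```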

Whether to bundle the full monty with all the specialties #defined
is then the distribution maintainer's headache.

> Have you looked at Indic languages?
Yes. Not too much trouble with grapheme clusters;
some use one of the 18 Other_Grapheme_Extend characters listed in
http://www.unicode.org/Public/UNIDATA/PropList.txt .
Hangul, on the other hand, requires special algorithms.
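
As one example of why Hangul is handled by algorithm rather than by table, here is a sketch of the syllable composition arithmetic from the Unicode Standard, written in Python for illustration. Whether a given library needs exactly this step is an assumption; the function name compose_hangul is made up for this example.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(lead: int, vowel: int, trail: int = T_BASE) -> int:
    """Combine leading consonant, vowel and optional trailing consonant
    (given as jamo code points) into one precomposed syllable code point."""
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = trail - T_BASE               # 0 means no trailing consonant
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("not a composable jamo sequence")
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index

syllable = compose_hangul(0x1112, 0x1161, 0x11AB)
print(hex(syllable))
```

The 11172 precomposed syllables follow from this formula, so no per-character table entry is needed, but the code has to know about the three jamo ranges.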



regards