
On 22-May-05, at 3:10 PM, Klaus Ripke wrote:

they specify a couple, but they do not matter that much in practice.
E.g. in Germany, with our beloved umlauts, it would be next to
suicidal for any application to provide them in decomposed
normal form by default, and probably this holds at least for
Latin-1 land. Where you really have to deal with data integration
from heterogeneous sources, you're in much bigger trouble anyway.
And as it involves complex and expensive algorithms, and
is a moving target, you have to have the choice which
normalize() you call where.

That's fine for German. It doesn't work so well for scripts that are not based on the Roman alphabet.

I'm not 100% convinced about composed versus decomposed either. In Romance languages, holding the data in decomposed normal form makes certain operations (such as stemming) considerably easier. (Romance languages differ significantly from Nordic and Germanic languages in their approach to diacritics; in Spanish, for example, accented vowels are *not* distinct letters.)
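
To illustrate the point about decomposed data, here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `strip_accents` is made up for this example, and dropping every combining mark is a simplifying assumption rather than a real stemming rule.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "canción" written with a precomposed o-acute (U+00F3).
composed = unicodedata.normalize("NFC", "canci\u00f3n")
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 7 8
print(strip_accents(composed))         # cancion
```

In the decomposed form the accent is a separate code point, so removing or ignoring it is a simple filter. Note that this sketch would also fold "ñ" to "n", which is wrong for Spanish, where ñ is a distinct letter; a real stemmer would have to treat the tilde differently from the acute accent.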

I'm not sure that we are on the same wavelength about graphemes either. Have you looked at Indic languages?
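
As a rough illustration of why Indic scripts complicate the picture, the sketch below inspects a Devanagari conjunct syllable that a reader would typically treat as one unit. It uses only Python's standard `unicodedata` module; the choice of sample syllable is mine, and the standard library does not segment graphemes, so this only shows the code-point view.

```python
import unicodedata

# KA, VIRAMA, SSA, VOWEL SIGN I: the syllable "kshi" in Devanagari.
conjunct = "\u0915\u094d\u0937\u093f"

# Composing normalization does not shrink it: there is no precomposed form.
nfc = unicodedata.normalize("NFC", conjunct)
print(len(nfc))  # 4

# Each code point with its general category (Lo = letter, Mn/Mc = marks).
categories = [unicodedata.category(ch) for ch in nfc]
for ch, cat in zip(nfc, categories):
    print(f"U+{ord(ch):04X}", cat, unicodedata.name(ch))
```

Unlike an umlaut, this syllable has no single-code-point composed form, so choosing a composed normal form does not make "one code point per perceived character" true here.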