- Subject: Re: Lua 5.1 and UTF-8 ?
- From: Rici Lake <lua@...>
- Date: Sun, 22 May 2005 16:18:29 -0500
On 22-May-05, at 3:10 PM, Klaus Ripke wrote:
they specify a couple, but they do not matter that much in practice.
E.g. in Germany, with our beloved umlauts, it would be next to
suicidal for any application to provide them in decomposed
normal form by default, and probably this holds at least for
Latin-1 land. Where you really have to deal with data integration
from heterogeneous sources, you're in much bigger trouble anyway.
And as it involves complex and expensive algorithms, and
is a moving target, you have to have the choice which
normalize() you call where.
That's fine for German. It doesn't work so well for non-Roman-based scripts.
I'm not 100% convinced about composed versus decomposed either. In
Romance languages, holding the data in decomposed normalization makes
certain operations (such as stemming) considerably easier. (Romance
languages differ significantly from Nordic and Germanic languages in
their approach to diacritics; in Spanish, for example, accented letters
are *not* distinct letters.)
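A minimal sketch of why decomposed data can make that kind of operation easier. It uses Python's standard `unicodedata` module purely for illustration, since Lua 5.1 ships no normalization routine of its own. The helper name `strip_accents` and the sample word are assumptions made for the example, not anything from this thread:

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Hypothetical helper: drop combining marks after decomposing to NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    # combining() returns a non-zero class for combining accents such as U+0301.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = "canci\u00f3n"                      # "canción" with precomposed U+00F3
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))          # 7 8
print(strip_accents(composed))                 # cancion
```

Once the data is held in decomposed form, removing the accent is a plain filter over code points. With composed data the same step needs a decomposition pass (or a lookup table) first.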
I'm not sure that we are on the same wavelength about graphemes either.
Have you looked at Indic languages?
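To make the grapheme question concrete, here is a small sketch, again in Python with the standard `unicodedata` module as an illustrative stand-in. The Devanagari syllable is an example chosen for this sketch. It is displayed as a single unit, yet no normalization form reduces it to one code point:

```python
import unicodedata

# Devanagari KA followed by VOWEL SIGN I: one displayed unit, two code points.
syllable = "\u0915\u093f"

# Neither the composed nor the decomposed form changes the code point count.
lengths = {form: len(unicodedata.normalize(form, syllable))
           for form in ("NFC", "NFD")}
print(lengths)

for ch in syllable:
    # The vowel sign is a spacing mark (category Mc), not a base letter.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

So choosing a normalization form does not settle where the grapheme boundaries are. For scripts like this one, that takes a separate segmentation step.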