Re: Lua 5.1 and UTF-8 ?

lua-l archive

Subject: Re: Lua 5.1 and UTF-8 ?
From: Rici Lake <lua@...>
Date: Sun, 22 May 2005 16:18:29 -0500


On 22-May-05, at 3:10 PM, Klaus Ripke wrote:

they specify a couple, but they do not matter that much in practice.
E.g. in Germany, with our beloved umlauts, it would be next to
suicidal for any application to provide them in decomposed
normal form by default, and probably this holds at least for
Latin-1 land. Where you really have to deal with data integration
from heterogenous sources, you're in much bigger trouble anyway.
And as it involves complex and expensive algorithms, and
is a moving target, you have to have the choice which
normalize() you call where.

That's fine for German. It doesn't work so well for non-Roman-based alphabets.

I'm not 100% convinced about composed versus decomposed either. In Romance languages, holding the data in decomposed normalization makes certain operations (such as stemming) considerably easier. (Romance languages differ significantly from Nordic and Germanic languages in their approach to diacritics; in Spanish, for example, accented letters are *not* distinct letters.)
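
As a rough illustration of the stemming point, here is a minimal sketch in Python (chosen only because its standard library ships the Unicode normalization tables; Lua 5.1 has nothing equivalent built in). The helper name strip_accents is hypothetical, not something from this thread. It assumes that for this purpose accent marks can simply be dropped once the text is in decomposed form.

```python
import unicodedata

def strip_accents(word):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    # unicodedata.combining() returns a non-zero class for combining accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = unicodedata.normalize("NFC", "canci\u00f3n")   # o-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)       # 'o' followed by U+0301

# Same visible word, different number of code points.
print(len(composed), len(decomposed))

# In decomposed form the accent is a separate code point, so removing it is
# a simple filter and the accented and plain spellings compare equal.
print(strip_accents(composed) == strip_accents("cancion"))
```

Note that this naive filter also folds the tilde in "ñ", which Spanish does treat as a distinct letter, so a real stemmer would need to be more selective about which marks it removes.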

I'm not sure that we are on the same wavelength about graphemes either. Have you looked at Indic languages?
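
To make the Indic question concrete, a small sketch (again Python, again only an illustration and not anything proposed in this thread): the Devanagari sequence below is a conjunct that a reader would usually treat as one written unit, yet it is four code points, and composed normalization does not shrink it because no precomposed form exists.

```python
import unicodedata

# Devanagari: KA, VIRAMA, SSA, VOWEL SIGN I
cluster = "\u0915\u094d\u0937\u093f"
nfc = unicodedata.normalize("NFC", cluster)

# NFC leaves the sequence unchanged; it is still four code points.
print(len(cluster), len(nfc), nfc == cluster)

# List each code point with its name and general category.
for ch in cluster:
    print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
```

So "always store composed form" does not by itself give one code point per grapheme here; where the grapheme boundaries fall has to be decided by separate rules.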

Follow-Ups:
- Re: Lua 5.1 and UTF-8 ?, Klaus Ripke

References:
- Lua 5.1 and UTF-8 ?, Asko Kauppi
- Re: Lua 5.1 and UTF-8 ?, Rici Lake
- Re: Lua 5.1 and UTF-8 ?, Klaus Ripke
