- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Elias Barrionovo <elias.tandel@...>
- Date: Thu, 17 Apr 2014 12:13:27 -0300
On Thu, Apr 17, 2014 at 2:32 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> For finding words in text, I have been using a mapping from accented
> Latin characters into ASCII. The mapping is confined to characters
> that combine a single letter from the Latin alphabet with diacritics,
> and it does not matter whether one's compose key or one's dead-letter
> key added the diacritics.
One trick I use a lot in Python is this:
    import unicodedata

    def remove_diacritic(s):
        return unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore')  # bytes on Python 3
IIRC, it reads like "decompose each character into a base letter plus
combining marks, then encode the string as ASCII, ignoring whatever
ASCII cannot represent". The result is that things like "açafrão"
become "acafrao". This means it could be implemented in Lua if there
were a way to normalize Unicode strings. However, I don't know enough
Unicode to know how hard or complex it would be to build Unicode
normalization into core Lua.
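To make the "split" step visible, here is a minimal sketch of the same idea that drops the combining marks explicitly instead of relying on the ASCII encoder. The name strip_diacritics is made up for illustration; unicodedata.normalize and unicodedata.combining are from the Python standard library.

```python
import unicodedata

def strip_diacritics(s):
    # NFKD splits each accented letter into a base letter followed by
    # combining marks, e.g. "ç" becomes "c" + U+0327 COMBINING CEDILLA.
    decomposed = unicodedata.normalize('NFKD', s)
    # combining() returns a nonzero class for combining marks, so this
    # keeps the base letters and discards only the marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("açafrão"))
```

Unlike the encode('ASCII', 'ignore') version, this keeps characters that have no decomposition (such as "ß") rather than silently deleting them, which may or may not be what you want when matching words.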
--
NI!
() - www.asciiribbon.org
/\ - ascii ribbon campaign against html e-mail and proprietary attachments