- Subject: Re: Matching multibyte alphabetical characters with LPeG
- From: Jay Carlson <nop@...>
- Date: Sun, 17 Jun 2012 15:52:49 -0400
On Jun 17, 2012, at 9:38 AM, Miles Bader wrote:
> Hinrik Örn Sigurðsson <firstname.lastname@example.org> writes:
>> I've been making a parser with LPeG and I've run into the issue that I can't
>> match non-ASCII words even though I'm using a utf8 locale. It seems that
>> "alpha" (and "alnum", etc) from lpeg.locale() don't match anything beyond ASCII.
>> See the following code:
>> local lpeg = require 'lpeg'
>> local locale = lpeg.locale()
>> print(lpeg.match(lpeg.C(lpeg.P("æ")), "æ")) --> æ
>> print(lpeg.match(lpeg.C(locale.alpha), "æ")) --> nil
>> Is there an easy way to match non-ASCII alphabetical characters with LPeG?
> No -- it's not so hard to parse utf-8 characters, but testing a
> property like "alphabetic" requires unicode tables, which are a huge
> and bloated dependency.
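To illustrate the "not so hard to parse utf-8 characters" half of that, here is a minimal sketch in LPeg. It matches multibyte sequences by their byte structure only. The names `cont`, `nonascii`, `letter` and `word` are made up for this example, and the rule "any non-ASCII character counts as a letter" is an assumption of the sketch, not a Unicode property test: it will happily accept "€" or a dash as part of a word. Overlong encodings and surrogates are not rejected either.

```lua
local lpeg = require 'lpeg'
local R, C = lpeg.R, lpeg.C

-- A UTF-8 continuation byte has the form 10xxxxxx (0x80..0xBF).
local cont = R("\128\191")

-- Any multibyte UTF-8 sequence, recognized by its lead byte:
--   0xC2..0xDF  two bytes
--   0xE0..0xEF  three bytes
--   0xF0..0xF4  four bytes
-- Simplification: overlong forms and surrogates are not filtered out.
local nonascii = R("\194\223") * cont
               + R("\224\239") * cont * cont
               + R("\240\244") * cont * cont * cont

-- ASSUMPTION: every non-ASCII character is treated as alphabetic.
-- A real "alpha" test would need Unicode tables.
local letter = R("az", "AZ") + nonascii
local word = C(letter^1)

-- "\195\166" is the UTF-8 encoding of U+00E6 (æ).
print(lpeg.match(word, "\195\166"))            --> æ
-- "\195\175" is U+00EF (ï); the match stops at the space.
print(lpeg.match(word, "na\195\175ve caf\195\169"))  --> naïve
```

This is often good enough for identifiers or words in a parser where the non-ASCII input is known to be text, but it is no substitute for a real character-class lookup.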
Only in Lua would this be considered bloated:
sunk-cost:slnunicode-1.1a nop$ size slnudata.o
__TEXT  __DATA  __OBJC  others  dec    hex
0       14012   0       0       14012  36bc
No, it does not provide enough to write a bidi renderer, but it does characterize each code point as one of 30 classes--and includes toupper/tolower/totitlecase.
There's still the grapheme problem for å vs å; hopefully you can't tell the second is "a".."␣̊". 
How should lpeg match the version with a separate combining mark against character classes?
(No idea whether Mail.app is going to normalize this on the way out, so if there's only one code point in "å", I apologize.)
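
One possible answer to the combining-mark question, as a rough sketch rather than a recommendation: let a "letter" be a base character followed by any number of combining marks, so the decomposed form is consumed as a single unit. Everything here is an assumption made for illustration. The names `combining`, `base` and `grapheme` are hypothetical, only the Combining Diacritical Marks block (U+0300..U+036F) is covered, and the base class is ASCII letters plus the one precomposed "å". A complete solution would need Unicode tables or normalization before matching.

```lua
local lpeg = require 'lpeg'
local P, R, C = lpeg.P, lpeg.R, lpeg.C

-- U+0300..U+036F (Combining Diacritical Marks) in UTF-8:
--   U+0300..U+033F  ->  0xCC 0x80..0xBF
--   U+0340..U+036F  ->  0xCD 0x80..0xAF
-- ASSUMPTION: only this one block is treated as "combining".
local combining = P("\204") * R("\128\191")
                + P("\205") * R("\128\175")

-- ASSUMPTION: base letters are ASCII plus precomposed U+00E5 (å),
-- which encodes as 0xC3 0xA5.
local base = R("az", "AZ") + P("\195\165")

-- A base letter with its trailing combining marks is one unit.
local grapheme = base * combining^0

local precomposed = "\195\165"   -- å as a single code point
local decomposed  = "a\204\138"  -- "a" followed by U+030A COMBINING RING ABOVE

-- Both spellings are consumed whole by the same pattern.
print(lpeg.match(C(grapheme), precomposed) == precomposed)  --> true
print(lpeg.match(C(grapheme), decomposed) == decomposed)    --> true
```

Note that this only makes both forms match the same class. It does not make them compare equal, which is the job of normalization.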