Re: Matching multibyte alphabetical characters with LPeG


Subject: Re: Matching multibyte alphabetical characters with LPeG
From: Jay Carlson <nop@...>
Date: Sun, 17 Jun 2012 15:52:49 -0400

On Jun 17, 2012, at 9:38 AM, Miles Bader wrote:

> Hinrik Örn Sigurðsson <hinrik.sig@gmail.com> writes:
>> I've been making a parser with LPeG and I've run into the issue that I can't
>> match non-ASCII words even though I'm using a utf8 locale. It seems that
>> "alpha" (and "alnum", etc) from lpeg.locale() don't match anything beyond ASCII.
>> See the following code:
>> 
>>    local lpeg = require 'lpeg'
>>    local locale = lpeg.locale()
>>    print(lpeg.match(lpeg.C(lpeg.P("æ")), "æ"))    --> æ
>>    print(lpeg.match(lpeg.C(locale.alpha), "æ"))   --> nil
>> 
>> Is there an easy way to match non-ASCII alphabetical characters with LPeG?
> 
> No -- it's not so hard to parse utf-8 characters, but testing a
> property like "alphabetic" requires unicode tables, which are a huge
> and bloated dependency.

Only in Lua would this be considered bloated:

sunk-cost:slnunicode-1.1a nop$ size slnudata.o
__TEXT	__DATA	__OBJC	others	dec	hex
0	14012	0	0	14012	36bc

No, it does not provide enough to write a bidi renderer, but it does characterize each code point as one of 30 classes--and includes toupper/tolower/totitlecase.

http://files.luaforge.net/releases/sln/slnunicode
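
For illustration only, here is a minimal LPeG sketch of the split discussed above: recognizing UTF-8 sequences takes a few byte ranges, while the "alphabetic" test needs a table. The `is_alpha` set is a hypothetical stand-in holding a handful of letters; a real one would be generated from Unicode data of the kind slnunicode ships. The names `utf8char`, `alpha` and `word` are made up for this example.

```lua
local lpeg = require "lpeg"
local R, C, Cmt = lpeg.R, lpeg.C, lpeg.Cmt

-- One UTF-8 encoded code point (loose: does not reject every
-- overlong or surrogate form).
local cont = R("\128\191")
local utf8char = R("\000\127")
               + R("\194\223") * cont
               + R("\224\239") * cont * cont
               + R("\240\244") * cont * cont * cont

-- Hypothetical stand-in for a real Unicode property table.
local is_alpha = { ["\195\166"] = true, ["\195\150"] = true }  -- æ, Ö
for b = 65, 90 do
  is_alpha[string.char(b)] = true       -- A..Z
  is_alpha[string.char(b + 32)] = true  -- a..z
end

-- Match-time capture: accept the code point only if the table says so.
local alpha = Cmt(C(utf8char), function(_, _, ch)
  return is_alpha[ch] == true
end)
local word = C(alpha^1)

print(lpeg.match(word, "\195\166ble 1"))  -- æble
```

The table lookup is the part that grows with Unicode coverage; the pattern itself stays the same size.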

There's still the grapheme problem for å vs å; hopefully you can't tell the second is "a".."␣̊". [1] 
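
To make the difference concrete regardless of what a mail client does in transit, this sketch builds both forms from explicit byte escapes and dumps their bytes. The helper `bytes` is invented for the example.

```lua
-- Precomposed U+00E5 versus "a" followed by U+030A COMBINING RING ABOVE.
local precomposed = "\195\165"
local decomposed  = "a\204\138"

-- Hypothetical helper: render a string as space-separated hex bytes.
local function bytes(s)
  return (s:gsub(".", function(c)
    return string.format("%02X ", c:byte())
  end))
end

print(bytes(precomposed))         -- C3 A5
print(bytes(decomposed))          -- 61 CC 8A
print(precomposed == decomposed)  -- false
```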

How should lpeg match the version with a separate combining mark against character classes?
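
One possible answer, offered only as a sketch: treat a base letter plus any trailing combining marks as a single unit. The pattern below covers only U+0300 to U+036F and uses ASCII letters as the base class; both are simplifying assumptions for the example, not something lpeg.locale() does.

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

-- UTF-8 encodings of U+0300..U+036F (Combining Diacritical Marks only).
local combining = P("\204") * R("\128\191") + P("\205") * R("\128\175")

-- Assumption: ASCII letters stand in for a full "alphabetic" base class.
local base = R("az", "AZ")
local letter = base * combining^0

-- "a" + U+030A is consumed as one letter; matching stops at "!".
print(lpeg.match(C(letter^1), "a\204\138ngstrom!"))
```

This sidesteps normalization rather than solving it: the precomposed and decomposed spellings both match, but they still compare as different strings.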

Jay

[1]: No idea whether Mail.app is going to normalize this on the way out, so if there's only one code point in "å" I apologize.
