On Jun 17, 2012, at 9:38 AM, Miles Bader wrote:

> Hinrik Örn Sigurðsson <hinrik.sig@gmail.com> writes:
>> I've been making a parser with LPeG and I've run into the issue that I can't
>> match non-ASCII words even though I'm using a utf8 locale. It seems that
>> "alpha" (and "alnum", etc) from lpeg.locale() don't match anything beyond ASCII.
>> See the following code:
>> 
>>    local lpeg = require 'lpeg'
>>    local locale = lpeg.locale()
>>    print(lpeg.match(lpeg.C(lpeg.P("æ")), "æ"))    --> æ
>>    print(lpeg.match(lpeg.C(locale.alpha), "æ"))   --> nil
>> 
>> Is there an easy way to match non-ASCII alphabetical characters with LPeG?
> 
> No -- it's not so hard to parse utf-8 characters, but testing a
> property like "alphabetic" requires unicode tables, which are a huge
> and bloated dependency.

Only in Lua would this be considered bloated:

sunk-cost:slnunicode-1.1a nop$ size slnudata.o
__TEXT	__DATA	__OBJC	others	dec	hex
0	14012	0	0	14012	36bc

No, it does not provide enough to write a bidi renderer, but it does characterize each code point as one of 30 classes--and includes toupper/tolower/totitlecase.

http://files.luaforge.net/releases/sln/slnunicode

There's still the grapheme problem for å vs å; hopefully you can't tell the second is "a".."␣̊". [1] 

How should lpeg match the version that uses a separate combining mark against character classes?

Jay

[1]: No idea whether Mail.app is going to normalize this on the way out, so if there's only one code point in "å" I apologize.
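
To check how many code points a given "å" actually contains, a small pure-Lua decoder is enough. This is only a sketch: the helper names (codepoints, is_combining, count_clusters) are made up for illustration, the input is assumed to be well-formed UTF-8, and only the Combining Diacritical Marks block (U+0300..U+036F) is treated as combining, which is far from the full Unicode definition.

```lua
-- Decode a UTF-8 string into a list of code points.
-- Assumes well-formed input; no validation is done.
local function codepoints(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local len, cp
    if b < 0x80 then len, cp = 1, b            -- ASCII
    elseif b < 0xE0 then len, cp = 2, b - 0xC0 -- 2-byte sequence
    elseif b < 0xF0 then len, cp = 3, b - 0xE0 -- 3-byte sequence
    else len, cp = 4, b - 0xF0 end             -- 4-byte sequence
    -- Fold in the low 6 bits of each continuation byte.
    for j = i + 1, i + len - 1 do
      cp = cp * 64 + (s:byte(j) - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return cps
end

-- Simplification: only U+0300..U+036F counts as a combining mark here.
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

-- Count base characters, ignoring the combining marks that follow them.
local function count_clusters(s)
  local n = 0
  for _, cp in ipairs(codepoints(s)) do
    if not is_combining(cp) then n = n + 1 end
  end
  return n
end

local precomposed = "\195\165"   -- U+00E5
local decomposed  = "a\204\138"  -- U+0061 followed by U+030A

print(#codepoints(precomposed), #codepoints(decomposed))        --> 1  2
print(count_clusters(precomposed), count_clusters(decomposed))  --> 1  1
```

Both strings count as one character once the combining mark is folded into its base, but a class test on the first code point alone sees U+00E5 in one case and plain "a" in the other, which is the mismatch a pattern-level match would have to deal with.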