- Subject: Re: Matching multibyte alphabetical characters with LPeG
- From: Miles Bader <miles@...>
- Date: Mon, 18 Jun 2012 08:13:30 +0900
Jay Carlson <firstname.lastname@example.org> writes:
>>> Is there an easy way to match non-ASCII alphabetical characters with LPeG?
>> No -- it's not so hard to parse utf-8 characters, but testing a
>> property like "alphabetic" requires unicode tables, which are a huge
>> and bloated dependency.
> Only in Lua would this be considered bloated:
> sunk-cost:slnunicode-1.1a nop$ size slnudata.o
> __TEXT __DATA __OBJC others dec hex
> 0 14012 0 0 14012 36bc
> No, it does not provide enough to write a bidi renderer, but it does
> characterize each code point as one of 30 classes--and includes
Hmm, slnunicode seems to use clever techniques to compress the table
(note, though, that it only supports the BMP [16-bit characters],
which is kind of a lose ... this is 2012, people!).
Still, that's functionality best left to a separate library, not
something that should be built into LPeg itself.
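To make the division of labor concrete: here's a minimal sketch of what "parsing utf-8 characters is not so hard" looks like in LPeg, with classification delegated to some external predicate. The `isalpha` argument is a stand-in for whatever the separate library would provide, not a real API:

```lua
local lpeg = require "lpeg"
local R, C, Cmt = lpeg.R, lpeg.C, lpeg.Cmt

-- The byte structure of one UTF-8-encoded code point:
local cont = R("\128\191")                         -- continuation byte 10xxxxxx
local utf8char = R("\0\127")                       -- 1 byte (ASCII)
               + R("\194\223") * cont              -- 2-byte sequence
               + R("\224\239") * cont * cont      -- 3-byte sequence
               + R("\240\244") * cont * cont * cont  -- 4-byte sequence

-- Match one code point, then ask an external classifier whether to
-- accept it.  `isalpha` is hypothetical: it takes the matched bytes
-- and returns true/false.
local function alpha_char(isalpha)
  return Cmt(C(utf8char), function(_, pos, ch)
    if isalpha(ch) then return pos end
  end)
end
```

So LPeg handles the byte-level structure, and the Unicode tables stay behind a function boundary where any library (slnunicode or otherwise) can supply them.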
> There's still the grapheme problem for å vs å; hopefully you can't
> tell the second is "a".."␣̊". 
> How should lpeg match the one with a separate combining mark version
> against character classes?
Note that it's generally only the first character in such a sequence
whose attributes really matter; combining marks are just tacked on
afterward. If someone does care about such things (e.g. they want to
split off a single "character" including its combining marks), they
can easily handle combining marks at a higher level with LPeg.
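A sketch of that higher-level handling, under a deliberately crude assumption: here only lead bytes 0xCC-0xCD (covering U+0300..U+037F, a superset of the combining diacriticals) count as combining marks. A real implementation would consult a property table instead:

```lua
local lpeg = require "lpeg"
local R = lpeg.R

-- The byte structure of one UTF-8-encoded code point:
local cont = R("\128\191")                         -- continuation byte 10xxxxxx
local utf8char = R("\0\127")                       -- 1 byte (ASCII)
               + R("\194\223") * cont              -- 2-byte sequence
               + R("\224\239") * cont * cont      -- 3-byte sequence
               + R("\240\244") * cont * cont * cont  -- 4-byte sequence

-- Crude stand-in for "combining mark" (see caveat above).
local combining = R("\204\205") * cont

-- One user-perceived "character": a base code point plus any
-- trailing combining marks.
local grapheme = (utf8char - combining) * combining^0
```

With this, both the precomposed å (U+00E5) and the decomposed "a" + U+030A form match as a single `grapheme`, which is exactly the kind of thing that doesn't need to live inside LPeg itself.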
97% of everything is grunge