Re: Matching multibyte alphabetical characters with LPeG

lua-l archive

Subject: Re: Matching multibyte alphabetical characters with LPeG
From: Miles Bader <miles@...>
Date: Mon, 18 Jun 2012 08:13:30 +0900

Jay Carlson <nop@nop.com> writes:
>>> Is there an easy way to match non-ASCII alphabetical characters with LPeG?
>> 
>> No -- it's not so hard to parse utf-8 characters, but testing a
>> property like "alphabetic" requires unicode tables, which are a huge
>> and bloated dependency.
>
> Only in Lua would this be considered bloated:
>
> sunk-cost:slnunicode-1.1a nop$ size slnudata.o
> __TEXT	__DATA	__OBJC	others	dec	hex
> 0	14012	0	0	14012	36bc
>
> No, it does not provide enough to write a bidi renderer, but it does
> characterize each code point as one of 30 classes--and includes
> toupper/tolower/totitlecase.
>
> http://files.luaforge.net/releases/sln/slnunicode

Hmm, slnunicode seems to use clever techniques to compress the table
(note, though, that it only supports the BMP [code points up to
U+FFFF], which is kind of a loss ... this is 2012, people!).

Still, it's functionality that's best left to a separate library, not
something that should be in LPEG.
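
To illustrate the split between LPeg and a separate library, here is a
minimal sketch. It assumes the character property comes from some
external predicate; `is_alpha` below is a hypothetical stand-in that only
knows ASCII and Latin-1 letters, not a real Unicode table. The `utf8char`
pattern is deliberately loose (it does not reject overlong or surrogate
encodings).

```lua
local lpeg = require "lpeg"
local R, C, Cmt = lpeg.R, lpeg.C, lpeg.Cmt

-- One UTF-8 encoded code point (loose: overlongs/surrogates not rejected).
local cont = R"\128\191"
local utf8char = R"\000\127"
               + R"\194\223" * cont
               + R"\224\239" * cont * cont
               + R"\240\244" * cont * cont * cont

-- Decode a single, already validated UTF-8 sequence into its code point.
local function decode(s)
  local b1 = s:byte(1)
  if b1 < 0x80 then return b1 end
  local cp
  if b1 < 0xE0 then cp = b1 - 0xC0
  elseif b1 < 0xF0 then cp = b1 - 0xE0
  else cp = b1 - 0xF0 end
  for i = 2, #s do
    cp = cp * 64 + (s:byte(i) - 0x80)
  end
  return cp
end

-- HYPOTHETICAL predicate: a real program would ask a Unicode library here.
local function is_alpha(cp)
  return (cp >= 0x41 and cp <= 0x5A) or (cp >= 0x61 and cp <= 0x7A)
      or (cp >= 0xC0 and cp <= 0xFF and cp ~= 0xD7 and cp ~= 0xF7)
end

-- Match-time capture: accept the character only if the predicate agrees.
local alpha = Cmt(C(utf8char), function(subject, pos, ch)
  return is_alpha(decode(ch))
end)

local word = C(alpha^1)
```

With this arrangement LPeg only deals with byte sequences, and swapping in
full tables means replacing `is_alpha` and nothing else.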

> There's still the grapheme problem for å vs å; hopefully you can't
> tell the second is "a".."␣̊". [1]
>
> How should lpeg match the one with a separate combining mark version
> against character classes?

Note that it's generally only the first character in such a sequence
whose attributes really matter; combining marks are just sort of
tacked on.  If someone cares about such things (e.g. they care about
splitting off a single "character", including combining marks), they
can easily handle combining marks at a higher level with LPEG.
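
A rough sketch of what "at a higher level" could look like: treat a base
character followed by any number of combining marks as one unit. This
assumes that only the Combining Diacritical Marks block (U+0300 to U+036F)
counts as combining, which is a simplification; the pattern names are
made up for the example, and the UTF-8 pattern is again loose.

```lua
local lpeg = require "lpeg"
local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

-- One UTF-8 encoded code point (loose: overlongs/surrogates not rejected).
local cont = R"\128\191"
local utf8char = R"\000\127"
               + R"\194\223" * cont
               + R"\224\239" * cont * cont
               + R"\240\244" * cont * cont * cont

-- ASSUMPTION: only U+0300..U+036F are combining marks.
-- In UTF-8 that range is CC 80..CC BF and CD 80..CD AF.
local combining = P"\204" * R"\128\191" + P"\205" * R"\128\175"

-- A base character is any code point that is not itself a combining mark.
local base = utf8char - combining

-- One "character" as a user sees it: base plus trailing combining marks.
local cluster = C(base * combining^0)

-- Split a whole string into such clusters; fails (returns nil) on
-- malformed input or on a string that starts with a combining mark.
local split = Ct(cluster^0) * -1
```

Attribute tests then only need to look at the first code point of each
cluster, e.g. `split:match(s)` on precomposed "å", then "a" plus U+030A,
then "b" yields three clusters.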

-miles

-- 
97% of everything is grunge
