On Fri, Jun 12, 2009 at 07:17:12PM +0900, Miles Bader wrote:
> Read "utf-8 char" where I wrote "unicode char".
> I don't know what Roberto wants to do, but I'm certainly not advocating
> some super thick layer that tries to completely hide the encoding.
> I'm simply suggesting some very simple helper patterns/functions for
> people that might want to build patterns that deal with utf-8 encoded
> strings.  There seems little point to me in trying to cope with obscure
> things like decomposed characters.
> Really I think that maybe (1) a predefined pattern that matches a utf-8
> encoded unicode code-point, and (2) maybe some helper functions that do
> things like translate utf-8 character class specifications into
> primitive lpeg patterns, would be enough for the vast majority of users.
Clearly (1) is trivial.
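
To illustrate why (1) is simple, here is a minimal sketch using a plain Lua
string pattern rather than lpeg. The names utf8_char and utf8_chars are
made up for this example. The pattern only looks at byte ranges (a lead
byte followed by any continuation bytes), so it does not reject overlong
or otherwise malformed sequences, and NUL is left out to stay compatible
with Lua 5.1 patterns. The same byte ranges could presumably be written
with lpeg.R.

```lua
-- Sketch: a Lua string pattern matching one UTF-8 encoded code point.
-- Lead byte is ASCII (1-127) or a multi-byte lead (194-244), followed
-- by zero or more continuation bytes (128-191). No validation is done.
local utf8_char = "[\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split a string into its UTF-8 encoded code points.
local function utf8_chars(s)
  local out = {}
  for ch in string.gmatch(s, utf8_char) do
    out[#out + 1] = ch
  end
  return out
end

-- "a", U+00E9 (e with acute), U+20AC (euro sign), written as byte escapes
local chars = utf8_chars("a\195\169\226\130\172")
print(#chars)
```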

For character classes you need the tables, which are readily
available, e.g. in slnunicode, for around 13k.
With these you also get grapheme detection for free:
basically you keep adding code points to the match as long as
their class is combining diacritical mark.
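
The grapheme rule just described can be sketched roughly as follows. This
is an illustration only: is_combining is a stand-in that checks just the
Combining Diacritical Marks block (U+0300 to U+036F), where a real
implementation would look the class up in proper tables such as those in
slnunicode. All function names are hypothetical, and the decoder assumes
well-formed UTF-8 input.

```lua
-- One UTF-8 encoded code point (byte ranges only, no validation).
local utf8_char = "[\1-\127\194-\244][\128-\191]*"

-- Number of values the lead byte's payload can take, by sequence length.
local lead_mod = { [2] = 32, [3] = 16, [4] = 8 }

-- Decode one well-formed UTF-8 sequence to its code point.
local function codepoint(ch)
  local b1 = string.byte(ch, 1)
  if b1 < 128 then return b1 end
  local cp = b1 % lead_mod[#ch]           -- strip the length marker bits
  for i = 2, #ch do
    cp = cp * 64 + string.byte(ch, i) % 64 -- append 6 payload bits
  end
  return cp
end

-- ASSUMPTION: stand-in for a real character class table lookup.
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

-- Group each base code point with the combining marks that follow it.
local function graphemes(s)
  local out = {}
  for ch in string.gmatch(s, utf8_char) do
    if #out > 0 and is_combining(codepoint(ch)) then
      out[#out] = out[#out] .. ch
    else
      out[#out + 1] = ch
    end
  end
  return out
end

-- "e" followed by U+0301 (combining acute accent), then "a"
local g = graphemes("e\204\129a")
print(#g)
```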

And I don't think that's obscure: not all characters have a
precomposed form and, more importantly, depending on the source you
might only get the decomposed form and should be able to deal with it.
I reckon that, au contraire, for matching you almost always want
grapheme sequences.

Among the harder problems with Unicode are converting to upper-/
lowercase (some locale dependencies, especially the Turkish i),
normalization (needs some more tables for precomposed forms)
and sorting (there are tons of complicated sort orders around).
But I guess almost all of matching is on the easy end.