David Given writes:
>> It seems there needs to be a clear distinction between "raw char" (given
>> that lpeg is quite usable for binary data) and "unicode char".
> The problem is that Unicode doesn't really have any such concept as a
> 'character', which means that traditional string handling methods
> basically don't work with it (even if you ignore UTF-8 encoding). A
> single displayable thing can actually be made up of several Unicode code
> points, and may even have several different (but technically equivalent)
> representations.

Read "utf-8 char" where I wrote "unicode char".

I don't know what Roberto wants to do, but I'm certainly not advocating
some super thick layer that tries to completely hide the encoding.

I'm simply suggesting some very simple helper patterns/functions for
people that might want to build patterns that deal with utf-8 encoded
strings.  There seems little point to me in trying to cope with obscure
things like decomposed characters.

Really I think that maybe (1) a predefined pattern that matches a utf-8
encoded unicode code-point, and (2) maybe some helper functions that do
things like translate utf-8 character class specifications into
primitive lpeg patterns, would be enough for the vast majority of users.

E.g., have lpeg.U8P(...) and lpeg.U8S(...), which do pretty much the
simplest mechanical translation.
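
As a rough sketch of what (1) and (2) might look like in plain lpeg:
the names utf8_char and U8S below are local, made-up helpers, not
existing lpeg functions, and the code-point pattern is deliberately
loose (it checks lead/continuation byte ranges only, so it does not
reject overlong forms or surrogates).

```lua
local lpeg = require "lpeg"
local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

-- (1) One utf-8 encoded code point: a lead byte followed by the
-- right number of continuation bytes (0x80-0xBF).
local cont = R("\128\191")
local utf8_char = R("\0\127")                   -- 1 byte  (ASCII)
                + R("\194\223") * cont          -- 2 bytes
                + R("\224\239") * cont * cont   -- 3 bytes
                + R("\240\244") * cont * cont * cont  -- 4 bytes

-- (2) Hypothetical U8S: like lpeg.S, but the set is split into
-- utf-8 code points instead of bytes, then turned into an ordered
-- choice of literal patterns. Splitting stops at the first byte
-- sequence that utf8_char does not accept.
local function U8S(set)
  local chars = lpeg.match(Ct(C(utf8_char)^0), set)
  local pat = P(false)            -- always fails; identity for '+'
  for _, ch in ipairs(chars) do
    pat = pat + P(ch)
  end
  return pat
end

-- Count code points in "héllo" (é written as bytes 195 169).
local chars = lpeg.match(Ct(C(utf8_char)^0), "h\195\169llo")
print(#chars)                               --> 5

-- Match either a-umlaut or o-umlaut as a whole code point.
local umlauts = U8S("\195\164\195\182")
print(lpeg.match(umlauts, "\195\182x"))     --> 3
```

Note that lpeg.S("\195\164\195\182") would instead match any one of
those four bytes individually, which is why a separate helper is
needed for utf-8 sets.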

>> Making P(x) count utf8 chars would certainly be convenient for people
>> reading utf8 files, but... it doesn't seem the cleanest thing in
>> general....
> *Nothing* about Unicode is clean...

Unicode definitely has lots of sucky points, but frankly, it's one of
those "the worst solution out there, except for all the others" things.


