David Given writes:
>> It seems there needs to be a clear distinction between "raw char" (given
>> that lpeg is quite usable for binary data) and "unicode char".
> The problem is that Unicode doesn't really have any such concept as a
> 'character', which means that traditional string handling methods
> basically don't work with it (even if you ignore UTF-8 encoding). A
> single displayable thing can actually be made up of several Unicode code
> points, and may even have several different (but technically equivalent)
> representations.

Read "utf-8 char" where I wrote "unicode char".

I don't know what Roberto wants to do, but I'm certainly not advocating
some super thick layer that tries to completely hide the encoding.

I'm simply suggesting some very simple helper patterns/functions for
people that might want to build patterns that deal with utf-8 encoded
strings.  There seems little point to me in trying to cope with obscure
things like decomposed characters.

Really, I think that (1) a predefined pattern that matches a utf-8
encoded unicode code-point, and (2) some helper functions that do
things like translate utf-8 character class specifications into
primitive lpeg patterns, would be enough for the vast majority of users.

E.g., have lpeg.U8P(...) and lpeg.U8S(...), which do pretty much the
simplest mechanical translation.
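
To make that concrete, here is a rough sketch of what such helpers might
look like. It assumes LPeg is installed and loadable as "lpeg". The names
u8char, U8P and U8S are hypothetical local helpers written for this
example; they are not functions that LPeg provides. The code-point pattern
is only the simple structural form and makes no attempt at strict
validation.

```lua
-- Assumption: the LPeg library is available as the module "lpeg".
local lpeg = require("lpeg")
local P, R = lpeg.P, lpeg.R

-- Continuation byte: 10xxxxxx (0x80-0xBF).
local cont = R("\128\191")

-- One UTF-8 encoded code point, selected by its lead byte.
-- Structural check only: some overlong forms, surrogates and values
-- above U+10FFFF are not rejected.
local u8char = R("\000\127")                       -- 1 byte  (0x00-0x7F)
             + R("\194\223") * cont                -- 2 bytes (lead 0xC2-0xDF)
             + R("\224\239") * cont * cont         -- 3 bytes (lead 0xE0-0xEF)
             + R("\240\244") * cont * cont * cont  -- 4 bytes (lead 0xF0-0xF4)

-- Hypothetical U8P: like lpeg.P, except that a non-negative number n
-- means "exactly n UTF-8 code points" instead of "exactly n bytes".
-- Any non-numeric argument is passed to lpeg.P unchanged, since a UTF-8
-- literal is already just a byte string.
local function U8P(v)
  if type(v) ~= "number" then return P(v) end
  assert(v >= 0, "this sketch only handles non-negative counts")
  local pat = P(true)                -- matches the empty string
  for _ = 1, v do pat = pat * u8char end
  return pat
end

-- Hypothetical U8S: like lpeg.S, except that each member of the set is
-- a whole UTF-8 code point. The set string is split with u8char and
-- turned into an ordered choice of literals.
local function U8S(set)
  local pat = P(false)               -- matches nothing
  local pos = 1
  while pos <= #set do
    local stop = lpeg.match(u8char, set, pos)  -- index just past one code point
    assert(stop, "invalid UTF-8 in set at byte " .. pos)
    pat = pat + P(set:sub(pos, stop - 1))
    pos = stop
  end
  return pat
end

-- "\195\169" is the UTF-8 encoding of U+00E9 (e with acute accent).
local subject = "\195\169t\195\169"
print(lpeg.match(U8S("a\195\169"), subject))  --> 3
print(lpeg.match(U8P(2), subject))            --> 4
print(lpeg.match(P(2), subject))              --> 3
```

The last two lines show the difference in counting: U8P(2) consumes two
code points (three bytes here), while plain P(2) consumes two bytes.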

>> Making P(x) count utf8 chars would certainly be convenient for people
>> reading utf8 files, but... it doesn't seem the cleanest thing in
>> general....
> *Nothing* about Unicode is clean...

Unicode definitely has lots of sucky points, but frankly, it's one of
those "the worst solution out there, except for all the others" things.


In New York, most people don't have cars, so if you want to kill a person, you
have to take the subway to their house.  And sometimes on the way, the train
is delayed and you get impatient, so you have to kill someone on the subway.
  [George Carlin]