On 01/04/11 17:37, Marc Balmer wrote:
[...]
> On the C level, quite a lot.  strlen() and friends can no longer be
> used, printf format strings like "%20s" don't work anymore etc.  Not to
> speak about string comparison, collation etc.  Since I am not familiar
> with LPeg's implementation, that is about all I can say.

Determining the length of a Unicode string is a pretty fuzzy concept
anyway --- AFAIK the only way to do it is to break it up into grapheme
clusters and determine the size of each grapheme cluster individually
(which may vary according to font).
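
To make the gap between the different notions of "length" concrete, here
is a minimal sketch in C. The name count_code_points is hypothetical and
not taken from any library. It assumes well-formed UTF-8 and counts only
code points; it says nothing about grapheme clusters or display width.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string, assuming well-formed
 * UTF-8. Continuation bytes have the bit pattern 10xxxxxx; every other
 * byte starts a new code point. (Hypothetical helper, illustration only.) */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "e\xCC\x81" (an "e" followed by U+0301, the combining
acute accent), strlen() reports 3 bytes and count_code_points() reports
2 code points, yet it would normally be displayed as one grapheme
cluster.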

I tend to use a cheap and nasty mechanism for console applications that
assumes that each code point is a grapheme cluster, and then uses a set
of rules to decide whether each one is of width 1 or 2. This works most
of the time but not all of the time. See:

http://wordgrinder.hg.sourceforge.net/hgweb/wordgrinder/wordgrinder/file/f658d1e8f1f3/src/c/emu/wcwidth.c
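
The general shape of such a mechanism can be sketched as follows. This
is not the code at the link above: the function names are hypothetical,
the decoder assumes well-formed UTF-8, and the width rules are a
deliberately incomplete handful of ranges that are commonly treated as
wide.

```c
#include <stdint.h>

/* Decode one code point from (assumed well-formed) UTF-8 at *p and
 * advance *p past it. No real validation is done; the loop merely stops
 * early if it meets a byte that is not a continuation byte. */
static uint32_t decode_utf8(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }

    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

    *p = s;
    return cp;
}

/* Approximate rule set: a few East Asian ranges count as width 2,
 * everything else as width 1. Combining marks and control characters
 * are not special-cased, which is one place where this goes wrong. */
static int code_point_width(uint32_t cp)
{
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo initials */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||   /* roughly CJK radicals to Yi */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||   /* CJK compatibility ideographs */
        (cp >= 0xFF00 && cp <= 0xFF60))     /* fullwidth forms */
        return 2;
    return 1;
}

/* Estimated number of console columns for a NUL-terminated string,
 * treating every code point as its own grapheme cluster. */
int utf8_columns(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    int cols = 0;

    while (*p != '\0')
        cols += code_point_width(decode_utf8(&p));
    return cols;
}
```

With these rules, two CJK ideographs come out as 4 columns, while
"e\xCC\x81" ("e" plus a combining acute accent) comes out as 2 even
though it is normally displayed in one column, which is the kind of case
where the one-code-point-per-cluster assumption breaks down.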

What I'd like from LPeg is a set of primitives for matching a single
code point and a single grapheme cluster (treating them as Lua strings,
i.e. sequences of bytes). This would make parsing UTF-8 strings easier.
The collation stuff might be useful, but it's hideously complicated and
involves massive tables, and I've never actually found a need for it, so
I'd be willing to live without it.
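
As a rough illustration of what a single-code-point primitive might do
at the C level, here is a sketch that works on a byte buffer plus a
length, the way Lua strings are passed around. It is not part of LPeg;
the name match_code_point is hypothetical. A grapheme cluster matcher
would additionally need Unicode segmentation data and is not attempted
here.

```c
#include <stddef.h>

/* Return the byte length (1..4) of the UTF-8 encoded code point at the
 * start of buf[0..len), or 0 if the bytes there do not form a valid
 * encoding. (Hypothetical helper, illustration only.) */
size_t match_code_point(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *)buf;
    size_t need, i;

    if (len == 0)
        return 0;

    if (s[0] < 0x80)
        return 1;                            /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        need = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        need = 4;
    else
        return 0;                            /* stray continuation or bad lead */

    if (len < need)
        return 0;                            /* truncated sequence */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not a continuation byte */
    }

    /* Reject overlong forms, UTF-16 surrogates and values above U+10FFFF. */
    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;
    if (s[0] == 0xED && s[1] > 0x9F) return 0;
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;
    if (s[0] == 0xF4 && s[1] > 0x8F) return 0;

    return need;
}
```

A parser could call this repeatedly, advancing by the returned length
and treating a return value of 0 as a failed match.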

-- 
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup
