- Subject: Re: LPeg support for utf-8
- From: Tony Finch <dot@...>
- Date: Wed, 6 Apr 2011 10:52:02 +0100
Thomas Harning Jr. <harningt@gmail.com> wrote:
>
> As for suggestions on implementation... I would recommend the following:
> uP (codepoint) - receive codepoint as an integer and return the
> appropriate literal match
This is a bit confusing since it clashes with the existing lpeg.P
function. Also it has exactly the same semantics as your suggested
lpeg.uS.
> uR (start_codepoint1, end_codepoint1, (start_codepointx,
> end_codepointx)*) - receive pairs of codepoints as integers and create
> an optimal range match
> uR (table of codepoint pairs) - effectively the same as uR with the
> table unpacked... not sure if this would be important enough
> uS (codepoint_1, (codepoint_x)*) - returns a set match on each of the
> listed codepoints
> uS (table of codepoints) - returns a set match on the list of
> codepoints in the table
All of these should take UTF-8 strings as arguments from which lpeg should
extract a list of codepoints or codepoint pairs, as in the existing lpeg.R
and lpeg.S functions.
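To make the "extract a list of codepoints or codepoint pairs" step concrete, here is a minimal sketch in plain Lua. The names codepoints and codepoint_pairs are hypothetical helpers invented for illustration and are not part of LPeg. The sketch assumes well-formed UTF-8 input and does no validation.

```lua
-- Hypothetical helper: decode a UTF-8 string into a list of code points.
-- Assumes the input is well-formed UTF-8; continuation bytes are not checked.
local function codepoints(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local len, cp
    if b < 0x80 then
      len, cp = 1, b            -- 0xxxxxxx: ASCII
    elseif b < 0xE0 then
      len, cp = 2, b % 0x20     -- 110xxxxx: two-byte sequence
    elseif b < 0xF0 then
      len, cp = 3, b % 0x10     -- 1110xxxx: three-byte sequence
    else
      len, cp = 4, b % 0x08     -- 11110xxx: four-byte sequence
    end
    -- Fold in the low six bits of each continuation byte.
    for j = i + 1, i + len - 1 do
      cp = cp * 0x40 + s:byte(j) % 0x40
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return cps
end

-- Hypothetical helper: group the code points of a string into {low, high}
-- pairs, which is what a uR-style function would need for its ranges.
local function codepoint_pairs(s)
  local cps, result = codepoints(s), {}
  assert(#cps % 2 == 0, "expected an even number of code points")
  for i = 1, #cps, 2 do
    result[#result + 1] = { cps[i], cps[i + 1] }
  end
  return result
end
```

A set-style function would use the flat list from codepoints directly, while a range-style function would use the pairs, mirroring how lpeg.S and lpeg.R read their string arguments byte by byte.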
> A utility to return the utf-8 representation of a given codepoint
> would also be useful in testing (both LPeg itself and relying
> applications) to avoid having to roll your own tool to encode utf-8.
Actually if you have this function then the expanded versions of the lpeg
functions that you suggested would be redundant. You could just write
lpeg.uS(toutf8(8,10,13,32,0xA0,0x202F,0x205F,0x3000,0xFEFF)) + lpeg.uR(toutf8(0x2000,0x200b))
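The toutf8 function is named above but not defined anywhere in this message. One possible definition, as a sketch only, is below. It is written in plain Lua without bit operations so that it should also run on Lua 5.1. It assumes every argument is a valid code point and does not reject surrogates or values above 0x10FFFF. Note that lpeg.uS and lpeg.uR are proposals from this thread, not existing LPeg functions, so the sketch does not call them.

```lua
-- Hypothetical helper: encode one or more Unicode code points as a single
-- UTF-8 string. No validation: surrogates and values above 0x10FFFF are
-- not rejected.
local function toutf8(...)
  local out = {}
  for i = 1, select("#", ...) do
    local cp = select(i, ...)
    if cp < 0x80 then
      -- One byte: 0xxxxxxx
      out[#out + 1] = string.char(cp)
    elseif cp < 0x800 then
      -- Two bytes: 110xxxxx 10xxxxxx
      out[#out + 1] = string.char(
        0xC0 + math.floor(cp / 0x40),
        0x80 + cp % 0x40)
    elseif cp < 0x10000 then
      -- Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out[#out + 1] = string.char(
        0xE0 + math.floor(cp / 0x1000),
        0x80 + math.floor(cp / 0x40) % 0x40,
        0x80 + cp % 0x40)
    else
      -- Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      out[#out + 1] = string.char(
        0xF0 + math.floor(cp / 0x40000),
        0x80 + math.floor(cp / 0x1000) % 0x40,
        0x80 + math.floor(cp / 0x40) % 0x40,
        0x80 + cp % 0x40)
    end
  end
  return table.concat(out)
end

-- The whitespace set and range from the example above, as UTF-8 strings.
local set_chars = toutf8(8, 10, 13, 32, 0xA0, 0x202F, 0x205F, 0x3000, 0xFEFF)
local range_chars = toutf8(0x2000, 0x200B)
```

Because the result is an ordinary Lua string, the same helper would also be usable for building test input, which is the use the original suggestion had in mind.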
Tony.
--
f.anthony.n.finch <dot@dotat.at> http://dotat.at/
German Bight, Humber, Thames, Dover: Southwest 5 or 6, increasing 7 for a time
in German Bight, veering west 4 or 5 later. Slight or moderate, occasionally
rough in German Bight. Fog patches. Moderate or good, occasionally very poor.