Re: LPeg support for utf-8

Subject: Re: LPeg support for utf-8
From: Tony Finch <dot@...>
Date: Wed, 6 Apr 2011 10:52:02 +0100

Thomas Harning Jr. <harningt@gmail.com> wrote:
>
> As for suggestions on implementation... I would recommend the following:
>  uP (codepoint) - receive codepoint as an integer and return the
> appropriate literal match

This is a bit confusing since it clashes with the existing lpeg.P
function. Also it has exactly the same semantics as your suggested
lpeg.uS.

>  uR (start_codepoint1, end_codepoint1, (start_codepointx,
> end_codepointx)*) - receive pairs of codepoints as integers and create
> an optimal range match
>  uR (table of codepoint pairs) - effectively the same as uR with the
> table unpacked... not sure if this would be important enough
>  uS (codepoint_1, (codepoint_x)*) - returns a set match on each of the
> listed codepoints
>  uS (table of codepoints) - returns a set match on the list of
> codepoints in the table

All of these should take UTF-8 strings as arguments from which lpeg should
extract a list of codepoints or codepoint pairs, as in the existing lpeg.R
and lpeg.S functions.
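
To illustrate the extraction step, here is a minimal sketch in plain Lua of
turning a UTF-8 string into a list of codepoints. The name codepoints is
hypothetical, not part of LPeg, and the sketch assumes well-formed input: it
rejects bad lead and continuation bytes but does not check for overlong
encodings or surrogates.

```lua
-- Hypothetical helper (not an LPeg function): decode a UTF-8 string
-- into an array of integer codepoints.
local function codepoints(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, extra
    -- The lead byte determines how many continuation bytes follow.
    if b < 0x80 then
      cp, extra = b, 0
    elseif b >= 0xC0 and b < 0xE0 then
      cp, extra = b - 0xC0, 1
    elseif b >= 0xE0 and b < 0xF0 then
      cp, extra = b - 0xE0, 2
    elseif b >= 0xF0 and b < 0xF8 then
      cp, extra = b - 0xF0, 3
    else
      error("invalid UTF-8 lead byte at position " .. i)
    end
    -- Each continuation byte contributes its low six bits.
    for j = i + 1, i + extra do
      local c = s:byte(j)
      if not c or c < 0x80 or c >= 0xC0 then
        error("invalid UTF-8 continuation byte at position " .. j)
      end
      cp = cp * 0x40 + (c - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + extra + 1
  end
  return cps
end
```

A set-style function would use every codepoint in the resulting list, while a
range-style function would read the list as consecutive (start, end) pairs.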

> A utility to return the utf-8 representation of a given codepoint
> would also be useful in testing (both LPeg itself and relying
> applications) to avoid having to roll your own tool to encode utf-8.

Actually, if you have this function, then the expanded versions of the lpeg
functions that you suggested would be redundant. You could just write

  lpeg.uS(toutf8(8,10,13,32,0xA0,0x202F,0x205F,0x3000,0xFEFF))
    + lpeg.uR(toutf8(0x2000,0x200b))
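
For reference, a toutf8 along these lines could be sketched in plain Lua as
follows. This is an assumption about how such a utility might look, not an
existing LPeg or Lua function; it uses only arithmetic so that it runs without
bitwise operators, and it does not validate its arguments (for example, it
does not reject surrogates or values above 0x10FFFF).

```lua
-- Hypothetical utility: encode each codepoint argument as UTF-8 and
-- return the concatenation of the encodings as a single string.
local function toutf8(...)
  local out = {}
  for i = 1, select("#", ...) do
    local cp = select(i, ...)
    if cp < 0x80 then
      -- One byte: 0xxxxxxx
      out[#out + 1] = string.char(cp)
    elseif cp < 0x800 then
      -- Two bytes: 110xxxxx 10xxxxxx
      out[#out + 1] = string.char(0xC0 + math.floor(cp / 0x40),
                                  0x80 + cp % 0x40)
    elseif cp < 0x10000 then
      -- Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out[#out + 1] = string.char(0xE0 + math.floor(cp / 0x1000),
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    else
      -- Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      out[#out + 1] = string.char(0xF0 + math.floor(cp / 0x40000),
                                  0x80 + math.floor(cp / 0x1000) % 0x40,
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    end
  end
  return table.concat(out)
end
```

With this, toutf8(0x2000,0x200b) yields a two-character UTF-8 string, the same
shape of argument that lpeg.R takes for a single-byte range such as "az".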

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
German Bight, Humber, Thames, Dover: Southwest 5 or 6, increasing 7 for a time
in German Bight, veering west 4 or 5 later. Slight or moderate, occasionally
rough in German Bight. Fog patches. Moderate or good, occasionally very poor.
