Re: LPeg support for utf-8

lua-l archive

Subject: Re: LPeg support for utf-8
From: "Thomas Harning Jr." <harningt@...>
Date: Tue, 5 Apr 2011 23:03:23 -0400

On Fri, Apr 1, 2011 at 2:28 PM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>> Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>>
>> > A quick survey, for those who care:
>> > - should LPeg support utf-8?
>> > - If so, what would that mean?
>>
>> An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
>> instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
>> with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
>> Perhaps there should be a .uB as well. I would prefer this to a "unicode
>> mode" which changes the behaviour of the existing functions.
>
> This is more or less what I had in mind (specific names
> notwithstanding). But still remains the question of whether each of these
> constructions (uS, uR, etc.) is really useful and whether there should
> be others. For instance, would it be worth to support something like
> properties (using wctype)? Or a capture that matches one code point
> and captures its value?

I think uP, uS and uR would be very valuable. Right now in LuaJSON,
I have a set of hand-prepared UTF-8 byte sequences that I use to build
an optimized pattern for runs of whitespace characters, which are
scattered across the Unicode table.  uS and uR, while complicated to
implement, could generate an efficient pattern automatically from the
desired codepoints.

Ex:
  local lpeg = require "lpeg"
  local chr = string.char
  local P, S = lpeg.P, lpeg.S
A match for both U+0085 (bytes C2 85) and U+00A0 (bytes C2 A0) could
effectively be:
  P(chr(0xC2)) * S(chr(0x85) .. chr(0xA0))
... or whatever is the most efficient means of matching a common prefix.
The same factoring could be applied to longer shared prefixes and to
intermediate bytes.

As for cost of implementation, I haven't entirely worked out a mental
model of how one would take a set of N codepoints or a range and find
the optimal pattern... unless there is an easier mechanism to
implement directly in C.
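
One simple (not optimal) way to approach this is to group the encoded
sequences by everything except their final byte, then emit one
P(prefix) * S(tails) alternative per group. The sketch below assumes the
input is a list of already-encoded UTF-8 strings, like the hand-prepared
sequences mentioned above. The names group_by_prefix and set_pattern are
hypothetical, and no attempt is made to factor prefixes shared between
groups or to collapse contiguous tails into ranges.

```lua
-- Group already-encoded UTF-8 sequences by all bytes except the last one.
-- Returns an array of { prefix = <string>, tails = <string> }, sorted by prefix.
local function group_by_prefix(sequences)
  local tails_by_prefix, prefixes = {}, {}
  for _, seq in ipairs(sequences) do
    local prefix, tail = seq:sub(1, -2), seq:sub(-1)
    if not tails_by_prefix[prefix] then
      tails_by_prefix[prefix] = {}
      prefixes[#prefixes + 1] = prefix
    end
    table.insert(tails_by_prefix[prefix], tail)
  end
  table.sort(prefixes)
  local groups = {}
  for i, prefix in ipairs(prefixes) do
    groups[i] = { prefix = prefix, tails = table.concat(tails_by_prefix[prefix]) }
  end
  return groups
end

-- Combine the groups into a single LPeg pattern (hypothetical helper).
-- The lpeg module is passed in so the grouping above works without LPeg.
local function set_pattern(lpeg, sequences)
  local pattern = lpeg.P(false)  -- starts as a pattern that never matches
  for _, g in ipairs(group_by_prefix(sequences)) do
    pattern = pattern + lpeg.P(g.prefix) * lpeg.S(g.tails)
  end
  return pattern
end
```

For U+0085 and U+00A0 this yields a single group with prefix C2 and
tails 85 A0, which is the pattern shown in the example above. Because
UTF-8 is prefix-free, the alternatives never overlap, so their order
does not affect what is matched.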

As for suggestions on implementation... I would recommend the following:
 uP (codepoint)
   receive a codepoint as an integer and return the appropriate
   literal match
 uR (start_codepoint1, end_codepoint1, (start_codepointx, end_codepointx)*)
   receive pairs of codepoints as integers and create an optimal
   range match
 uR (table of codepoint pairs)
   effectively the same as uR with the table unpacked... not sure if
   this would be important enough
 uS (codepoint_1, (codepoint_x)*)
   returns a set match on each of the listed codepoints
 uS (table of codepoints)
   returns a set match on the list of codepoints in the table
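
The varargs and table forms listed above could share one entry point.
A minimal sketch of the argument handling only, assuming the table form
is flat (the same values as the unpacked call); collect_args and
range_pairs are hypothetical names, not part of LPeg:

```lua
-- Accept either f(a, b, c) or f({a, b, c}) and return a plain array.
local function collect_args(...)
  local first = ...
  if select("#", ...) == 1 and type(first) == "table" then
    return first
  end
  return { ... }
end

-- Turn a flat list of start/end codepoints into validated {first, last} pairs.
local function range_pairs(...)
  local flat = collect_args(...)
  assert(#flat % 2 == 0, "expected start/end pairs")
  local result = {}
  for i = 1, #flat, 2 do
    local first, last = flat[i], flat[i + 1]
    assert(first <= last, "range start must not exceed range end")
    result[#result + 1] = { first, last }
  end
  return result
end
```

With this, range_pairs(0x80, 0xBF) and range_pairs({ 0x80, 0xBF }) produce
the same single pair, so the pattern-building code behind uR would only
need to handle one input shape.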

A utility to return the utf-8 representation of a given codepoint
would also be useful in testing (both LPeg itself and relying
applications) to avoid having to roll your own tool to encode utf-8.
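
Such a utility is small. A sketch in plain Lua, using only arithmetic so
that it does not depend on a bit library; utf8_encode is a hypothetical
name, and rejecting surrogates (U+D800 to U+DFFF) is an assumption about
the desired behaviour rather than something specified above:

```lua
-- Encode one Unicode codepoint (an integer) as a UTF-8 byte string.
local function utf8_encode(cp)
  assert(type(cp) == "number" and cp % 1 == 0 and cp >= 0 and cp <= 0x10FFFF,
         "codepoint out of range")
  -- Assumption: surrogates are rejected, since UTF-8 does not encode them.
  assert(cp < 0xD800 or cp > 0xDFFF, "surrogate codepoint")
  local char, floor = string.char, math.floor
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end
```

For instance, utf8_encode(0x85) gives the bytes C2 85 and
utf8_encode(0xA0) gives C2 A0, the two sequences used in the example
earlier in this message.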
-- 
Thomas Harning Jr.

