- Subject: Re: Lua 5.4.0 beta announcement
- From: Sean Conner <sean@...>
- Date: Thu, 3 Oct 2019 23:58:07 -0400
It was thus said that the Great Gabriel Bertilson once stated:
> On Thu, Oct 3, 2019 at 9:11 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> >
> > OK then... But this is nearly OK except the charpattern which is very lax (including for the "extended" 31-bit definition where the pattern is overlong: the charpattern is only valid if you have first scanned the full text to validate its encoding, but charpattern cannot be used to scan the text correctly, but it will only correctly allow enumerating each lead byte, including invalid one, returning a sequence of arbitrary length that may not decode correctly as a single valid codepoint, or could map to a surrogate codepoint plus overlong trail bytes, and not necessarily paired with a following surrogate in the correct range: each sequence matched by this pattern is not necessarily valid as its lead byte may still be incorrect, and the sequence may still be overlong, or too short for the last sequence matched in the given text).
>
> Yeah, the pattern can't be used for validation. That would only be
> possible if Lua patterns allowed alternation.
Which is why we have LPEG. Speaking of which, I do have a few modules
that deal with this. All use LPEG.
org.conman.parsers.ascii
Matches one US-ASCII character (codes 0 to 127)
org.conman.parsers.ascii.char
Matches ASCII codes 32 to 126 (graphics set plus space)
org.conman.parsers.ascii.control
Matches the ASCII C0 control set (codes 0 to 31) plus delete
(127, which technically isn't part of the C0 set)
org.conman.parsers.ascii.ctrl
Matches the ASCII C0 set (plus DEL) and translates the
character to its name:
0 - NUL
1 - SOH ...
org.conman.parsers.utf8
Matches one (or more) UTF-8 code points greater than or equal to 128
(see org.conman.parsers.utf8.control for more information)
org.conman.parsers.utf8.char
Matches one Unicode codepoint greater than or equal to 160
to the end of the Unicode defined codepoints (that is, if I
have it defined correctly).
org.conman.parsers.utf8.control
Matches the C1 control set. This includes multicode
sequences like CSI, DCS, SOS, OSC, PM and APC. If these
don't mean anything to you, think terminal (or ANSI, even
though they technically aren't ANSI) escape codes.
org.conman.parsers.utf8.ctrl
Parses the C1 control set, returning both the name of the
sequence and any associated data.
Since these are LPEG patterns, they can be used in larger expressions, and
they are all available via LuaRocks.
You can check out the code at <https://github.com/spc476/LPeg-Parsers>
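To illustrate the point about alternation, here is a minimal sketch of a strict UTF-8 matcher written directly in LPEG. It is not the code from the modules above; the names utf8char, is_valid_utf8, charpattern_accepts and words are made up for this example, and the byte ranges follow the well-formed byte sequence table in the Unicode standard. It assumes Lua 5.3 or later (for utf8.charpattern) with the lpeg module installed.

```lua
local lpeg = require "lpeg"
local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

-- Trail byte (10xxxxxx).
local tail = R"\128\191"

-- One well-formed UTF-8 sequence.  Each alternative pins down the
-- second byte where needed to reject overlongs and surrogates.
local utf8char = R"\0\127"                                -- U+0000..U+007F
               + R"\194\223" * tail                       -- U+0080..U+07FF
               + P"\224" * R"\160\191" * tail             -- U+0800..U+0FFF
               + R"\225\236" * tail * tail                -- U+1000..U+CFFF
               + P"\237" * R"\128\159" * tail             -- U+D000..U+D7FF
               + R"\238\239" * tail * tail                -- U+E000..U+FFFF
               + P"\240" * R"\144\191" * tail * tail      -- U+10000..U+3FFFF
               + R"\241\243" * tail * tail * tail         -- U+40000..U+FFFFF
               + P"\244" * R"\128\143" * tail * tail      -- U+100000..U+10FFFF

-- The whole subject must consist of well-formed sequences.
local valid = utf8char^0 * -1

local function is_valid_utf8(s)
  return valid:match(s) ~= nil
end

-- For contrast: does utf8.charpattern accept s as a single "character"?
local function charpattern_accepts(s)
  return s:find("^" .. utf8.charpattern .. "$") ~= nil
end

-- Reuse in a larger expression: split a string on ASCII spaces,
-- failing if any word contains ill-formed UTF-8.
local word  = C((utf8char - P" ")^1)
local words = Ct(word * (P" "^1 * word)^0) * -1
```

With these definitions, is_valid_utf8 rejects an overlong sequence such as "\224\128\128" and a lone lead byte such as "\195", while charpattern_accepts says yes to both. words:match returns a table of words for well-formed input and nil otherwise.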
-spc