It was thus said that the Great Gabriel Bertilson once stated:
> On Thu, Oct 3, 2019 at 9:11 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> >
> > OK then... But this is nearly OK except the charpattern which is very lax (including for the "extended" 31-bit definition where the pattern is overlong: the charpattern is only valid if you have first scanned the full text to validate its encoding, but charpattern cannot be used to scan the text correctly, but it will only correctly allow enumerating each lead byte, including invalid one, returning a sequence of arbitrary length that may not decode correctly as a single valid codepoint, or could map to a surrogate codepoint plus overlong trail bytes, and not necessarily paired with a following surrogate in the correct range: each sequence matched by this pattern is not necessarily valid as its lead byte may still be incorrect, and the sequence may still be overlong, or too short for the last sequence matched in the given text).
> 
> Yeah, the pattern can't be used for validation. That would only be
> possible if Lua patterns allowed alternation.

  Which is why we have LPEG.  Speaking of which, I do have a few modules
that deal with this.  All use LPEG.

	org.conman.parsers.ascii
		Matches one US-ASCII character (codes 0 to 127)

	org.conman.parsers.ascii.char
		Matches ASCII codes 32-126 (graphics set plus space)

	org.conman.parsers.ascii.control
		Matches the ASCII C0 control set (codes 0 to 31) plus delete
		(127, which technically isn't part of the C0 set)

	org.conman.parsers.ascii.ctrl
		Matches the ASCII C0 set (plus DEL) and translates the
		character to its name:

			0 - NUL
			1 - SOH ...

	org.conman.parsers.utf8
		Matches one (or more) UTF-8 code points greater than or equal to 128
		(see org.conman.parsers.utf8.control for more information)

	org.conman.parsers.utf8.char
		Matches one Unicode codepoint from 160 up to the last
		codepoint Unicode defines (that is, if I have it defined
		correctly).

	org.conman.parsers.utf8.control
		Matches the C1 control set.  This includes multicode
		sequences like CSI, DCS, SOS, OSC, PM and APC.  If these
		don't mean anything to you, think terminal (or ANSI, even
		though they technically aren't ANSI) escape codes.

	org.conman.parsers.utf8.ctrl
		Parses the C1 control set, returning both the name of the
		sequence and any associated data.

  Since these are LPEG patterns, they can be used in larger expressions, and
they are all available via LuaRocks.

  You can check out the code at <https://github.com/spc476/LPeg-Parsers>
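
  To illustrate the point about alternation, here is a minimal sketch of a
UTF-8 validator written directly in LPeg.  It is not taken from the modules
listed above; the names utf8char and is_valid_utf8 are made up for this
example, and the byte ranges follow the well-formed sequences of RFC 3629.
It assumes the lpeg module is installed.

```lua
local lpeg = require "lpeg"
local R, P = lpeg.R, lpeg.P

-- Continuation byte, 0x80-0xBF.
local cont = R"\128\191"

-- One well-formed UTF-8 sequence.  Each alternative pins down the byte
-- after the lead byte where needed, which is what rejects overlong forms,
-- surrogates, and values above U+10FFFF.  (Hypothetical name, for
-- illustration only.)
local utf8char =
    R"\0\127"                                -- U+0000..U+007F
  + R"\194\223" * cont                       -- U+0080..U+07FF
  + P"\224"     * R"\160\191" * cont         -- U+0800..U+0FFF (no overlongs)
  + R"\225\236" * cont * cont                -- U+1000..U+CFFF
  + P"\237"     * R"\128\159" * cont         -- U+D000..U+D7FF (no surrogates)
  + R"\238\239" * cont * cont                -- U+E000..U+FFFF
  + P"\240"     * R"\144\191" * cont * cont  -- U+10000..U+3FFFF
  + R"\241\243" * cont * cont * cont         -- U+40000..U+FFFFF
  + P"\244"     * R"\128\143" * cont * cont  -- U+100000..U+10FFFF

-- The whole subject must consist of well-formed sequences.
local valid = utf8char^0 * -1

-- Returns true if s is well-formed UTF-8, false otherwise.
local function is_valid_utf8(s)
  return valid:match(s) ~= nil
end

print(is_valid_utf8("caf\195\169"))   -- valid two-byte sequence
print(is_valid_utf8("\192\128"))      -- overlong encoding of NUL
print(is_valid_utf8("\237\160\128"))  -- encoded surrogate U+D800
```

  The ordered choice (+) is the piece Lua patterns lack: each lead byte
selects its own rule for the bytes that follow, so one expression both
splits the text and validates it.  Since utf8char is an ordinary LPeg
pattern, it can be dropped into a larger grammar in the same way.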

  -spc