lua-users home
lua-l archive



OK then. That is nearly fine, except for charpattern, which is very lax (the same holds for the "extended" 31-bit definition, where the pattern also accepts overlong sequences). The pattern is only valid if you have already scanned the full text to validate its encoding; it cannot itself be used to scan the text correctly. All it does reliably is enumerate lead bytes, including invalid ones, each followed by an arbitrary number of trail bytes. A sequence matched this way does not necessarily decode to a single valid code point: its lead byte may be invalid, it may be overlong, it may be too short (for the last sequence matched in the text), or it may encode a surrogate code point followed by extra trail bytes, with no guarantee that a matching surrogate in the correct range follows.
This should be specified more clearly.
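
To make the point above concrete, here is a small sketch in Python rather than Lua. It assumes the pattern documented for utf8.charpattern in the Lua 5.4 manual, "[\0-\x7F\xC2-\xFD][\x80-\xBF]*", rewritten as a byte regex (Lua 5.3 documents the same shape with a lead range ending at \xF4). The helper names and sample inputs are made up for illustration; Python's strict UTF-8 decoder stands in for a real validator.

```python
import re

# Byte-regex rendering of the pattern documented for utf8.charpattern in the
# Lua 5.4 manual (assumption: the two notations match the same byte strings).
CHARPATTERN = re.compile(rb"[\x00-\x7F\xC2-\xFD][\x80-\xBF]*")


def lax_chunks(data):
    """Split data the way the lax pattern would: one lead byte, any trail bytes."""
    return CHARPATTERN.findall(data)


def is_valid_utf8(chunk):
    """Check a chunk with Python's strict decoder (rejects surrogates, overlongs)."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical inputs: only the first one is well-formed UTF-8.
samples = {
    "valid U+00E9": b"\xC3\xA9",
    "surrogate U+D800": b"\xED\xA0\x80",
    "overlong 3-byte NUL": b"\xE0\x80\x80",
    "extra trail bytes": b"\xC3\xA9\x80\x80\x80",
    "truncated at end": b"\xE2\x82",
    "invalid lead 0xF8": b"\xF8\x88\x80\x80\x80",
}

for label, data in samples.items():
    chunks = lax_chunks(data)
    print(label, chunks, [is_valid_utf8(c) for c in chunks])
```

Each sample is returned by the lax pattern as a single "character", yet only the first passes strict decoding, which is the gap described above.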
I've never used such a lax charpattern, which can match arbitrarily long sequences (with an unlimited number of trail bytes) even though the overlong sequence could be detected much earlier, without any (costly) loop. Non-lax patterns match at most 4 bytes and need no internal buffering or potentially unbounded scanning. Such an unbounded pattern can be used to mount timing attacks even where there is no buffering problem, or it could cause internal memory problems.
I could not recommend any code that depends on this charpattern; it is not needed at all to safely validate input against malformed content.
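
As an illustration of what a non-lax matcher could look like, here is a sketch under stated assumptions, not something taken from the Lua library. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard (Table 3-7). It is written in Python because it needs alternation and counted repetition, which Lua patterns do not have. STRICT_CHAR and first_invalid_offset are hypothetical names.

```python
import re

# One well-formed UTF-8 sequence: never more than 4 bytes, no overlong forms,
# no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.
STRICT_CHAR = re.compile(
    rb"[\x00-\x7F]"                          # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, excluding overlongs
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
)


def first_invalid_offset(data):
    """Return the offset of the first ill-formed sequence, or None if all is well."""
    pos = 0
    while pos < len(data):
        match = STRICT_CHAR.match(data, pos)
        if match is None:
            return pos  # rejected after looking at no more than 4 bytes
        pos = match.end()
    return None


print(first_invalid_offset("h\u00e9llo \u20ac".encode("utf-8")))
print(first_invalid_offset(b"ab\xED\xA0\x80"))
```

Because every alternative has a fixed length, each step examines a bounded number of bytes, so a stray run of trail bytes is rejected at its first byte instead of being swallowed whole.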

On Fri, Oct 4, 2019 at 02:30, Gabriel Bertilson <arboreous.philologist@gmail.com> wrote:
On Thu, Oct 3, 2019 at 7:23 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> if needed the "utf8" library may use additional optional parameter to explicitly request the lax behavior

That's exactly what several of the utf8 functions do. See
https://www.lua.org/work/doc/manual.html#6.5.

— Gabriel