On Thu, Oct 3, 2019 at 9:11 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
>
> OK then... But this is nearly OK except the charpattern which is very lax (including for the "extended" 31-bit definition where the pattern is overlong: the charpattern is only valid if you have first scanned the full text to validate its encoding, but charpattern cannot be used to scan the text correctly, but it will only correctly allow enumerating each lead byte, including invalid one, returning a sequence of arbitrary length that may not decode correctly as a single valid codepoint, or could map to a surrogate codepoint plus overlong trail bytes, and not necessarily paired with a following surrogate in the correct range: each sequence matched by this pattern is not necessarily valid as its lead byte may still be incorrect, and the sequence may still be overlong, or too short for the last sequence matched in the given text).

Yeah, the pattern can't be used for validation. That would only be
possible if Lua patterns allowed alternation.

> I could not recommend using any code depending on this charpattern (not needed at all to safely validate any input against unsane contents)

utf8.charpattern works as advertised: applied to a valid UTF-8 string,
it matches the byte sequence of exactly one code point. I often use it
to match a code point in Lua patterns, and no other pattern does the
same job. The Lua 5.4 version, which now includes lead bytes for
5- and 6-byte sequences, will work just as well under the same
conditions, so I'll have no qualms about using it.
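
As a minimal sketch of that usage, assuming Lua 5.3 or later and a
subject string that is already known to be valid UTF-8 (the sample
string and the variable names are made up for illustration):

```lua
-- Split a valid UTF-8 string into its code points with utf8.charpattern.
-- The string is written with \u{...} escapes so that the example does not
-- depend on the encoding of the source file.
local s = "na\u{EF}ve \u{2603}"   -- "naive" with i-diaeresis, space, snowman

local chars = {}
for ch in s:gmatch(utf8.charpattern) do
  chars[#chars + 1] = ch          -- each match is the bytes of one code point
end

-- 10 bytes in total, but 7 code points.
print(#s, #chars)
```

Each element of `chars` is a one-code-point string, so `chars[3]` here is
the two-byte encoding of U+00EF.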

I don't expect utf8.charpattern to validate my UTF-8 for me. utf8.len
can be used for that instead; in Lua 5.4 it rejects encodings of
surrogate code points (which Lua 5.3 doesn't) and encodings of values
above the largest code point, U+10FFFF (such as 0xFFFFFF), unless you
ask for the lax behavior.
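
A small sketch of validating with utf8.len, assuming Lua 5.3 or later.
The helper name `is_valid_utf8` is hypothetical, and the `lax` argument
only has an effect in Lua 5.4 (Lua 5.3 ignores the extra argument):

```lua
-- Hypothetical helper: returns true if s is valid UTF-8, otherwise
-- false plus the position of the first invalid byte.
local function is_valid_utf8(s, lax)
  -- utf8.len returns the number of code points on success, or a false
  -- value plus the byte position of the first invalid byte on failure.
  local n, pos = utf8.len(s, 1, -1, lax)
  if n then
    return true
  end
  return false, pos
end

print(is_valid_utf8("h\u{E9}llo"))   -- valid input
print(is_valid_utf8("a\xFFb"))       -- 0xFF can never appear in UTF-8

-- Encoded surrogate U+D800: the result depends on the Lua version and,
-- in Lua 5.4, on whether lax decoding is requested.
print(is_valid_utf8("\xED\xA0\x80"))
print(is_valid_utf8("\xED\xA0\x80", true))
```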

It does weird me out that "\u{FFFFFF}" works now though.

> This should be specified more clearly.

The current description of utf8.charpattern, "matches exactly one
UTF-8 byte sequence, assuming that the subject is a valid UTF-8
string", is clear, for those who understand what "valid UTF-8" means,
and that if you apply the pattern to invalid UTF-8 you get undefined
behavior, in the tradition of C. What more would you like it to say?
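
To illustrate what that can look like in practice, here is a sketch,
assuming Lua 5.3 or later, that applies the pattern to three
deliberately invalid byte strings. The helper `pieces` and the sample
inputs are made up for this example:

```lua
-- Collect every match of utf8.charpattern in s.
local function pieces(s)
  local out = {}
  for ch in s:gmatch(utf8.charpattern) do
    out[#out + 1] = ch
  end
  return out
end

-- A lead byte with no continuation byte still matches on its own.
local truncated = pieces("\xC3(")

-- A 2-byte lead followed by two continuation bytes is swallowed as a
-- single match, although it does not encode one code point.
local extra = pieces("\xC2\x80\x80")

-- A stray continuation byte matches no lead byte and is skipped silently.
local stray = pieces("\x80a")

print(#truncated, #extra, #stray)
```

In none of these cases does the pattern report a problem; checking each
input with utf8.len first is what catches them.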

— Gabriel