On 2019-10-03 6:21 p.m., Philippe Verdy wrote:
> So I don't understand the rationale for changing the utf8 library for
> this "extended" but non-standard behavior that can just cause havoc
> and create severe security problems in applications that would then
> need to be deeply modified in many places to force them to revalidate
> their input and make sure that they will remain interoperable.
I could rant about this all day but tl;dr: never, ever ever ever ever
ever, sanitize before decoding. just don't.
And that's not what I advocated. I never said "sanitize before decoding"; I said "validate". Because the new utf8 implementation will no longer validate its input and will decode INVALID, non-standard UTF-8 as if it were standard, this will cause havoc everywhere downstream, effectively forcing existing apps to scatter revalidation code all over the place.
There's absolutely no good reason for Lua 5.4 to accept the behavior of the old, non-standard RFC proposal: there must not be any code point higher than 0x10FFFF, and the 31-bit "extension" is unsafe. I strongly vote against it. Those rare applications that want to use the old RFC will have to use their own separate implementation in a separate library, not the builtin "utf8" library, which should remain as clean as possible and safely reject non-standard codes (if needed, the "utf8" library may take an additional optional parameter to explicitly request the lax behavior).
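
To make concrete what "revalidating" means for an application, here is a rough sketch of the kind of check every caller would have to add if the builtin decoder became lax. It assumes a Lua 5.3 or later interpreter with the standard "utf8" library; the helper name is_strict_utf8 is made up for this illustration and is not part of any library.

```lua
-- Hypothetical helper (not part of the standard library): returns true only
-- if s is well-formed UTF-8 whose code points are all valid Unicode scalar
-- values (at most 0x10FFFF, and not in the surrogate range).
local function is_strict_utf8(s)
  -- utf8.len returns nil (plus a byte position) when it finds a byte
  -- sequence it cannot decode.
  if not utf8.len(s) then
    return false
  end
  -- Explicit range checks, in case the decoder itself accepts values
  -- outside the standard range.
  for _, cp in utf8.codes(s) do
    if cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      return false
    end
  end
  return true
end
```

With a strict decoder the second loop is redundant; with a lax one, it is exactly the code that would have to be repeated at every trust boundary.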