But the old, non-standard 31-bit variant uses a very lax pattern that is invalid in standard UTF-8.
It is not acceptable to treat as valid any sequence that starts with a lead byte (0xF5 to 0xFD) that is valid only in the old 31-bit variant.
The valid patterns are documented in The Unicode Standard (TUS), chapter 3 (Conformance), and in the standard RFC (RFC 3629); don't trust the old proposed RFC, which was never accepted.
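
To make the valid patterns concrete, here is a minimal sketch of a strict validator written directly from the table of well-formed byte sequences in TUS chapter 3. The function name is_strict_utf8 is hypothetical and is not part of Lua's utf8 library; the sketch assumes Lua 5.2 or later for the "\x" string escapes.

```lua
-- Hypothetical helper, not part of the standard utf8 library.
-- Returns true if s is well-formed standard UTF-8; otherwise returns
-- false plus the 1-based position where the offending sequence starts.
local function is_strict_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, lo, hi  -- sequence length, allowed range of the 2nd byte
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len, lo, hi = 2, 0x80, 0xBF
    elseif b == 0xE0 then len, lo, hi = 3, 0xA0, 0xBF  -- no overlongs
    elseif b == 0xED then len, lo, hi = 3, 0x80, 0x9F  -- no surrogates
    elseif b >= 0xE1 and b <= 0xEF then len, lo, hi = 3, 0x80, 0xBF
    elseif b == 0xF0 then len, lo, hi = 4, 0x90, 0xBF  -- no overlongs
    elseif b >= 0xF1 and b <= 0xF3 then len, lo, hi = 4, 0x80, 0xBF
    elseif b == 0xF4 then len, lo, hi = 4, 0x80, 0x8F  -- max U+10FFFF
    else
      -- 0x80..0xC1 and 0xF5..0xFF never start a standard sequence
      return false, i
    end
    for k = 1, len - 1 do
      local c = s:byte(i + k)            -- nil if the string is truncated
      if k > 1 then lo, hi = 0x80, 0xBF end
      if not c or c < lo or c > hi then return false, i end
    end
    i = i + len
  end
  return true
end

print(is_strict_utf8("h\xC3\xA9llo"))          -- well-formed
print(is_strict_utf8("\xF8\x88\x80\x80\x80"))  -- 5-byte form: rejected
print(is_strict_utf8("\xED\xA0\x80"))          -- surrogate U+D800: rejected
```

The special cases for 0xE0, 0xED, 0xF0 and 0xF4 are what exclude overlong forms, surrogates and code points above U+10FFFF; a check on lead and continuation bits alone would accept them.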
Otherwise there is a risk of accepting input that will break when it is converted to UTF-16, creating dangerous collisions or unexpected errors.
Standard UTF-8 means FULL interoperability with ALL standard UTFs (of which the old proposed 31-bit RFC variant has never been a part).
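
As an illustration of the kind of collision meant here, the sketch below uses a hypothetical, deliberately naive UTF-16 encoder (to_utf16_units is an invented name) that trusts its input to be at most U+10FFFF, as code written against standard UTF-8 may reasonably do. It assumes Lua 5.3 or later for the integer bitwise operators.

```lua
-- Hypothetical naive encoder: no range check on cp, because the caller
-- is assumed to have validated the input as standard UTF-8 already.
local function to_utf16_units(cp)
  if cp < 0x10000 then return { cp } end
  local v = cp - 0x10000
  -- A surrogate carries only 10 payload bits; anything above is lost.
  local hi = 0xD800 | ((v >> 10) & 0x3FF)
  local lo = 0xDC00 | (v & 0x3FF)
  return { hi, lo }
end

local a = to_utf16_units(0x10000)   -- valid code point
local b = to_utf16_units(0x110000)  -- only reachable via the lax variant

print(string.format("%04X %04X", a[1], a[2]))  --> D800 DC00
print(string.format("%04X %04X", b[1], b[2]))  --> D800 DC00
```

Two different inputs end up as the same UTF-16 sequence. Without the 10-bit mask the high unit for 0x110000 would be 0xDC00, which is a low surrogate, so the output would be malformed instead.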
There are extremely limited uses for deviating from the UTF-8 standard, and these use cases should be strictly isolated, so that code depending on standard conformance does not have to be rewritten.
Only these rare use cases should use their own encoding libraries, under a name other than "utf8".
See for example the variant used in Java for JNI, which is NOT labelled "utf8", except in a legacy field name used in legacy source code written in C.
Even Java no longer uses this binding, which is purely local to the internal legacy storage format of compiled classes; otherwise it uses another (de)coder, but not the one named "UTF-8".
Its main difference is the way it represents U+0000 NUL, as <0xC0,0x80> instead of a single byte <0x00>, so that it can be used with legacy C string types terminated by a null byte.
Modern string libraries don't use null-byte termination but byte buffers with a length property, and modern JNI applications no longer use the old interfaces based on 8-bit code units, but 16-bit code units instead.
This also applies to other languages such as JavaScript/ECMAScript/TypeScript/VBScript, and notably Python, which don't depend on null-byte termination to know the length of strings either, for any encoding and code unit size they support.
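
A small sketch of the NUL case, assuming Lua 5.3 or later (for the utf8 library). The helper to_modified_utf8 is a hypothetical name, and it only covers the U+0000 difference described above.

```lua
-- Hypothetical helper: rewrite each NUL byte as the two-byte form
-- <0xC0,0x80> used by the Java/JNI variant.
local function to_modified_utf8(s)
  -- the outer parentheses drop gsub's second result (the match count)
  return (s:gsub("\0", "\xC0\x80"))
end

local plain = "a\0b"
local modified = to_modified_utf8(plain)

print(#plain, #modified)    -- 3 bytes versus 4 bytes
print(utf8.len(plain))      -- 3: NUL is an ordinary character in Lua strings
print(utf8.len(modified))   -- fails at byte 2: <0xC0,0x80> is an overlong form
```

Lua strings carry their length, so the embedded NUL in the plain string causes no trouble, and the standard decoder rejects the two-byte form; this is why such a variant belongs in a separately named library.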

So I don't understand the rationale for changing the utf8 library to add this "extended" but non-standard behavior, which can only cause havoc and create severe security problems in applications. These would then need to be deeply modified in many places to force them to revalidate their input and make sure that they remain interoperable.


On Thu, 3 Oct 2019 at 23:05, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
> the only concern I have is over existing usage of e.g.
> utf8.codes(s:gsub(...)). it would probably be beneficial to make utf8.codes
> accept a start index before the lax switch, or otherwise enforce that the
> lax switch is not a number. (the start index is more appealing imo.)

Accepting a start index would not help in that case, would it?

-- Roberto
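
The concern in the quoted exchange comes from string.gsub returning two values, the new string and the number of substitutions, so a call of the form utf8.codes(s:gsub(...)) passes that count as a second argument. The sketch below shows this with a hypothetical stand-in function (describe) that only reports what it receives; it does not call utf8.codes itself, whose second parameter depends on the Lua version.

```lua
local s = "a,b,c"

-- Hypothetical stand-in for a function with an optional second
-- parameter; it reports the string, the type of the second argument,
-- and whether that argument would count as true in a boolean test.
local function describe(str, second)
  return str, type(second), second and "truthy" or "falsy"
end

print(describe(s:gsub(",", ";")))    --> a;b;c  number  truthy
print(describe((s:gsub(",", ";"))))  --> a;b;c  nil     falsy
print(describe(s:gsub("x", "y")))    --> a,b,c  number  truthy (count is 0)
```

Every number is true in Lua, including 0, so a boolean switch in second position would be turned on by any such call. If the second parameter were a start index instead, the same count would silently be taken as that index, which is why a start index does not help. Wrapping the call in parentheses, as in the second line, truncates the results to the string alone.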