lua-users home
lua-l archive





On 2019-10-03 11:14 a.m., Philippe Verdy wrote:
The change to utf8 makes it incompatible with the standard if it now accepts and decodes sequences of up to 6 bytes. The only standard version of UTF-8 is the universally adopted definition that limits code points to the 17 planes, up to U+10FFFF. The "original" UTF-8 specification was only an informative RFC that was deprecated many years ago and never adopted as a standard. All web standards use the version published jointly in the Unicode standard and in the RFC that replaced it (which was approved and adopted everywhere else).

I think this is a bad idea. Applications will now have to use their own libraries and check many dependencies to make sure they conform and treat erroneous data as invalid. UTF-8 is so universal today that it is needed in almost every application that uses the web or filesystems, including automated processes and systems that have no user interface of their own but are used as middleware. Having to rewrite it now is a bad idea, especially for small devices (including IoT).

You should not have changed this specification. The (extremely rare) applications that need such an extension should scope it in their own library, or could use another variant of the library. This change also adds new complications for existing libraries that parse and validate international text (including the builtin pattern engine or alternate regular expression engines).

Anyway, I strongly suggest that the compile-time options for the Lua 5.4 "utf8" library allow keeping the standard behavior in place. That is, 5-byte and 6-byte sequences, as well as their leading bytes (which are invalid in the standard), should be treated like the other sequences the library already recognizes as invalid, such as a valid lead byte not followed by the correct number of trail bytes, or trail bytes without any lead byte. Treating them as invalid is what is expected. If applications now have to make additional checks, this will just slow them down (or leave bugs in them, with undetected cases possibly creating exploitable security holes).

A "secure" build of Lua should have this option set by default to keep the utf8 library conforming to the standard. The old RFC behavior should then either not be supported or be moved into an optional secondary library, such as "oldutf8" instead of "utf8". But I bet that almost no one will ever want to use that old library, so it does not need to be built into the engine; it can be provided as an optional extension and loaded only on demand by code that explicitly requires it.
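For reference, the invalid-sequence categories listed above can be probed directly. This is a minimal sketch, assuming a released Lua 5.4 interpreter, where `utf8.len` without the `lax` argument returns fail plus the byte position of the first invalid sequence. The table names `cases` and `report` are illustrative only.

```lua
-- Assumes Lua 5.4. Each string contains one kind of invalid input.
local cases = {
  five_byte   = "\xF8\x88\x80\x80\x80", -- 5-byte form, encodes 0x200000
  stray_trail = "a\x80",                -- trail byte with no lead byte
  truncated   = "a\xE2\x82",            -- 3-byte lead, only one trail byte
}

-- Record what strict utf8.len reports for each case.
local report = {}
for name, s in pairs(cases) do
  local n, pos = utf8.len(s)            -- strict mode is the default
  report[name] = { n = n, pos = pos }
end

for name, r in pairs(report) do
  print(name, r.n, r.pos)
end
```

Under that assumption all three cases are reported as invalid (`n` is nil), with `pos` pointing at the first offending byte.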


utf8 has always accepted all sorts of invalid sequences when matching strings using the utf8 pattern.
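A small sketch of this point, assuming "the utf8 pattern" refers to `utf8.charpattern`: the pattern only checks for a plausible lead byte followed by any number of continuation bytes, so it matches sequences that `utf8.len` rejects. The variable names are illustrative.

```lua
-- A 3-byte lead followed by a single trail byte: not valid UTF-8.
local short = "\xE2\x82"
-- A valid 2-byte character followed by one extra trail byte.
local overfull = "\xC3\xA9\xA9"

-- utf8.charpattern matches both byte runs in full.
local m1 = short:match(utf8.charpattern)
local m2 = overfull:match(utf8.charpattern)

-- utf8.len, by contrast, reports where each string goes wrong.
local n1, pos1 = utf8.len(short)
local n2, pos2 = utf8.len(overfull)

print(m1 == short, n1, pos1)
print(m2 == overfull, n2, pos2)
```

So pattern matching alone is not a validity check; validation needs `utf8.len` or a decode step.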

a well-behaved program should never output invalid UTF-8 from valid UTF-8, and you still need to *explicitly* request 31-bit utf8 for decoding, so nothing has changed there.

but perhaps utf8 should be renamed to varint31? these changes are, after all, meant to reuse the same code for a small yet useful data interchange format based on 31-bit varints.
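The varint use can be sketched as a round trip, assuming Lua 5.4, where `utf8.char` accepts values up to 0x7FFFFFFF and `utf8.codes` takes an optional `lax` flag. The names `values`, `packed`, and `decoded` are illustrative.

```lua
-- Assumes Lua 5.4. Pack 31-bit integers with the UTF-8 byte layout.
local values = { 0, 127, 0x10FFFF, 0x110000, 0x7FFFFFFF }
local packed = utf8.char(table.unpack(values))

-- Decoding values above U+10FFFF needs the lax flag, requested explicitly.
local decoded = {}
for _, v in utf8.codes(packed, true) do
  decoded[#decoded + 1] = v
end

-- The strict default still rejects the same bytes as text.
local n, pos = utf8.len(packed)

print(#packed, #decoded, n, pos)
```

The strict call fails at the first value above U+10FFFF, so the same bytes are never silently treated as valid text unless the caller opts in.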

the only concern I have is over existing usage of e.g. utf8.codes(s:gsub(...)): gsub returns the substitution count as a second value, which would land in the lax argument and, being a number, count as true. it would probably be beneficial to make utf8.codes accept a start index before the lax switch, or otherwise enforce that the lax switch is not a number. (the start index is more appealing imo.)
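A minimal sketch of the pitfall, assuming Lua 5.4 semantics where the second argument of `utf8.codes` is the `lax` flag. The helper `iterates_cleanly` is hypothetical and exists only for this illustration.

```lua
-- Assumes Lua 5.4. The 5-byte sequence encodes 0x200000, which is above
-- U+10FFFF and therefore only decodable in lax mode.
local s = "x\xF8\x88\x80\x80\x80"

-- Hypothetical helper: runs utf8.codes over its arguments and reports
-- whether the iteration finished without raising an error.
local function iterates_cleanly(...)
  local args = table.pack(...)
  return (pcall(function()
    for _ in utf8.codes(table.unpack(args, 1, args.n)) do end
  end))
end

-- gsub returns (string, count). The count is a number, and every number
-- (even 0) is true in Lua, so lax mode is switched on by accident.
local forwarded = iterates_cleanly(s:gsub("y", "z"))

-- Extra parentheses keep only the string, so the strict default applies
-- and the out-of-range sequence raises an error.
local parenthesized = iterates_cleanly((s:gsub("y", "z")))

print(forwarded, parenthesized)
```

Until the API changes, wrapping the `gsub` call in parentheses is one way for existing code to keep strict behavior.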




On Thu, 3 Oct 2019 at 12:21, TonyMc <afmcc@btinternet.com> wrote:

    Hi,

    in the recent beta announcement there is a link to the changes at
    http://www.lua.org/work/doc/#changes .

    There is a typo there: coersions should be coercions.

    Thank you for the beta!

    Tony