The change to utf8 makes it incompatible with the standard if it now accepts and decodes sequences of up to 6 bytes.
The only standard version of UTF-8 is the universally adopted definition, which limits code points to the 17 planes, up to U+10FFFF.
The "original" specification of UTF-8 was only an informative RFC that was deprecated many years ago and never adopted as a standard. All web standards use the version published jointly in the Unicode standard and in the RFC that replaced it (which was approved and adopted everywhere else).
I think this is a bad idea. Applications will now have to use their own libraries and check many dependencies to make sure they conform and treat erroneous data as invalid.
The UTF-8 standard is so universal today that almost every application that touches the web or a filesystem needs it, including automated processes and systems that have no user interface of their own and are used as middleware.
Having to rewrite it now is a bad idea, especially for small devices (including IoT).
You should not have changed this specification. The (extremely rare) applications that need such an extension should scope it in their own library or use another variant of the library. This change also creates new complications for existing libraries that parse and validate international text (including the builtin pattern engine and alternative regular expression engines).

Anyway, I strongly suggest that the compile options for the Lua 5.4 "utf8" library include a setting that keeps the standard behavior in place. That is, 5-byte and 6-byte sequences, as well as their lead bytes (which are invalid in the standard), should be treated like the other sequences that the library already recognizes as invalid, such as a valid lead byte not followed by the correct number of trail bytes, or trail bytes with no preceding lead byte. Treating them as invalid is what is expected. If applications now have to make additional checks, this will just slow them down (or leave bugs in them, with undetected cases that may create exploitable security holes).
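
To make concrete what "standard behavior" would mean for a decoder, here is a minimal sketch in C of a strict check in the spirit of the current UTF-8 definition. It is not taken from Lua's utf8 library; the function names strict_utf8_decode and strict_utf8_valid are hypothetical and only illustrate the rules. Lead bytes for 5-byte and 6-byte sequences are rejected just like stray trail bytes, truncated sequences, overlong forms, surrogates, and anything above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Lua's utf8 library.
 * Decodes one UTF-8 sequence from s[0..n-1] under the strict rules.
 * Returns the number of bytes consumed (1..4) and stores the code
 * point in *cp, or returns 0 if the sequence is invalid. */
size_t strict_utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) { *cp = c; return 1; }       /* ASCII */
    else if (c < 0xC0) return 0;               /* trail byte without a lead byte */
    else if (c < 0xE0) { len = 2; c &= 0x1F; }
    else if (c < 0xF0) { len = 3; c &= 0x0F; }
    else if (c < 0xF8) { len = 4; c &= 0x07; }
    else return 0;                             /* 0xF8..0xFF: 5/6-byte leads are invalid */

    if (n < len) return 0;                     /* lead byte with too few trail bytes */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a trail byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_for_len[len]) return 0;        /* overlong encoding */
    if (c > 0x10FFFF) return 0;                /* beyond the 17 planes */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate code points */
    *cp = c;
    return len;
}

/* Hypothetical helper: returns 1 if the whole buffer is strict UTF-8, else 0. */
int strict_utf8_valid(const unsigned char *s, size_t n)
{
    uint32_t cp;
    while (n > 0) {
        size_t k = strict_utf8_decode(s, n, &cp);
        if (k == 0) return 0;
        s += k;
        n -= k;
    }
    return 1;
}
```

With a check like this, a 5-byte sequence starting with 0xF8 or a 6-byte sequence starting with 0xFC fails at the lead byte, exactly like any other malformed input, so the caller needs no extra pass over the decoded code points.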

A "secure" build of Lua should have this option set by default so that the utf8 library conforms to the standard. The old RFC behavior should then either not be supported or be moved to an optional secondary library such as "oldutf8" instead of "utf8". But I bet that almost no one will ever want to use that old library, so it would not need to be built into the engine; it could be provided as an optional extension and loaded only on demand by code that explicitly loads it.

On Thu, 3 Oct 2019 at 12:21, TonyMc <afmcc@btinternet.com> wrote:
Hi,

in the recent beta announcement there is a link to the changes at
http://www.lua.org/work/doc/#changes .

There is a typo there: coersions should be coercions.

Thank you for the beta!

Tony