- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Ross Berteig <Ross@...>
- Date: Tue, 8 Sep 2015 11:39:56 -0700
On 9/4/2015 2:38 PM, Coda Highland wrote:
> Besides, the standard maxim in these cases is "be liberal in what you
> accept; be conservative in what you send." Why should you throw an
> error when reading data that diverges from the standard if the result
> is still meaningful? Sure, don't GENERATE these UTF-8 codes, but don't
> barf on them either.
While I endorse the maxim most of the time, the restrictions in the
definition of UTF-8 are there for a specific reason: to require that
each valid Unicode code point have exactly one valid UTF-8
representation. That is part of a defense-in-depth approach to
preventing abuses that could occur if it were legal to write U+000000 as
anything other than the single byte 0x00, or to disguise other
semantically interesting characters with encodings other than their
usual shortest form.
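To make the risk concrete, here is the classic illustration of why the shortest-form rule matters (a sketch; the byte values are the well-known overlong encoding of '/', historically used in path-traversal attacks against lenient decoders):

```lua
-- U+002F '/' has exactly one legal UTF-8 encoding: the single byte 0x2F.
-- The UTF-8 bit pattern would also "fit" the code point into the two-byte
-- sequence 0xC0 0xAF, which strict decoders must reject as overlong.
local valid    = "\x2F"      -- the only legal encoding of '/'
local overlong = "\xC0\xAF"  -- illegal overlong form of the same code point

-- A byte-level filter that scans for "/" before decoding sees nothing
-- suspicious in the overlong form; if a later layer decodes it leniently,
-- the '/' reappears after the check has already passed.
print(valid:find("/", 1, true))     --> 1  1
print(overlong:find("/", 1, true))  --> nil
```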
That said, the mapping of bits used by UTF-8 does naturally extend to
allow representation of all 32-bit values including halves of surrogate
pairs (or complete pairs) and values beyond the defined range of Unicode
code points. Given that Lua has historically treated strings as (mostly)
opaque blobs, it does seem reasonable for it to be allowed to do the
same with "utf8".
Both goals could be achieved with a library routine that validates that
a given utf8 string is also valid UTF-8, perhaps returning flags for the
kinds of violations it found rather than just nil or false on failure.
It could even optionally repair the string by merging surrogate pairs or
rewriting longer sequences to the shortest possible sequence. But such
repair is exactly the case where you must be concerned that you are not
creating the very kind of attack opportunity that was defended against
by the stricter rules.
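A validator along those lines might look like this (a sketch only; the function name and the particular set of flags are made up for illustration, not a proposed API):

```lua
-- Scan a byte string and report which classes of strict-UTF-8 violations
-- it contains, rather than a bare nil/false. Returns a table whose keys
-- (overlong, surrogate, out_of_range, malformed, truncated) are set to
-- true when that kind of violation is found; empty means strictly valid.
local function utf8_violations(s)
  local flags = {}
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    if b < 0x80 then
      i = i + 1                       -- plain ASCII byte
    elseif b < 0xC0 or b > 0xF7 then
      flags.malformed = true          -- stray continuation or bad lead byte
      i = i + 1
    else
      local len, min                  -- sequence length, smallest legal value
      if b <= 0xDF then len, min = 2, 0x80
      elseif b <= 0xEF then len, min = 3, 0x800
      else len, min = 4, 0x10000 end
      if i + len - 1 > n then
        flags.truncated = true        -- sequence runs off the end
        break
      end
      local cp = b & (0x7F >> len)    -- payload bits of the lead byte
      for j = i + 1, i + len - 1 do
        local c = s:byte(j)
        if c < 0x80 or c > 0xBF then flags.malformed = true end
        cp = (cp << 6) | (c & 0x3F)
      end
      if cp < min then flags.overlong = true end
      if cp >= 0xD800 and cp <= 0xDFFF then flags.surrogate = true end
      if cp > 0x10FFFF then flags.out_of_range = true end
      i = i + len
    end
  end
  return flags
end

print(utf8_violations("\xC0\xAF").overlong)      --> true
print(utf8_violations("\xED\xA0\x80").surrogate) --> true
print(next(utf8_violations("plain ascii")))      --> nil
```

Note that utf8.len in the stock library already returns a false value plus the position of the first invalid byte sequence, so the new ground here would be the categorized flags and any (carefully considered) repair step.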
Ross Berteig Ross@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/