On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig wrote:
> On 9/4/2015 2:38 PM, Coda Highland wrote:
>> Besides, the standard maxim in these cases is "be liberal in what you
>> accept; be conservative in what you send." Why should you throw an
>> error when reading data that diverges from the standard if the result
>> is still meaningful? Sure, don't GENERATE these UTF-8 codes, but don't
>> barf on them either.
> While I endorse the maxim most of the time, the restrictions in the
> definition of UTF-8 are there for a specific reason: to require that each
> valid Unicode code point have exactly one valid UTF-8 representation. That
> is part of a defense in depth approach to preventing abuses that could occur
> if it were legal to write U+0000 as anything other than the single byte
> 0x00, or to disguise other semantically interesting characters with
> encodings other than their usual representation.
> That said, the mapping of bits used by UTF-8 does naturally extend to allow
> representation of all 32-bit values including halves of surrogate pairs (or
> complete pairs) and values beyond the defined range of Unicode code points.
> Given that Lua has historically treated strings as (mostly) opaque blobs, it
> does seem reasonable for it to be allowed to do the same with "utf8".
> Both goals could be achieved with a library routine that validates that a
> given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
> of violations it found rather than just nil or false on failure. It could
> even optionally repair the string by merging surrogate pairs or rewriting
> longer sequences to the shortest possible sequence. But such repair is
> exactly the case where you must be concerned that you are not creating the
> very kind of attack opportunity that was defended against by the stricter
> rules.

This is, in fact, what I had suggested -- a function for validation,
and a function for normalization.
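
To make the validation half concrete, here is a rough sketch in Python
(chosen only for illustration; this is not Lua's utf8 library, and the
names Violation, decode_lax and check_utf8 are all made up for this
example). It decodes using the extended bit layout (sequences of up to
six bytes) and reports which kinds of violation it saw rather than
returning a bare nil/false:

```python
from enum import Flag, auto


class Violation(Flag):
    """Hypothetical flag set; the names are illustrative only."""
    NONE = 0
    OVERLONG = auto()      # longer sequence than the code point needs
    SURROGATE = auto()     # U+D800..U+DFFF encoded directly
    OUT_OF_RANGE = auto()  # value above U+10FFFF
    MALFORMED = auto()     # bad lead byte or missing continuation byte


# Smallest value that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

# (upper bound of lead byte, sequence length, payload mask of lead byte)
LEAD_TABLE = [(0x80, 1, 0x7F), (0xC0, 0, 0x00), (0xE0, 2, 0x1F),
              (0xF0, 3, 0x0F), (0xF8, 4, 0x07), (0xFC, 5, 0x03),
              (0xFE, 6, 0x01)]


def decode_lax(data: bytes):
    """Yield (value, length); value is None for a malformed byte."""
    i = 0
    while i < len(data):
        b = data[i]
        n, mask = 0, 0
        for bound, length, m in LEAD_TABLE:
            if b < bound:
                n, mask = length, m
                break
        tail = data[i + 1:i + n]
        if n == 0 or len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
            # Stray continuation byte, 0xFE/0xFF, or truncated sequence.
            yield None, 1
            i += 1
            continue
        value = b & mask
        for t in tail:
            value = (value << 6) | (t & 0x3F)
        yield value, n
        i += n


def check_utf8(data: bytes) -> Violation:
    """Return the union of all violation kinds found in data."""
    flags = Violation.NONE
    for value, n in decode_lax(data):
        if value is None:
            flags |= Violation.MALFORMED
            continue
        if value < MIN_FOR_LENGTH[n]:
            flags |= Violation.OVERLONG
        if 0xD800 <= value <= 0xDFFF:
            flags |= Violation.SURROGATE
        if value > 0x10FFFF:
            flags |= Violation.OUT_OF_RANGE
    return flags
```

A caller that only wants strict UTF-8 can test the result against
Violation.NONE; a caller that wants to be liberal can decide which of
the flags it is willing to tolerate.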

Of note, normalization can in fact be done in a way immune to
malfeasance. What you do with the string AFTER normalization may, of
course, still be a risk, but running a syntactic normalization pass
before a subsequent semantic-level validation (that is, validating not
just the UTF-8 encoding but the contents of the string) makes that risk
easier to guard against, because post-normalization you can be sure
that problematic characters (e.g. control characters or embedded nulls)
have exactly one canonical representation.
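
A rough sketch of that two-pass idea, again in Python purely for
illustration (normalize_utf8 and has_control_chars are hypothetical
names, not part of any Lua API). The policy choices here are
assumptions: overlong sequences are rewritten to shortest form,
surrogate pairs are merged, and anything that cannot be made valid
(lone surrogates, values above U+10FFFF, malformed bytes) raises an
error instead of being guessed at:

```python
def _decode_lax(data: bytes) -> list:
    """Decode extended-layout UTF-8 (up to 6 bytes) into integer values."""
    values, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, value = 1, b
        else:
            # Number of leading 1 bits in the lead byte = sequence length.
            n = 8 - (b ^ 0xFF).bit_length()
            if n < 2 or n > 6:
                raise ValueError("bad lead byte at offset %d" % i)
            value = b & (0x7F >> n)
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError("bad continuation at offset %d" % i)
        for t in tail:
            value = (value << 6) | (t & 0x3F)
        values.append(value)
        i += n
    return values


def normalize_utf8(data: bytes) -> bytes:
    """Syntactic pass: re-encode every code point in its shortest form."""
    values = _decode_lax(data)
    chars, i = [], 0
    while i < len(values):
        cp = values[i]
        # Merge a high/low surrogate pair into one supplementary code point.
        if (0xD800 <= cp <= 0xDBFF and i + 1 < len(values)
                and 0xDC00 <= values[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (values[i + 1] - 0xDC00)
            i += 1
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("cannot normalize value 0x%X" % cp)
        chars.append(chr(cp))
        i += 1
    # Python's encoder always emits the shortest valid form.
    return "".join(chars).encode("utf-8")


def has_control_chars(normalized: bytes) -> bool:
    """Semantic pass: after normalization, C0 controls, NUL and DEL can
    only appear as their single-byte form, so a byte scan is enough."""
    return any(b < 0x20 or b == 0x7F for b in normalized)
```

The point of the ordering is that has_control_chars would miss an
embedded null written as 0xC0 0x80 if it ran on the raw input, but
cannot miss it once normalize_utf8 has rewritten it to the single byte
0x00.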

/s/ Adam