
On 9/8/2015 12:20 PM, Coda Highland wrote:
On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig wrote:
Both goals could be achieved with a library routine that validates that a
given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
of violations it found rather than just nil or false on failure. It could
even optionally repair the string by merging surrogate pairs or rewriting
longer sequences to the shortest possible sequence. But such repair is
exactly the case where you must be concerned that you are not creating the
very kind of attack opportunity that was defended against by the stricter

This is, in fact, what I had suggested -- a function for validation,
and a function for normalization.

Of note, normalization can in fact be done in a way immune to
malfeasance. What you do with the string AFTER normalization may, of
course, be a risk, but having a syntactic normalization pass before a
subsequent semantic-level validation (that is, not just validating the
UTF-8 string but validating the contents of it) will make it easier to
protect against it, because post-normalization you can be sure that
problematic characters (e.g. control characters or embedded nulls) can
only have one canonical representation.

Exactly. The concern is to make sure that any normalization occurs *before* any semantic validation at all is done. Every time. Throughout your entire system. Otherwise, you run the risk of someone slipping something through, and next thing you know little Bobby Tables[1] owns your web store...
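
To make the ordering concrete, here is a minimal sketch in Python (used only for illustration; nothing here is part of Lua's utf8 library). The names decode_lenient, normalize, naive_check and safe_check are hypothetical. The decoder deliberately tolerates overlong sequences and encoded surrogates, the normalizer merges surrogate pairs and re-encodes everything in shortest form, and the two checks show why a semantic test for embedded NULs only means something after normalization.

```python
def decode_lenient(data):
    """Decode UTF-8-like bytes to code points, tolerating overlong
    sequences and encoded surrogates (assumed lenient policy)."""
    cps, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0
        elif 0xC0 <= b < 0xE0:
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b < 0xF0:
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b < 0xF8:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        for j in range(i + 1, i + 1 + extra):
            # Every continuation byte must exist and look like 10xxxxxx.
            if j >= n or (data[j] & 0xC0) != 0x80:
                raise ValueError(f"bad continuation at offset {j}")
            cp = (cp << 6) | (data[j] & 0x3F)
        cps.append(cp)
        i += 1 + extra
    return cps


def normalize(data):
    """Return strict, shortest-form UTF-8: merge surrogate pairs,
    reject lone surrogates and values above U+10FFFF."""
    cps = decode_lenient(data)
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        if (0xD800 <= cp <= 0xDBFF and i + 1 < len(cps)
                and 0xDC00 <= cps[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (cps[i + 1] - 0xDC00)
            i += 1
        elif 0xD800 <= cp <= 0xDFFF:
            raise ValueError("lone surrogate")
        elif cp > 0x10FFFF:
            raise ValueError("code point out of range")
        out.append(cp)
        i += 1
    return "".join(map(chr, out)).encode("utf-8")


def naive_check(raw):
    """Semantic check on the raw bytes: misses an overlong NUL."""
    return b"\x00" not in raw


def safe_check(raw):
    """Same semantic check, but only after normalization."""
    return b"\x00" not in normalize(raw)


payload = b"a\xC0\x80b"          # overlong two-byte encoding of NUL
print(naive_check(payload))      # the raw check sees no 0x00 byte
print(safe_check(payload))       # the normalized form exposes the NUL
```

Whether a real library should repair such input or simply reject it is a policy decision; the sketch only illustrates that once the byte string is in canonical form, each problematic character has exactly one spelling to test for.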


Unfortunately, a lot of systems take a very relaxed attitude to separating semantic content from representation on the wire. I'm still fighting with to get their parsing of Markdown to behave consistently with respect to some characters that need to be presented as HTML entities, for instance. They clearly normalize text more than once, and sometimes validate more than once, with stupidly wrong results. UTF-8 is at least normalizable in a way that would stabilize and be immune to further normalization.
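
The difference between a transformation that stabilizes and one that does not can be sketched in a few lines of Python (again only an illustration, not a description of how any particular site processes Markdown). html.escape is the standard-library function; canonical_utf8 is a hypothetical helper in which a strict decode followed by a re-encode stands in for a full normalizer.

```python
import html


def canonical_utf8(data):
    """Strictly decode and re-encode; a stand-in for a UTF-8 normalizer.
    Raises UnicodeDecodeError on anything that is not valid UTF-8."""
    return data.decode("utf-8").encode("utf-8")


text = "fish & chips"
once = html.escape(text)
twice = html.escape(once)
# Entity escaping is not idempotent: a second pass escapes the
# ampersand that the first pass introduced.
print(once != twice)

raw = "fish & chips \u00e9".encode("utf-8")
# Canonical re-encoding is a fixed point: running it again is harmless.
print(canonical_utf8(canonical_utf8(raw)) == canonical_utf8(raw))
```

A pipeline that escapes or unescapes entities more than once changes the text each time, while a canonical UTF-8 form can be re-normalized at any stage without altering it.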

Ross Berteig                     
Cheshire Engineering Corp.