- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Ross Berteig <Ross@...>
- Date: Tue, 8 Sep 2015 13:51:30 -0700
On 9/8/2015 12:20 PM, Coda Highland wrote:
> On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig <Ross@cheshireeng.com> wrote:
>> Both goals could be achieved with a library routine that validates that a
>> given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
>> of violations it found rather than just nil or false on failure. It could
>> even optionally repair the string by merging surrogate pairs or rewriting
>> longer sequences to the shortest possible sequence. But such repair is
>> exactly the case where you must be concerned that you are not creating the
>> very kind of attack opportunity that was defended against by the stricter
>
> This is, in fact, what I had suggested -- a function for validation,
> and a function for normalization.
>
> Of note, normalization can in fact be done in a way immune to
> malfeasance. What you do with the string AFTER normalization may, of
> course, be a risk, but having a syntactic normalization pass before a
> subsequent semantic-level validation (that is, not just validating the
> UTF-8 string but validating the contents of it) will make it easier to
> protect against it, because post-normalization you can be sure that
> problematic characters (e.g. control characters or embedded nulls) can
> only have one canonical representation.

Exactly. The concern is to make sure that any normalization occurs
*before* any semantic validation at all is done. Every time. Throughout
your entire system. Otherwise, you run the risk of someone slipping
something through, and next thing you know little Bobby Tables owns
your web store...
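
To make the ordering concern concrete, here is a rough sketch in Python (used only to keep the example short and self-contained; nothing here is part of Lua's utf8 library). The names Utf8Issue, lenient_decode and normalize are hypothetical. The decoder deliberately accepts overlong forms and surrogates, reports what it saw as flags, and the normalizer merges surrogate pairs and re-encodes everything in shortest form. Replacing unrepairable input with U+FFFD is my assumption, not something proposed in this thread.

```python
from enum import Flag, auto


class Utf8Issue(Flag):
    """Kinds of violation found while decoding (hypothetical flag set)."""
    NONE = 0
    OVERLONG = auto()
    SURROGATE = auto()
    OUT_OF_RANGE = auto()
    MALFORMED = auto()


# Smallest code point that legitimately needs a sequence of each length.
_MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def lenient_decode(data: bytes):
    """Decode permissively and return (code_points, issues)."""
    cps, issues, i = [], Utf8Issue.NONE, 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            length, cp = 1, b
        elif 0xC0 <= b < 0xE0:
            length, cp = 2, b & 0x1F
        elif 0xE0 <= b < 0xF0:
            length, cp = 3, b & 0x0F
        elif 0xF0 <= b < 0xF8:
            length, cp = 4, b & 0x07
        else:
            length, cp = 0, 0  # stray continuation or invalid lead byte
        tail = data[i + 1:i + length]
        bad_tail = any((t & 0xC0) != 0x80 for t in tail)
        if length == 0 or len(tail) != length - 1 or bad_tail:
            issues |= Utf8Issue.MALFORMED
            cps.append(0xFFFD)  # assumption: substitute U+FFFD and resync
            i += 1
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < _MIN_FOR_LENGTH[length]:
            issues |= Utf8Issue.OVERLONG
        if 0xD800 <= cp <= 0xDFFF:
            issues |= Utf8Issue.SURROGATE
        if cp > 0x10FFFF:
            issues |= Utf8Issue.OUT_OF_RANGE
            cp = 0xFFFD
        cps.append(cp)
        i += length
    return cps, issues


def normalize(data: bytes) -> bytes:
    """Merge surrogate pairs and re-encode every code point in shortest form."""
    cps, _ = lenient_decode(data)
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        is_high = 0xD800 <= cp <= 0xDBFF
        if is_high and i + 1 < len(cps) and 0xDC00 <= cps[i + 1] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (cps[i + 1] - 0xDC00)
            i += 1
        elif 0xD800 <= cp <= 0xDFFF:
            cp = 0xFFFD  # lone surrogate cannot be repaired
        out.append(chr(cp))
        i += 1
    return "".join(out).encode("utf-8")


# 0xC0 0xA7 is an overlong two-byte encoding of the apostrophe (0x27).
raw = b"Robert\xc0\xa7); DROP TABLE Students;--"
_, issues = lenient_decode(raw)
print(issues)                  # the flags include OVERLONG
print(b"'" in raw)             # False: a quote check on the raw bytes sees nothing
print(b"'" in normalize(raw))  # True: the same check after normalization finds it
```

A semantic filter that looks for the quote before normalization passes this input, and a later normalization step then hands the quote to whatever sits downstream. Run the filter on the output of normalize and there is only one representation left to look for.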
Unfortunately, a lot of systems take a very relaxed attitude to
separating semantic content from representation on the wire. I'm still
fighting with WordPress.com to get their parsing of Markdown to behave
consistently with respect to some characters that need to be presented
as HTML entities, for instance. They clearly normalize text more than
once, and sometimes validate more than once, with stupidly wrong
results. UTF-8 can at least be normalized into a stable form: once
normalized, further normalization passes leave it unchanged.
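
A small Python sketch of the difference, under the assumption that the double-processing problem looks like repeated entity escaping (I don't know what WordPress.com actually does internally). escape_entities and normalize_utf8 are hypothetical names; html.escape and the errors="replace" decoder are standard library. Note that normalize_utf8 here replaces invalid sequences with U+FFFD rather than repairing them, which is just one possible normalization.

```python
import html


def escape_entities(text: str) -> str:
    """Entity escaping is not idempotent: each pass escapes the '&' again."""
    return html.escape(text)


def normalize_utf8(data: bytes) -> bytes:
    """Replace anything that is not valid UTF-8 with U+FFFD and re-encode.

    The output is always valid UTF-8, so a second pass changes nothing.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


once = escape_entities("fish & chips")
twice = escape_entities(once)
print(once)   # fish &amp; chips
print(twice)  # fish &amp;amp; chips

# Valid text, an overlong apostrophe, and a byte that can never occur in UTF-8.
raw = b"caf\xc3\xa9 \xc0\xa7 \xff"
clean = normalize_utf8(raw)
print(normalize_utf8(clean) == clean)  # True: already at the fixed point
```

The first transformation has to be applied exactly once at exactly one place in the pipeline; the second can safely be applied at every boundary.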
Ross Berteig Ross@CheshireEng.com
Cheshire Engineering Corp. http://www.CheshireEng.com/