- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Coda Highland <chighland@...>
- Date: Tue, 8 Sep 2015 14:08:19 -0700
On Tue, Sep 8, 2015 at 1:51 PM, Ross Berteig <Ross@cheshireeng.com> wrote:
> On 9/8/2015 12:20 PM, Coda Highland wrote:
>> On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig <Ross@cheshireeng.com> wrote:
>>> Both goals could be achieved with a library routine that validates that a
>>> given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
>>> of violations it found rather than just nil or false on failure. It could
>>> even optionally repair the string by merging surrogate pairs or rewriting
>>> longer sequences to the shortest possible sequence. But such repair is
>>> exactly the case where you must be concerned that you are not creating the
>>> very kind of attack opportunity that was defended against by the stricter
>> This is, in fact, what I had suggested -- a function for validation,
>> and a function for normalization.
>> Of note, normalization can in fact be done in a way immune to
>> malfeasance. What you do with the string AFTER normalization may, of
>> course, be a risk, but having a syntactic normalization pass before a
>> subsequent semantic-level validation (that is, not just validating the
>> UTF-8 string but validating the contents of it) will make it easier to
>> protect against it, because post-normalization you can be sure that
>> problematic characters (e.g. control characters or embedded nulls) can
>> only have one canonical representation.
> Exactly. The concern is to make sure that any normalization occurs *before*
> any semantic validation at all is done. Every time. Throughout your entire
> system. Otherwise, you run the risk of someone slipping something through,
> and next thing you know little Bobby Tables owns your web store...
> https://xkcd.com/327/
> Unfortunately, a lot of systems take a very relaxed attitude to separating
> semantic content from representation on the wire. I'm still fighting with
> WordPress.com to get their parsing of Markdown to behave consistently with
> respect to some characters that need to be presented as HTML entities, for
> instance. They clearly normalize text more than once, and sometimes validate
> more than once, with stupidly wrong results. UTF-8 is at least normalizable
> in a way that would stabilize and be immune to further normalization.
Well yes, but WordPress's security is closer to "Swiss cheese" than
"Swiss bank," so this doesn't surprise me in the slightest.
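
To make the validate-then-normalize idea concrete, here is a minimal sketch. It is in Python purely for illustration and is not a proposal for the actual Lua utf8 library API; the names decode_lenient, merge_surrogates, and normalize, and the flag strings, are all made up for this example. It assumes a lenient decoder that accepts overlong forms and encoded surrogates, reports which kinds of violation it saw, and re-encodes everything in shortest form.

```python
# Flags for the kinds of violations found (illustrative names only).
OVERLONG, SURROGATE, INVALID = "overlong", "surrogate", "invalid"


def decode_lenient(data: bytes):
    """Decode UTF-8-like bytes, tolerating overlong forms and surrogates.

    Returns (code_points, flags). Malformed bytes become U+FFFD.
    """
    cps, flags, i = [], set(), 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            length, cp, minimum = 1, b, 0
        elif 0xC0 <= b < 0xE0:
            length, cp, minimum = 2, b & 0x1F, 0x80
        elif 0xE0 <= b < 0xF0:
            length, cp, minimum = 3, b & 0x0F, 0x800
        elif 0xF0 <= b < 0xF8:
            length, cp, minimum = 4, b & 0x07, 0x10000
        else:
            length = 0  # stray continuation byte or invalid lead byte
        tail = data[i + 1:i + length]
        if length == 0 or len(tail) != length - 1 \
                or any(t & 0xC0 != 0x80 for t in tail):
            cps.append(0xFFFD)
            flags.add(INVALID)
            i += 1
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        i += length
        if cp > 0x10FFFF:
            cps.append(0xFFFD)
            flags.add(INVALID)
            continue
        if cp < minimum:
            flags.add(OVERLONG)  # a shorter encoding exists
        if 0xD800 <= cp <= 0xDFFF:
            flags.add(SURROGATE)
        cps.append(cp)
    return cps, flags


def merge_surrogates(cps):
    """Combine high/low surrogate pairs; lone surrogates become U+FFFD."""
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        if 0xD800 <= cp <= 0xDBFF and nxt is not None \
                and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            out.append(0xFFFD if 0xD800 <= cp <= 0xDFFF else cp)
            i += 1
    return out


def normalize(data: bytes):
    """Return (canonical_utf8_bytes, flags_found_in_input)."""
    cps, flags = decode_lenient(data)
    fixed = "".join(map(chr, merge_surrogates(cps))).encode("utf-8")
    return fixed, flags


# An overlong-encoded apostrophe (C0 A7) collapses to a plain "'",
# so a later check for quote characters cannot be bypassed.
clean, flags = normalize(b"Robert\xc0\xa7); DROP TABLE")
assert clean == b"Robert'); DROP TABLE" and flags == {OVERLONG}
```

The point of the sketch is the ordering: any check for quotes, control characters, or embedded nulls should only ever look at the output of normalize, never at the raw input. Running normalize a second time on its own output changes nothing and reports no flags, which is the "stabilizes" property mentioned above.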