Re: Should Lua be more strict about Unicode errors?

lua-l archive


Subject: Re: Should Lua be more strict about Unicode errors?
From: Ross Berteig <Ross@...>
Date: Tue, 8 Sep 2015 11:39:56 -0700

On 9/4/2015 2:38 PM, Coda Highland wrote:

Besides, the standard maxim in these cases is "be liberal in what you
accept; be conservative in what you send." Why should you throw an
error when reading data that diverges from the standard if the result
is still meaningful? Sure, don't GENERATE these UTF-8 codes, but don't
barf on them either.

While I endorse the maxim most of the time, the restrictions in the
definition of UTF-8 are there for a specific reason: to require that
each valid Unicode code point have exactly one valid UTF-8
representation. That is part of a defense-in-depth approach to
preventing abuses that could occur if it were legal to write U+0000 as
anything other than the single byte 0x00, or to disguise other
semantically interesting characters under encodings other than their
usual representation.
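
To make the risk concrete, here is a minimal sketch, assuming Lua 5.3 or
later for the bitwise operators. The function lax_decode is hypothetical
and deliberately naive: it handles only one- and two-byte sequences and
does not reject overlong forms, which is exactly the behavior the strict
rules forbid.

```lua
-- Hypothetical lax decoder for 1- and 2-byte sequences only. It does
-- NOT reject overlong forms; that omission is the point of the demo.
local function lax_decode(s)
  local out, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    if b >= 0xC0 and b < 0xE0 and i < #s then
      -- 110xxxxx 10yyyyyy -> xxxxxyyyyyy, with no minimum-value check
      out[#out + 1] = ((b & 0x1F) << 6) | (s:byte(i + 1) & 0x3F)
      i = i + 2
    else
      out[#out + 1] = b
      i = i + 1
    end
  end
  return out
end

local plain  = "/"         -- the one valid encoding of U+002F
local sneaky = "\xC0\xAF"  -- overlong two-byte form of the same value

-- A byte-level filter looking for "/" does not see the overlong form...
assert(sneaky:find("/", 1, true) == nil)
-- ...yet the lax decoder yields the same code point for both inputs.
assert(lax_decode(plain)[1] == 0x2F)
assert(lax_decode(sneaky)[1] == 0x2F)
```

Any check performed on the raw bytes before decoding can be bypassed this
way, which is why a strict decoder treats the second form as an error.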

That said, the mapping of bits used by UTF-8 does naturally extend to
allow representation of all 32-bit values including halves of surrogate
pairs (or complete pairs) and values beyond the defined range of Unicode
code points. Given that Lua has historically treated strings as (mostly)
opaque blobs, it does seem reasonable for it to be allowed to do the
same with "utf8".
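
As an illustration of how the bit layout extends, here is a rough
sketch, assuming Lua 5.3 or later. The function lax_encode is
hypothetical and not part of any library. It applies the UTF-8 pattern
to any integer without checking for surrogates or the U+10FFFF limit,
and as a simplification it stops at six bytes, so it covers values up to
0x7FFFFFFF only.

```lua
-- Hypothetical encoder: applies the UTF-8 bit layout to any integer in
-- 0..0x7FFFFFFF, with no check for surrogates or the U+10FFFF limit.
local function lax_encode(v)
  assert(v >= 0 and v <= 0x7FFFFFFF, "value out of range for six bytes")
  if v < 0x80 then return string.char(v) end
  local bytes = {}
  local first_max = 0x3F   -- largest payload the lead byte can hold
  repeat
    -- emit continuation bytes (10xxxxxx) from the low end upward
    table.insert(bytes, 1, string.char(0x80 | (v & 0x3F)))
    v = v >> 6
    first_max = first_max >> 1
  until v <= first_max
  -- lead byte: one high 1 bit per byte in the sequence, a 0, then payload
  local lead = ((~first_max << 1) & 0xFF) | v
  table.insert(bytes, 1, string.char(lead))
  return table.concat(bytes)
end

assert(lax_encode(0xE9)     == "\xC3\xA9")          -- ordinary code point
assert(lax_encode(0xD800)   == "\xED\xA0\x80")      -- lone surrogate half
assert(lax_encode(0x110000) == "\xF4\x90\x80\x80")  -- beyond U+10FFFF
```

The last two results are well-formed under the bit pattern but are not
valid UTF-8 under the stricter definition.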

Both goals could be achieved with a library routine that validates that
a given utf8 string is also valid UTF-8, perhaps returning flags for the
kinds of violations it found rather than just nil or false on failure.
It could even optionally repair the string by merging surrogate pairs or
rewriting longer sequences to the shortest possible sequence. But such
repair is exactly the case where you must be concerned that you are not
creating the very kind of attack opportunity that was defended against
by the stricter rules.
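
A minimal sketch of such a validating routine might look like the
following, assuming Lua 5.3 or later. The name utf8_check and the flag
names are hypothetical and not part of Lua's utf8 library. The sketch
only reports violations and does not attempt any repair.

```lua
-- Hypothetical helper, not part of Lua's utf8 library. Scans s as
-- generalized ("lax") UTF-8 and reports which strict rules it breaks.
local function utf8_check(s)
  local flags = { overlong = false, surrogate = false,
                  out_of_range = false, malformed = false }
  -- smallest value that legitimately needs a sequence of each length
  local min_for_len = { 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 }
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, cp
    if     b < 0x80 then len, cp = 1, b
    elseif b < 0xC0 then len = nil            -- stray continuation byte
    elseif b < 0xE0 then len, cp = 2, b & 0x1F
    elseif b < 0xF0 then len, cp = 3, b & 0x0F
    elseif b < 0xF8 then len, cp = 4, b & 0x07
    elseif b < 0xFC then len, cp = 5, b & 0x03
    elseif b < 0xFE then len, cp = 6, b & 0x01
    end                                       -- 0xFE, 0xFF: len stays nil
    local ok = len ~= nil and i + len - 1 <= n
    if ok then
      for j = i + 1, i + len - 1 do
        local c = s:byte(j)
        if (c & 0xC0) ~= 0x80 then ok = false; break end
        cp = (cp << 6) | (c & 0x3F)
      end
    end
    if not ok then
      flags.malformed = true
      i = i + 1                               -- resynchronize on next byte
    else
      if cp < min_for_len[len] then flags.overlong = true end
      if cp >= 0xD800 and cp <= 0xDFFF then flags.surrogate = true end
      if cp > 0x10FFFF then flags.out_of_range = true end
      i = i + len
    end
  end
  local valid = not (flags.overlong or flags.surrogate
                     or flags.out_of_range or flags.malformed)
  return valid, flags
end

local ok, why = utf8_check("a\xC0\x80b")
assert(ok == false and why.overlong == true and why.malformed == false)
```

Returning the flags table alongside the boolean lets a caller decide
which violations it is willing to tolerate instead of treating every
failure the same way.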


--
Ross Berteig                               Ross@CheshireEng.com
Cheshire Engineering Corp.           http://www.CheshireEng.com/

