On 1 July 2017 at 03:44, Jay Carlson <nop@nop.com> wrote:
> On Jun 29, 2017, at 7:45 PM, Duane Leslie <parakleta@darkreality.org> wrote:
>
>> Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.
>
> Ahh, now I remember why I kept my own UTF-8 validator. Lua’s behavior seems out of conformance with RFC 3629, and this isn’t just a SHOULD in the RFC, it’s a MUST.
>
> Quoting https://tools.ietf.org/html/rfc3629#section-3 :
>
>> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
>>
>> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.

This is deliberate, for interoperability. This deviation from UTF-8 is
even called out (twice) in the Wikipedia article on UTF-8 as a common
choice:
  - https://en.wikipedia.org/wiki/UTF-8#WTF-8
  - https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

Note that this does mean that if you need to verify that a string is
strictly valid UTF-8, you have to check for surrogates (code points
U+D800 through U+DFFF) yourself.
You can find an example of this in lua-http's websocket library:
https://github.com/daurnimator/lua-http/blob/0b54603bfc132dcb9add76a61d3b50b4439031b2/http/websocket.lua#L89
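
For illustration, a strict check along these lines could look roughly like
the sketch below. This is not the lua-http code linked above; the function
name `is_strict_utf8` is made up for this example, and it assumes the
standard `utf8` library from Lua 5.3 or later. It relies on `utf8.len` to
reject malformed and overlong sequences, then walks the code points to
reject anything in the surrogate range.

```lua
-- Hypothetical helper; the name is illustrative and not part of any library.
-- Returns true if s is strictly valid UTF-8, otherwise nil plus the byte
-- position of the first offending sequence.
local function is_strict_utf8(s)
  -- utf8.len returns nil and a byte position when it hits a malformed or
  -- overlong sequence. In Lua 5.3 it still accepts encoded surrogates.
  local len, pos = utf8.len(s)
  if not len then
    return nil, pos
  end
  -- The string decodes, so iterating is safe; reject U+D800..U+DFFF.
  for p, c in utf8.codes(s) do
    if c >= 0xD800 and c <= 0xDFFF then
      return nil, p
    end
  end
  return true
end
```

With this, `is_strict_utf8("a\xED\xA0\x80")` fails at byte position 2,
because ED A0 80 is the encoding of the surrogate U+D800, while ordinary
text such as `"hello"` passes.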