- Subject: Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)
- From: Roberto Ierusalimschy <roberto@...>
- Date: Mon, 3 Jul 2017 15:40:38 -0300
> > Quoting https://tools.ietf.org/html/rfc3629#section-3 :
> >
> >> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
> >>
> >> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.
>
> This is on purpose for interoperability. This diversion from utf8 is
> even called out (twice) on the wikipedia article for utf8 as a common
> choice
> - https://en.wikipedia.org/wiki/UTF-8#WTF-8
> - https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
>
> Note that this does mean that if you need to validate a string is
> totally valid utf8 then you need to check for surrogate pairs.
> You can find an example of this in lua-http's websocket library:
> https://github.com/daurnimator/lua-http/blob/0b54603bfc132dcb9add76a61d3b50b4439031b2/http/websocket.lua#L89
We are considering adding a "strict" option to 'utf8.len' to enforce the
above rules. Note, however, that Lua does not decode the overlong
sequence C0-80 into U+0000, nor does it decode the pair ED-A1-8C
ED-BE-B4 into U+233B4.
-- Roberto
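
The points discussed in this thread can be checked with a minimal sketch, assuming Lua 5.3 or later (for the 'utf8' library and "\x" escapes). The helper name 'is_strict_utf8' is hypothetical and not part of Lua. It first lets 'utf8.len' reject malformed input such as overlong sequences, then looks for an encoded surrogate. In a string that has already passed 'utf8.len', the byte ED can only be a lead byte, so ED followed by A0-BF always encodes a code point in U+D800..U+DFFF.

```lua
-- Hypothetical helper: true only if s is well-formed UTF-8 with no
-- encoded surrogates (U+D800..U+DFFF), as RFC 3629 requires.
local function is_strict_utf8(s)
  -- utf8.len returns a false value on the first invalid byte sequence
  -- (for example the overlong form C0 80).
  if not utf8.len(s) then
    return false
  end
  -- Surrogates encode as ED A0..BF xx. Whether utf8.len accepts them
  -- depends on the Lua version, so check for them explicitly.
  return s:find("\xED[\xA0-\xBF]") == nil
end

print(utf8.len("\xC0\x80"))                            -- overlong: rejected
print(is_strict_utf8("\xED\xA1\x8C\xED\xBE\xB4"))      -- encoded surrogate pair
print(is_strict_utf8("\xE2\x82\xAC"))                  -- U+20AC, well-formed
```

The explicit byte pattern keeps the check independent of how a particular Lua version's 'utf8.len' treats surrogates.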