> > Quoting https://tools.ietf.org/html/rfc3629#section-3 :
> >
> >> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
> >>
> >> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.
> 
> This is on purpose, for interoperability. This divergence from UTF-8 is
> even called out (twice) in the Wikipedia article on UTF-8 as a common
> choice:
>   - https://en.wikipedia.org/wiki/UTF-8#WTF-8
>   - https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points
> 
> Note that this does mean that if you need to validate that a string is
> strictly valid UTF-8, then you need to check for surrogates yourself.
> You can find an example of this in lua-http's websocket library:
> https://github.com/daurnimator/lua-http/blob/0b54603bfc132dcb9add76a61d3b50b4439031b2/http/websocket.lua#L89

We are considering adding a "strict" option to 'utf8.len' to enforce the
above rules. Note, however, that Lua does not decode the overlong
sequence C0-80 into U+0000, nor does it decode the pair ED-A1-8C
ED-BE-B4 into U+233B4.
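
Until such an option exists, a strict check can be layered on top of
'utf8.len'. The following is only a minimal sketch, not the planned
implementation: the name 'is_strict_utf8' and the byte pattern are
illustrative assumptions. It relies on the fact that an encoded
surrogate (U+D800..U+DFFF) always starts with byte ED followed by a
byte in the range A0..BF.

```lua
-- Hypothetical helper (not part of the standard library).
-- Returns true only if 's' is well-formed UTF-8 with no encoded surrogates.
local function is_strict_utf8(s)
  -- utf8.len returns a false value on the first invalid byte sequence,
  -- which already covers overlong forms such as C0 80.
  if not utf8.len(s) then
    return false
  end
  -- In an otherwise well-formed string, ED followed by A0..BF can only
  -- be the start of an encoded surrogate (U+D800..U+DFFF).
  if s:find("\xED[\xA0-\xBF]") then
    return false
  end
  return true
end
```

With this, both "\xC0\x80" and "\xED\xA1\x8C\xED\xBE\xB4" are rejected,
while the proper four-byte encoding of U+233B4, "\xF0\xA3\x8E\xB4", is
accepted.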

-- Roberto