- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Jay Carlson <nop@...>
- Date: Sun, 30 Aug 2015 09:18:03 -0400
> On 2015-08-30, at 8:53 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>
> 2015-08-30 14:30 GMT+02:00 Soni L. <fakedme@gmail.com>:
>> LuaJIT recently added Lua 5.3's "\u{}" escapes. It's also more strict about
>> Unicode errors than Lua 5.3[1].
>>
>> For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
>>
>> Should Lua be more strict about Unicode errors?
Yes, if what you mean is UTF-8 errors.
> Why should it be invalid?
For the purposes of Lua, UTF-8 is defined in RFC 3629, an Internet Standard (STD 63):
https://www.rfc-editor.org/info/rfc3629
I’ll quote:
> The definition of UTF-8 prohibits encoding character numbers between
> U+D800 and U+DFFF, which are reserved for use with the UTF-16
> encoding form (as surrogate pairs) and do not directly represent
> characters. When encoding in UTF-8 from UTF-16 data, it is necessary
> to first decode the UTF-16 data to obtain character numbers, which
> are then encoded in UTF-8 as described above. This contrasts with
> CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
> use on the Internet.
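To make the quoted rule concrete, here is a rough sketch of what an encoder following RFC 3629 might look like. The names encode_utf8 and decode_surrogate_pair are hypothetical helpers of mine, not part of Lua or LuaJIT, and the code sticks to plain arithmetic so it should run on older Lua versions as well.

```lua
-- Hypothetical helper: combine a UTF-16 surrogate pair into a code point.
-- This is the "decode the UTF-16 data first" step the RFC describes.
local function decode_surrogate_pair(hi, lo)
  if hi < 0xD800 or hi > 0xDBFF or lo < 0xDC00 or lo > 0xDFFF then
    return nil, "not a surrogate pair"
  end
  return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
end

-- Hypothetical helper: encode one code point as UTF-8,
-- refusing the values RFC 3629 prohibits.
local function encode_utf8(cp)
  if cp < 0 or cp > 0x10FFFF then
    return nil, "out of range"
  end
  if cp >= 0xD800 and cp <= 0xDFFF then
    return nil, "surrogate"   -- reserved for UTF-16, never encoded directly
  end
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end
```

With this sketch, encode_utf8(0xD800) returns nil plus a message, while a pair such as 0xD83D, 0xDE00 is first combined by decode_surrogate_pair and only then encoded as four bytes. Encoding each half separately as three bytes would be CESU-8, not UTF-8.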
Going back to the Lua manual:
> This library provides basic support for UTF-8 encoding. It provides
> all its functions inside the table utf8. This library does not provide
> any support for Unicode other than the handling of the encoding.
> Any operation that needs the meaning of a character, such as
> character classification, is outside its scope.
The validity of “\u{d800}” is purely a question about the UTF-8 encoding. It does not depend on any other part of Unicode, so it falls within the scope the manual describes.
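As an illustration of that point, a strict validity check can be written against the byte-level rules of RFC 3629 alone, with no character database at all. This is only a sketch under my own assumptions (is_valid_utf8 is a hypothetical name, not the utf8 library's implementation), but it shows that rejecting surrogates is a one-line restriction on the byte after 0xED.

```lua
-- Hypothetical helper: returns true if s is well-formed UTF-8 per RFC 3629,
-- otherwise false plus the byte position of the offending sequence.
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    -- len: sequence length; lo..hi: allowed range for the second byte
    local len, lo, hi
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len, lo, hi = 2, 0x80, 0xBF
    elseif b == 0xE0 then len, lo, hi = 3, 0xA0, 0xBF   -- no overlong forms
    elseif b >= 0xE1 and b <= 0xEC then len, lo, hi = 3, 0x80, 0xBF
    elseif b == 0xED then len, lo, hi = 3, 0x80, 0x9F   -- no U+D800..U+DFFF
    elseif b == 0xEE or b == 0xEF then len, lo, hi = 3, 0x80, 0xBF
    elseif b == 0xF0 then len, lo, hi = 4, 0x90, 0xBF   -- no overlong forms
    elseif b >= 0xF1 and b <= 0xF3 then len, lo, hi = 4, 0x80, 0xBF
    elseif b == 0xF4 then len, lo, hi = 4, 0x80, 0x8F   -- nothing above U+10FFFF
    else return false, i end                            -- stray or invalid lead byte
    if i + len - 1 > n then return false, i end         -- truncated sequence
    for k = 1, len - 1 do
      local c = s:byte(i + k)
      if k == 1 then
        if c < lo or c > hi then return false, i end
      elseif c < 0x80 or c > 0xBF then
        return false, i
      end
    end
    i = i + len
  end
  return true
end
```

Under this check the three bytes ED A0 80 (what a lenient encoder produces for U+D800) are rejected, while ED 9F BF (U+D7FF) and EE 80 80 (U+E000) on either side of the surrogate range are accepted.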
Jay