> On 2015-08-30, at 8:53 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 
> 2015-08-30 14:30 GMT+02:00 Soni L. <fakedme@gmail.com>:
>> LuaJIT recently added Lua 5.3's "\u{}" escapes. It's also more strict about
>> Unicode errors than Lua 5.3[1].
>> 
>> For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
>> 
>> Should Lua be more strict about Unicode errors?

Yes, if what you mean is UTF-8 errors.

> Why should it be invalid?

For the purposes of Lua, UTF-8 is defined in RFC 3629, an Internet Standard (STD 63):

https://www.rfc-editor.org/info/rfc3629

I’ll quote:

>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>    to first decode the UTF-16 data to obtain character numbers, which
>    are then encoded in UTF-8 as described above.  This contrasts with
>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>    use on the Internet.
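
To make the rule concrete, here is a minimal sketch of an encoder that follows the RFC 3629 definition and refuses surrogates. The function name encode_utf8 is my own for illustration; it is not part of Lua or LuaJIT. It uses only arithmetic and string.char, so it should run on Lua 5.1 through 5.4 as well as LuaJIT.

```lua
-- Hypothetical helper (not a Lua library function): encode one code point
-- as UTF-8 following RFC 3629. Returns the encoded string, or nil plus an
-- error message when the code point may not be encoded.
local function encode_utf8(cp)
  if type(cp) ~= "number" or cp % 1 ~= 0 or cp < 0 or cp > 0x10FFFF then
    return nil, "code point out of range"
  end
  -- RFC 3629 prohibits encoding U+D800..U+DFFF (UTF-16 surrogates).
  if cp >= 0xD800 and cp <= 0xDFFF then
    return nil, "surrogate code points are prohibited in UTF-8"
  end
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end
```

With this sketch, encode_utf8(0xD7FF) and encode_utf8(0xE000) succeed, while encode_utf8(0xD800) returns nil and an error message.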


Going back to the Lua manual:

> This library provides basic support for UTF-8 encoding. It provides
> all its functions inside the table utf8. This library does not provide
> any support for Unicode other than the handling of the encoding.
> Any operation that needs the meaning of a character, such as
> character classification, is outside its scope.

The validity of “\u{d800}” is a question about the UTF-8 encoding itself, not about any other part of Unicode.
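
As a rough illustration that this is purely an encoding-level check, the sketch below finds an encoded surrogate by looking at bytes alone, with no knowledge of what any character means. The name find_encoded_surrogate is hypothetical, and the function assumes its input is otherwise well-formed UTF-8, where the byte 0xED can only appear as a lead byte.

```lua
-- Hypothetical helper (not part of the utf8 library): return the byte
-- position of the first encoded surrogate in s, or nil if there is none.
-- U+D800..U+DFFF would be encoded as 0xED followed by a byte in
-- 0xA0..0xBF, so inspecting two bytes is enough.
-- Assumes s is otherwise well-formed UTF-8.
local function find_encoded_surrogate(s)
  for i = 1, #s - 1 do
    local b1, b2 = s:byte(i, i + 1)
    if b1 == 0xED and b2 >= 0xA0 and b2 <= 0xBF then
      return i
    end
  end
  return nil
end
```

For example, the three bytes ED A0 80 (what a non-checking encoder would produce for U+D800) are flagged at position 1, while ED 9F BF (U+D7FF) is accepted.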

Jay