2015-08-30 15:18 GMT+02:00 Jay Carlson <nop@nop.com>:

> For the purposes of Lua, UTF-8 is defined in RFC 3629,
> an Internet Standard. (STD 63)
>
> https://www.rfc-editor.org/info/rfc3629  I’ll quote:
>
>>    The definition of UTF-8 prohibits encoding character numbers between
>>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>>    encoding form (as surrogate pairs) and do not directly represent
>>    characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>>    to first decode the UTF-16 data to obtain character numbers, which
>>    are then encoded in UTF-8 as described above.  This contrasts with
>>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>>    use on the Internet.
>
>
> Going back to the Lua manual:
>
>> This library provides basic support for UTF-8 encoding. It provides
>> all its functions inside the table utf8. This library does not provide
>> any support for Unicode other than the handling of the encoding.
>> Any operation that needs the meaning of a character, such as
>> character classification, is outside its scope.
>
> The validity of “\u{d800}” is not a matter of Unicode other than the
> encoding UTF-8.

I deduce that you mean "you can write '\u{d800}' but you shouldn't".

I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
should also be illegal. But then the 1:1 mapping from numbers less
than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
pair would fail.
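
To make that round trip concrete, here is a minimal sketch, assuming Lua 5.3 or later (the utf8 library does not exist before 5.3). The variable names are only for illustration. The fourth argument to utf8.codepoint is the "lax" flag that Lua 5.4 added; as far as I can tell, 5.3 simply ignores the extra argument and decodes surrogates anyway.

```lua
-- Round trip of a surrogate code point through utf8.char / utf8.codepoint.
-- Assumes Lua 5.3 or later.
local cp = 0xD800

-- utf8.char encodes the integer with the ordinary three-byte pattern.
local s = utf8.char(cp)
local b1, b2, b3 = s:byte(1, -1)   -- 0xED, 0xA0, 0x80

-- The "lax" flag (4th argument) exists only in Lua 5.4, where the default
-- decoder rejects surrogates; Lua 5.3 ignores the extra argument.
local back = utf8.codepoint(s, 1, 1, true)

print(#s, string.format("%02X %02X %02X", b1, b2, b3), back == cp)
```

The resulting string is not valid UTF-8 under RFC 3629, which is exactly the point under dispute: the library produces it, and can read it back, without complaint.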

The way I read the Lua manual, rejecting particular in-range
integers as arguments is precisely the kind of thing that is
declared to be outside the scope of the utf8 library.
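
Anyone who does want the RFC 3629 restriction can layer it on top. A rough sketch, assuming Lua 5.3 or later; strict_char is a hypothetical helper of my own naming, not something the utf8 library provides:

```lua
-- Hypothetical wrapper: refuses surrogate code points, then defers to
-- utf8.char for everything else. Not part of the standard library.
local function strict_char(...)
  for i = 1, select("#", ...) do
    local cp = select(i, ...)
    if cp >= 0xD800 and cp <= 0xDFFF then
      error(string.format("surrogate code point U+%04X", cp), 2)
    end
  end
  return utf8.char(...)
end

print(strict_char(0x41, 0x20AC))      -- ordinary code points pass through
print(pcall(strict_char, 0xD800))     -- false plus an error message
```

This keeps the policy decision in the caller's code and leaves utf8.char itself as a plain integer-to-bytes mapping.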