- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Jay Carlson <nop@...>
- Date: Wed, 2 Sep 2015 12:03:15 -0400
> On 2015-08-30, at 2:35 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>
> 2015-08-30 15:18 GMT+02:00 Jay Carlson <nop@nop.com>:
>
>> For the purposes of Lua, UTF-8 is defined in RFC 3629,
>> an Internet Standard. (STD 63)
>>
>> https://www.rfc-editor.org/info/rfc3629 I’ll quote:
>>
>>> The definition of UTF-8 prohibits encoding character numbers between
>>> U+D800 and U+DFFF, which are reserved for use with the UTF-16
>>> encoding form (as surrogate pairs) and do not directly represent
>>> characters. When encoding in UTF-8 from UTF-16 data, it is necessary
>>> to first decode the UTF-16 data to obtain character numbers, which
>>> are then encoded in UTF-8 as described above. This contrasts with
>>> CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>>> use on the Internet.
>>
>>
>> Going back to the Lua manual:
>>
>>> This library provides basic support for UTF-8 encoding. It provides
>>> all its functions inside the table utf8. This library does not provide
>>> any support for Unicode other than the handling of the encoding.
>>> Any operation that needs the meaning of a character, such as
>>> character classification, is outside its scope.
>>
>> The validity of “\u{d800}” is not a matter of Unicode other than the
>> encoding UTF-8.
>
> I deduce that you mean "you can write '\u{d800}' but you shouldn't".
It must produce undefined behavior, as there is no UTF-8 sequence corresponding to 0xD800. From general Lua philosophy, I would guess that it would provoke a syntax error, or contribute some unknown but bounded sequence of octets to the string. In other words, it would *probably* not provoke C's undefined behavior.
> I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
> should also be illegal.
It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return? I am an Errorist, so you know what I think it should do.
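To make "outside the domain" concrete, here is a minimal sketch of a strict encoder that follows the bit layout given in RFC 3629. It is written in Python rather than Lua so that it does not depend on what any particular Lua release's utf8 library does. The name encode_utf8_strict is made up for illustration, and raising ValueError is an assumption: one Errorist-friendly choice, not a description of how Lua behaves.

```python
def encode_utf8_strict(cp: int) -> bytes:
    """Encode one code point as UTF-8, refusing inputs RFC 3629 excludes.

    Hypothetical helper for illustration only; not part of Lua or of any
    real library.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:X} is outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and have no UTF-8 form.
        raise ValueError(f"U+{cp:04X} is a UTF-16 surrogate")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

With this sketch, encode_utf8_strict(0x20AC) yields the three octets E2 82 AC, while encode_utf8_strict(0xD800) raises instead of returning some octet sequence that no conforming decoder would accept.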
> But then the 1:1 mapping from numbers less
> than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
> pair would fail.
This is not a guarantee of UTF-8.
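A quick way to see which integers do round-trip is to ask an existing strict codec. The sketch below uses Python's built-in UTF-8 codec, which refuses surrogates by default; the helper names is_scalar_value, round_trips, and survey are invented for illustration. It suggests the 1:1 mapping holds on the range below 0x110000 minus the 0x800 surrogates, not on the whole range.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points that RFC 3629 allows UTF-8 to encode."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def round_trips(cp: int) -> bool:
    """True if cp survives encode-then-decode with Python's strict codec."""
    try:
        encoded = chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        # Python's default error handler rejects surrogates here.
        return False
    return ord(encoded.decode("utf-8")) == cp


def survey():
    """Return (number of round-tripping code points, list of mismatches).

    A mismatch is any integer below 0x110000 where the codec and the
    is_scalar_value predicate disagree.
    """
    count = 0
    mismatches = []
    for cp in range(0x110000):
        ok = round_trips(cp)
        count += ok
        if ok != is_scalar_value(cp):
            mismatches.append(cp)
    return count, mismatches


if __name__ == "__main__":
    count, mismatches = survey()
    print(hex(count), mismatches)
```

The loop visits about 1.1 million integers, so it takes a moment to run. The count it reports equals 0x110000 - 0x800, and the only integers that fail are the surrogates.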
>
> The way I read the Lua manual, disallowing particular in-range
> integers from being allowed as arguments is precisely the kind
> of thing that is declared to be outside the scope of the utf8 library.
>
The way I read the Lua manual, you should be able to understand Lua's approach to UTF-8 by just reading the RFC.
Jay