lua-users home
lua-l archive

> On 2015-08-30, at 2:35 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 
> 2015-08-30 15:18 GMT+02:00 Jay Carlson <nop@nop.com>:
> 
>> For the purposes of Lua, UTF-8 is defined in RFC 3629,
>> an Internet Standard. (STD 63)
>> 
>> https://www.rfc-editor.org/info/rfc3629  I’ll quote:
>> 
>>>   The definition of UTF-8 prohibits encoding character numbers between
>>>   U+D800 and U+DFFF, which are reserved for use with the UTF-16
>>>   encoding form (as surrogate pairs) and do not directly represent
>>>   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>>>   to first decode the UTF-16 data to obtain character numbers, which
>>>   are then encoded in UTF-8 as described above.  This contrasts with
>>>   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>>>   use on the Internet.
>> 
>> 
>> Going back to the Lua manual:
>> 
>>> This library provides basic support for UTF-8 encoding. It provides
>>> all its functions inside the table utf8. This library does not provide
>>> any support for Unicode other than the handling of the encoding.
>>> Any operation that needs the meaning of a character, such as
>>> character classification, is outside its scope.
>> 
>> The validity of “\u{d800}” is not a matter of Unicode other than the
>> encoding UTF-8.
> 
> I deduce that you mean "you can write '\u{d800}' but you shouldn't".

Its behavior has to be undefined, as there is no UTF-8 sequence corresponding to 0xD800. From general Lua philosophy, I would guess that it would either provoke a syntax error or contribute some unknown but bounded sequence of octets to the string. In other words, it would *probably* not provoke C's undefined behavior.

> I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
> should also be illegal.

It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return? I am an Errorist, so you know what I think it should do.
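
To make the Errorist position concrete, here is a minimal sketch of an encoder that follows RFC 3629 and refuses anything outside its domain. It assumes Lua 5.3 integers and bitwise operators; strict_utf8_char is a hypothetical name for illustration, not part of the utf8 library and not a claim about what utf8.char does.

```lua
-- Hypothetical helper (not part of Lua's utf8 library): encode one code
-- point as UTF-8 per RFC 3629, raising an error outside the valid domain.
local function strict_utf8_char(cp)
  if math.type(cp) ~= "integer" then
    error("code point must be an integer", 2)
  end
  if cp < 0 or cp > 0x10FFFF then
    error("code point out of range", 2)
  end
  -- RFC 3629 prohibits encoding U+D800..U+DFFF (UTF-16 surrogates).
  if cp >= 0xD800 and cp <= 0xDFFF then
    error("surrogate code point has no UTF-8 encoding", 2)
  end
  if cp < 0x80 then
    -- 1 octet: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 octets: 110xxxxx 10xxxxxx
    return string.char(0xC0 | (cp >> 6),
                       0x80 | (cp & 0x3F))
  elseif cp < 0x10000 then
    -- 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 | (cp >> 12),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  else
    -- 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 | (cp >> 18),
                       0x80 | ((cp >> 12) & 0x3F),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  end
end

-- Usage: a valid code point encodes; a surrogate is reported as an error.
local euro = strict_utf8_char(0x20AC)          -- the octets E2 82 AC
local ok, err = pcall(strict_utf8_char, 0xD800)
print(#euro, ok, err)
```

Raising an error rather than returning some octets is a design choice: the caller finds out at the point of the mistake instead of carrying an invalid string around.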

> But then the 1:1 mapping from numbers less
> than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
> pair would fail.

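UTF-8 does not guarantee this. Per the RFC text quoted above, U+D800 through U+DFFF have no UTF-8 encoding, so the mapping was never defined for every number less than 0x00110000.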

> 
> The way I read the Lua manual, disallowing particular in-range
> integers from being allowed as arguments is precisely the kind
> of thing that is declared to be outside the scope of the utf8 library.
> 

The way I read the Lua manual, you should be able to understand Lua's approach to UTF-8 by just reading the RFC.

Jay