On 2015-08-30, at 2:35 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 2015-08-30 15:18 GMT+02:00 Jay Carlson <nop@nop.com>:
>
>> For the purposes of Lua, UTF-8 is defined in RFC 3629,
>> an Internet Standard (STD 63).
>> https://www.rfc-editor.org/info/rfc3629
>>
>> I’ll quote:
>>
>>    The definition of UTF-8 prohibits encoding character numbers between
>>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>>    encoding form (as surrogate pairs) and do not directly represent
>>    characters. When encoding in UTF-8 from UTF-16 data, it is necessary
>>    to first decode the UTF-16 data to obtain character numbers, which
>>    are then encoded in UTF-8 as described above. This contrasts with
>>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>>    use on the Internet.
>>
>> Going back to the Lua manual:
>>
>>    This library provides basic support for UTF-8 encoding. It provides
>>    all its functions inside the table utf8. This library does not provide
>>    any support for Unicode other than the handling of the encoding.
>>    Any operation that needs the meaning of a character, such as
>>    character classification, is outside its scope.
>>
>> The validity of “\u{d800}” is not a matter of Unicode other than the
>> encoding UTF-8.
> I deduce that you mean "you can write '\u{d800}' but you shouldn't".
It must produce undefined behavior, as there is no UTF-8 sequence corresponding to 0xD800. From general Lua philosophy, I would guess that it would provoke a syntax error, or contribute some unknown but bounded sequence of octets to the string. In other words, it would *probably* not provoke C's undefined behavior.
> I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
> should also be illegal.
It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return? I am an Errorist, so you know what I think it should do.
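
A minimal sketch of what the Errorist position could look like, assuming Lua 5.3 or later (for math.type and the utf8 table). The name strict_utf8_char is hypothetical and not part of the standard library; it only wraps utf8.char with the domain check from RFC 3629:

```lua
-- Hypothetical helper, not part of Lua's utf8 library: encode one code
-- point, raising an error for anything RFC 3629 does not allow.
local function strict_utf8_char(cp)
  if math.type(cp) ~= "integer" then
    error("code point must be an integer", 2)
  end
  if cp < 0 or cp > 0x10FFFF then
    error(string.format("code point out of range: %d", cp), 2)
  end
  if cp >= 0xD800 and cp <= 0xDFFF then
    -- Surrogates are reserved for UTF-16 and have no UTF-8 encoding.
    error(string.format("surrogate U+%04X has no UTF-8 encoding", cp), 2)
  end
  return utf8.char(cp)
end

-- U+20AC (euro sign) encodes to three octets.
print(strict_utf8_char(0x20AC) == "\xE2\x82\xAC")  -- true

-- A surrogate is rejected instead of producing octets.
print(pcall(strict_utf8_char, 0xD800))
```

Whether the stock utf8.char should behave this way is exactly the point under discussion; the wrapper only shows that the check itself is cheap.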
> But then the 1:1 mapping from numbers less
> than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
> pair would fail.
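UTF-8 does not guarantee that. Its domain excludes U+D800 through U+DFFF, so the mapping was never 1:1 over every number below 0x00110000.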
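
For what it is worth, the round trip still holds on the domain RFC 3629 does define. A rough check, assuming Lua 5.3 or later; the names samples, round_trips and scalar_count are made up for this illustration:

```lua
-- Boundary code points on either side of the surrogate gap, plus the
-- edges of each UTF-8 length class.
local samples = {
  0x0, 0x7F, 0x80, 0x7FF, 0x800, 0xD7FF,
  0xE000, 0xFFFF, 0x10000, 0x10FFFF,
}

-- True when encoding and then decoding gives back the same code point.
local function round_trips(cp)
  return utf8.codepoint(utf8.char(cp)) == cp
end

for _, cp in ipairs(samples) do
  assert(round_trips(cp), string.format("U+%04X did not round-trip", cp))
end

-- Size of the domain: every number below 0x110000 except the surrogates.
local scalar_count = 0x110000 - (0xDFFF - 0xD800 + 1)
print(scalar_count)  -- 1112064
```

The surrogates are deliberately left out of samples: what utf8.codepoint does with their would-be encodings is not something RFC 3629 defines.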