
On 02/09/15 01:03 PM, Jay Carlson wrote:
On 2015-08-30, at 2:35 PM, Dirk Laurie wrote:

2015-08-30 15:18 GMT+02:00 Jay Carlson:

For the purposes of Lua, UTF-8 is defined in RFC 3629,
an Internet Standard (STD 63). I’ll quote:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.

Going back to the Lua manual:

This library provides basic support for UTF-8 encoding. It provides
all its functions inside the table utf8. This library does not provide
any support for Unicode other than the handling of the encoding.
Any operation that needs the meaning of a character, such as
character classification, is outside its scope.
The validity of “\u{d800}” is not a matter of Unicode other than the
encoding UTF-8.
I deduce that you mean "you can write '\u{d800}' but you shouldn't".
It must produce undefined behavior, as there is no UTF-8 sequence corresponding to 0xD800. From general Lua philosophy, I would guess that it would provoke a syntax error, or contribute some unknown but bounded sequence of octets to the string. In other words, it would *probably* not provoke C's undefined behavior.

I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
should also be illegal.
It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return? I am an Errorist, so you know what I think it should do.
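
As a rough illustration of the two positions above (not Lua's actual implementation), here is a minimal Python sketch of the RFC 3629 bit layout. The name `encode_utf8` and its `strict` flag are made up for this example: `strict=True` is the Errorist reading, while `strict=False` pushes a surrogate through the three-octet pattern anyway and so yields octets that are not valid UTF-8.

```python
def encode_utf8(cp: int, strict: bool = True) -> bytes:
    """Encode one code point using the RFC 3629 bit layout.

    Hypothetical helper, for illustration only. With strict=True,
    surrogates (U+D800..U+DFFF) are rejected as RFC 3629 requires.
    With strict=False they are encoded with the three-octet pattern,
    which produces bytes that are not valid UTF-8.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if strict and 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate U+{cp:04X} has no UTF-8 encoding")
    if cp < 0x80:        # 1 octet:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_utf8(0x20AC))                # b'\xe2\x82\xac'
print(encode_utf8(0xD800, strict=False))  # b'\xed\xa0\x80'
```

With the default `strict=True`, `encode_utf8(0xD800)` raises `ValueError` instead of returning anything, which is the behaviour being argued for here.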

But then the 1:1 mapping from numbers less
than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
pair would fail.
This is not a guarantee of UTF-8.
The way I see it, it's for allowing invalid UTF-16 to be translated to (invalid) UTF-8?

According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encoding should be treated as an invalid byte sequence.

***Whether an actual application should do this is debatable, as it makes it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired surrogate halves) in a UTF-8 string. This is necessary to store unchecked UTF-16 such as Windows filenames as UTF-8. It is also incompatible with CESU encoding (described below).***
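
To make the trade-off in that passage concrete, here is a small Python sketch (Python rather than Lua, so it does not depend on what a particular Lua version's utf8 library accepts). It uses Python's standard `surrogatepass` error handler; `to_cesu8` is a hypothetical helper written for this example, not a library function.

```python
def to_cesu8(text: str) -> bytes:
    """CESU-8 style encoding (hypothetical helper, illustration only).

    Code points above U+FFFF are first split into a UTF-16 surrogate
    pair, and each surrogate is then encoded as three octets.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            pair = chr(0xD800 + (v >> 10)) + chr(0xDC00 + (v & 0x3FF))
            out += pair.encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# An unpaired high surrogate, as can occur in unchecked UTF-16 data.
lone = "\ud800"

# A lenient encoder can store it, and a lenient decoder gets it back.
stored = lone.encode("utf-8", "surrogatepass")
assert stored == b"\xed\xa0\x80"
assert stored.decode("utf-8", "surrogatepass") == lone

# A strict RFC 3629 decoder rejects the same octets.
try:
    stored.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
assert strict_ok is False

# U+1F600: four octets in UTF-8, six octets in CESU-8.
assert "\U0001F600".encode("utf-8") == b"\xf0\x9f\x98\x80"
assert to_cesu8("\U0001F600") == b"\xed\xa0\xbd\xed\xb8\x80"
```

The strict decoder is what RFC 3629 asks for. The lenient pair is what you need if unpaired surrogates must survive a round trip.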
The way I read the Lua manual, disallowing particular in-range
integers as arguments is precisely the kind of thing that is
declared to be outside the scope of the utf8 library.

The way I read the Lua manual, you should be able to understand Lua's approach to UTF-8 by just reading the RFC.


Disclaimer: these emails are public and can be accessed from <TODO: get a non-DHCP IP and put it here>. If you do not agree with this, DO NOT REPLY.