Re: Should Lua be more strict about Unicode errors?

lua-l archive


Subject: Re: Should Lua be more strict about Unicode errors?
From: "Soni L." <fakedme@...>
Date: Wed, 2 Sep 2015 14:44:25 -0300



On 02/09/15 01:03 PM, Jay Carlson wrote:

On 2015-08-30, at 2:35 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:

2015-08-30 15:18 GMT+02:00 Jay Carlson <nop@nop.com>:

For the purposes of Lua, UTF-8 is defined in RFC 3629,
an Internet Standard. (STD 63)

https://www.rfc-editor.org/info/rfc3629  I’ll quote:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.
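
To make the quoted rule concrete, here is a rough sketch of an encoder that follows RFC 3629 as I read it. The names encode_strict and from_surrogates are hypothetical helpers written for this illustration; they are not part of Lua's utf8 library and say nothing about how that library actually behaves. A UTF-16 surrogate pair is first turned into one character number, and only that number is encoded; a lone surrogate is rejected.

```lua
-- Hypothetical helper: encode one character number as UTF-8,
-- refusing everything RFC 3629 excludes.
local function encode_strict(cp)
  if cp < 0 or cp > 0x10FFFF then
    error("character number out of range")
  end
  if cp >= 0xD800 and cp <= 0xDFFF then
    error("surrogates have no UTF-8 encoding")
  end
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- Hypothetical helper: decode a UTF-16 surrogate pair into the
-- character number it stands for (the step the RFC requires first).
local function from_surrogates(hi, lo)
  return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
end

-- The pair D83D DE00 becomes U+1F600, which encodes as four octets.
local s = encode_strict(from_surrogates(0xD83D, 0xDE00))
assert(#s == 4)

-- A lone surrogate is refused instead of being encoded.
assert(not pcall(encode_strict, 0xD800))
```

CESU-8, by contrast, would encode each half of the pair separately as three octets, which is exactly what the check on D800..DFFF forbids here.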


Going back to the Lua manual:

This library provides basic support for UTF-8 encoding. It provides
all its functions inside the table utf8. This library does not provide
any support for Unicode other than the handling of the encoding.
Any operation that needs the meaning of a character, such as
character classification, is outside its scope.

The validity of “\u{d800}” is not a matter of Unicode other than the
encoding UTF-8.

I deduce that you mean "you can write '\u{d800}' but you shouldn't".

It must produce undefined behavior, as there is no UTF-8 sequence corresponding to 0xD800. From general Lua philosophy, I would guess that it would provoke a syntax error, or contribute some unknown but bounded sequence of octets to the string. In other words, it would *probably* not provoke C's undefined behavior.

I hope you agree that if '\u{d800}' is illegal, then utf8.char(0xd800)
should also be illegal.

It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return? I am an Errorist, so you know what I think it should do.

But then the 1:1 mapping from numbers less
than 0x00110000 to strings provided by the utf8.char/utf8.codepoint
pair would fail.

This is not a guarantee of UTF-8.

The way I see it, it's for allowing invalid UTF-16 to be translated to (invalid) UTF-8?


https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

"

According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encoding should be treated as an invalid byte sequence.

***Whether an actual application should do this is debatable, as it makes it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired surrogate halves) in a UTF-8 string. This is necessary to store unchecked UTF-16 such as Windows filenames as UTF-8. It is also incompatible with CESU encoding (described below).***
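
As a rough illustration of what storing an unpaired surrogate would look like, the sketch below applies the ordinary three-octet UTF-8 bit layout to 0xD800 without the surrogate check. The names encode3_lax and decode3_lax are hypothetical and written only for this example; they are not Lua library functions, and the output is not valid UTF-8 under RFC 3629.

```lua
-- Hypothetical helper: three-octet UTF-8 layout for 0x800..0xFFFF,
-- deliberately WITHOUT rejecting surrogates (so not RFC 3629 compliant).
local function encode3_lax(cp)
  assert(cp >= 0x800 and cp <= 0xFFFF, "three-octet range only")
  return string.char(0xE0 + math.floor(cp / 0x1000),
                     0x80 + math.floor(cp / 0x40) % 0x40,
                     0x80 + cp % 0x40)
end

-- Hypothetical helper: inverse of encode3_lax for a well-formed
-- three-octet sequence (no validation of the input octets).
local function decode3_lax(s)
  local b1, b2, b3 = string.byte(s, 1, 3)
  return (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
end

-- The unpaired high surrogate 0xD800 comes out as the octets ED A0 80 ...
local octets = encode3_lax(0xD800)
assert(octets == string.char(0xED, 0xA0, 0x80))

-- ... and survives a round trip, which is the property at stake here.
assert(decode3_lax(octets) == 0xD800)
```

A strict decoder would have to reject ED A0 80, so the round trip only works as long as both ends agree to be lax.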

The way I read the Lua manual, disallowing particular in-range
integers from being allowed as arguments is precisely the kind
of thing that is declared to be outside the scope of the utf8 library.

The way I read the Lua manual, you should be able to understand Lua's approach to UTF-8 by just reading the RFC.

Jay


--
Disclaimer: these emails are public and can be accessed from <TODO: get a non-DHCP IP and put it here>. If you do not agree with this, DO NOT REPLY.

Follow-Ups:
- Re: Should Lua be more strict about Unicode errors?, Roberto Ierusalimschy

References:
- Re: Should Lua be more strict about Unicode errors?, Jay Carlson
