- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Dirk Laurie <dirk.laurie@...>
- Date: Wed, 2 Sep 2015 20:59:13 +0200
2015-09-02 18:03 GMT+02:00 Jay Carlson <nop@nop.com>:
> It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return?
> ("%X%X%X"):format(string.byte(utf8.char(0xd800),1,-1))
EDA080
This string is translated back to 55296 (i.e. 0xd800) by e.g. the
'utf8' pattern in the LPeg manual.
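To see where EDA080 comes from, here is a minimal sketch of the
three-byte UTF-8 arithmetic in plain Lua. The names encode3 and decode3
are made up for illustration and belong to no library; like utf8.char
above, they make no attempt to reject surrogates.

```lua
-- Hypothetical helpers, for illustration only.
-- Encode a code point in U+0800..U+FFFF as three UTF-8 bytes,
-- deliberately without any surrogate check.
local function encode3(cp)
  local b1 = 0xE0 + math.floor(cp / 0x1000)        -- 1110xxxx: top 4 bits
  local b2 = 0x80 + math.floor(cp / 0x40) % 0x40   -- 10xxxxxx: middle 6 bits
  local b3 = 0x80 + cp % 0x40                      -- 10xxxxxx: low 6 bits
  return string.char(b1, b2, b3)
end

-- Decode a three-byte sequence back to a code point, again unchecked.
local function decode3(s)
  local b1, b2, b3 = s:byte(1, 3)
  return (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
end

local s = encode3(0xD800)
print(("%02X%02X%02X"):format(s:byte(1, -1)))  -- EDA080
print(decode3(s))                               -- 55296
```

Nothing in the bit layout itself singles out the range D800..DFFF;
excluding it is a rule added on top of the encoding.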
> The way I read the Lua manual, you should be able to understand
> Lua's approach to UTF-8 by just reading the RFC.
On the contrary, the three-letter sequence "RFC" does not occur
in the manual. I estimate that no more than 1% of the people who
have read the Lua manual have also read RFC 3629. Quite a
few more have read the Wikipedia page, though, which says on
this topic:
~~~
According to the UTF-8 definition (RFC 3629) the high and low
surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal
Unicode values, and their UTF-8 encoding should be treated as an
invalid byte sequence.
Whether an actual application should do this is debatable, as it makes
it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
surrogate halves) in a UTF-8 string. This is necessary to store
unchecked UTF-16 such as Windows filenames as UTF-8. It is also
incompatible with CESU encoding (described below).
~~~
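An application that wants the RFC's behaviour can add the check
itself. What follows is only a sketch, assuming Lua 5.3 with the
standard utf8 library: strict_char and its allow_surrogates flag are
hypothetical names, not part of Lua.

```lua
-- Hypothetical wrapper around utf8.char (Lua 5.3+), for illustration only.
-- Rejects U+D800..U+DFFF unless the caller explicitly opts in, e.g. to
-- round-trip unchecked UTF-16 such as Windows filenames.
local function strict_char(cp, allow_surrogates)
  if not allow_surrogates and cp >= 0xD800 and cp <= 0xDFFF then
    return nil, ("U+%04X is a surrogate, not a Unicode scalar value"):format(cp)
  end
  return utf8.char(cp)
end

print(strict_char(0xD800))        -- nil plus an error message
print(#strict_char(0xD800, true)) -- 3 (the bytes ED A0 80)
```

Returning nil and a message instead of raising an error leaves the
strict-or-lenient decision with the caller.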