2015-09-02 18:03 GMT+02:00 Jay Carlson <nop@nop.com>:

> It should be more illegal. :-) 0xd800 is outside the domain of any function converting codepoints to UTF-8. What possible UTF-8 string can it return?

> ("%X%X%X"):format(string.byte(utf8.char(0xd800),1,-1))
EDA080

This string is translated back to 55296 (i.e. 0xd800) by e.g. the
'utf8' pattern in the LPeg manual.
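
The round trip can be checked without LPeg. The following is a minimal sketch, assuming Lua 5.3 or later (for the utf8 library and the integer bit operators); decode3 is a hypothetical helper written for this illustration, not the LPeg pattern itself. It decodes a three-byte sequence by hand and applies no surrogate check.

```lua
-- Encode U+D800 and dump the resulting bytes in hex.
local s = utf8.char(0xD800)
local hex = ("%02X%02X%02X"):format(s:byte(1, -1))
print(hex)          --> EDA080

-- Hypothetical helper: decode one three-byte UTF-8 sequence
-- (1110xxxx 10xxxxxx 10xxxxxx) with no validity checks at all.
local function decode3(str)
  local b1, b2, b3 = str:byte(1, 3)
  return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

print(decode3(s))   --> 55296
```

Any decoder that only undoes the bit packing, as this one does, will hand back 0xd800 unchanged.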

> The way I read the Lua manual, you should be able to understand
> Lua's approach to UTF-8 by just reading the RFC.

On the contrary, the three-letter sequence RFC does not occur
in the manual. I estimate that not more than 1% of people who
have read the Lua manual have also read RFC 3629. Quite a
few more have read the Wikipedia page, though, which says on
this topic:

~~~
According to the UTF-8 definition (RFC 3629) the high and low
surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal
Unicode values, and their UTF-8 encoding should be treated as an
invalid byte sequence.

Whether an actual application should do this is debatable, as it makes
it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
surrogate halves) in a UTF-8 string. This is necessary to store
unchecked UTF-16 such as Windows filenames as UTF-8. It is also
incompatible with CESU encoding (described below).
~~~
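
For an application that does want the strict behaviour described in the first quoted paragraph, a check is short. This is only a sketch, assuming Lua 5.3 or later and input that is otherwise well-formed UTF-8; has_surrogate is a hypothetical name. It relies on the fact that U+D800 through U+DFFF encode as the byte 0xED followed by a byte in the range 0xA0 to 0xBF.

```lua
-- Hypothetical helper: true if the string contains the UTF-8 encoding
-- of a surrogate (0xED followed by a byte in 0xA0..0xBF).
-- Assumes the rest of the string is well-formed UTF-8.
local function has_surrogate(str)
  return str:find("\xED[\xA0-\xBF]") ~= nil
end

print(has_surrogate(utf8.char(0xD800)))  --> true
print(has_surrogate(utf8.char(0xD7FF)))  --> false
print(has_surrogate(utf8.char(0xE000)))  --> false
```

Whether to reject such strings or pass them through is exactly the application-level choice the second quoted paragraph calls debatable.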