UTF-8 is a standard part of Unicode, defined in chapter 3 (Conformance) of the standard.
Unicode does not restrict which characters (code points) you can put in a string. If you are UTF-8 conforming, however, those code points can only be ones that have a scalar value. This means you cannot include unpaired surrogates, because surrogates have no scalar value (so an isolated surrogate cannot be encoded in any "valid" UTF, including UTF-16, UTF-32, BOCU, or even the Chinese GB18030 standard, which now has a stabilized definition that covers the whole UCS space).
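To make the "scalar value" point concrete, here is a small Python sketch. The helper names is_scalar_value and encodable_in are mine, purely for illustration, and it relies on CPython's built-in codecs refusing lone surrogates, which is what current Python 3 releases do. It says nothing about how Lua itself behaves.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value.

    Scalar values are the code points 0..0x10FFFF with the
    surrogate range 0xD800..0xDFFF removed.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodable_in(cp: int, encoding: str) -> bool:
    """True if Python's codec for `encoding` accepts the single code point cp."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for cp in (0x41, 0xD800, 0x1F600):
        results = {enc: encodable_in(cp, enc) for enc in ("utf-8", "utf-16", "utf-32")}
        print(f"U+{cp:04X} scalar={is_scalar_value(cp)} {results}")
```

A Python str can hold the lone surrogate U+D800 as a code point, but none of the three UTF codecs will encode it, which matches the distinction between "code point" and "scalar value" made above.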
The standard even specifies exactly which byte sequences are "conforming" in UTF-8, and these sequences explicitly exclude the surrogate range (surrogates are assigned code points, but they have no scalar value suitable for UTF-8). This is done so that ALL conforming UTF-8 text can be converted bijectively to any other conforming UTF, including UTF-16. Unpaired surrogates are unsupported.
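For reference, the byte-sequence rules can be written down quite compactly. The following is a sketch in Python of a checker that follows the table of well-formed UTF-8 byte sequences in chapter 3 of the standard, as I read it. The function name is_well_formed_utf8 is hypothetical, and this is an illustration only, not what Lua's utf8 library does.

```python
CONT = (0x80, 0xBF)  # ordinary continuation byte range


def trailing_ranges(lead: int):
    """Allowed ranges for the bytes following `lead`, or None if `lead` is invalid."""
    if 0xC2 <= lead <= 0xDF:
        return [CONT]
    if lead == 0xE0:
        return [(0xA0, 0xBF), CONT]        # excludes overlong 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [CONT, CONT]
    if lead == 0xED:
        return [(0x80, 0x9F), CONT]        # excludes surrogates U+D800..U+DFFF
    if lead == 0xF0:
        return [(0x90, 0xBF), CONT, CONT]  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return [CONT, CONT, CONT]
    if lead == 0xF4:
        return [(0x80, 0x8F), CONT, CONT]  # excludes values above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF never start a sequence


def is_well_formed_utf8(data: bytes) -> bool:
    """True if every byte of `data` belongs to a well-formed UTF-8 sequence."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                   # ASCII, single byte
            i += 1
            continue
        ranges = trailing_ranges(lead)
        if ranges is None or i + len(ranges) >= n:
            return False                   # bad lead byte or truncated sequence
        for k, (lo, hi) in enumerate(ranges, start=1):
            if not lo <= data[i + k] <= hi:
                return False
        i += 1 + len(ranges)
    return True
```

With this, b"\xed\xa0\x80" (what a naive encoder would emit for the lone surrogate U+D800) is rejected, while b"\xf0\x9f\x98\x80" (U+1F600, a surrogate pair in UTF-16) is accepted. The restricted second-byte range after the lead byte 0xED is the only place where surrogates are excluded.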

On Sun, 17 Mar 2019 at 23:58, Jay Carlson <nop@nop.com> wrote:
On 2019-03-17, at 4:57 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:

> Lua in no way even comes close to validating against the current UTF-8
> standard.

Do you mean UTF-8, or do you mean Unicode?

Jay