Re: Issues: Character 160 - Non-breaking space + Additional Issue with UTF-8

From: Alysson Cunha <alyssonrpg@ ... >
Date: Sat, 7 Jul 2018 13:58:16 -0300

> To be pedantic, the backwards compatibility is because of the utf-8
> encoding, not because of Unicode. And that was on purpose, not by
> miracle :)

The first 128 Unicode code points (the unsigned numbers, usually stored in 32 bits, that map to known characters), i.e. the ones that fit in a 7-bit mask, were also made ASCII-compatible. It is not just the UTF-8 encoding pattern.
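
A quick way to check both claims is the sketch below. It is in Python rather than Lua only for brevity, and the variable name nbsp_utf8 is just illustrative. It shows that code points 0 to 127 carry the same numbers as ASCII and that UTF-8 encodes each as that single byte, while character 160 from the subject line needs two bytes.

```python
# Code points 0..127 have the same numbers as ASCII, and UTF-8 encodes
# each of them as that one identical byte.
for cp in range(128):
    ch = chr(cp)
    assert ch.encode("ascii") == bytes([cp])
    assert ch.encode("utf-8") == bytes([cp])

# Character 160 (U+00A0, non-breaking space) does not fit in 7 bits,
# so its UTF-8 form is two bytes and is no longer plain ASCII.
nbsp_utf8 = chr(160).encode("utf-8")
print(nbsp_utf8)  # b'\xc2\xa0'
```

So both statements hold: the code point numbering and the UTF-8 byte layout each preserve ASCII, and they only diverge above 127.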

> A full unicode character database takes multiple megabytes[1]. That is
> dozens of times larger than the whole Lua interpreter is right now.

That's right, and I agree with you. We would not need the full Unicode data. Initially, only the Unicode whitespace characters would need to be supported, because whitespace is part of the Lua language but is represented slightly differently in Unicode.
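
To give a feeling for how small a whitespace-only table would be, here is a rough sketch (Python again, not Lua). The names UNICODE_WHITE_SPACE and is_unicode_space are hypothetical, and the ranges are my transcription of the code points carrying the Unicode White_Space property, so they should be checked against the current Unicode PropList data before anyone relies on them.

```python
# Assumed list of code points with the Unicode White_Space property,
# written as inclusive (first, last) ranges. Verify against PropList.txt.
UNICODE_WHITE_SPACE = (
    (0x0009, 0x000D),  # tab, line feed, vertical tab, form feed, carriage return
    (0x0020, 0x0020),  # space
    (0x0085, 0x0085),  # next line
    (0x00A0, 0x00A0),  # no-break space (character 160)
    (0x1680, 0x1680),  # ogham space mark
    (0x2000, 0x200A),  # en quad through hair space
    (0x2028, 0x2029),  # line separator, paragraph separator
    (0x202F, 0x202F),  # narrow no-break space
    (0x205F, 0x205F),  # medium mathematical space
    (0x3000, 0x3000),  # ideographic space
)

def is_unicode_space(cp: int) -> bool:
    """Return True if the code point falls in one of the ranges above."""
    return any(lo <= cp <= hi for lo, hi in UNICODE_WHITE_SPACE)

total = sum(hi - lo + 1 for lo, hi in UNICODE_WHITE_SPACE)
print(total)  # 25
```

Ten ranges covering 25 code points is a table of a few dozen bytes, nowhere near the multi-megabyte full database.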

On Sat, Jul 7, 2018 at 1:41 PM Hugo Musso Gualandi <hgualandi@inf.puc-rio.br> wrote:

> By miracle, if you do not use the "wrong" unicode characters, LUA
> accept it, because UNICODE was made to be backward compatible with
> ASCII till some point

To be pedantic, the backwards compatibility is because of the utf-8
encoding, not because of Unicode. And that was on purpose, not by
miracle :)

> Note: Using the public unicode character database it's easy to handle
> all white space characters of unicode.

A full unicode character database takes multiple megabytes[1]. That is
dozens of times larger than the whole Lua interpreter is right now.

You would need to trim down the database, which would mean either a
restrictive "whitelist" of allowed characters (for example, different
whitespace is allowed but not chinese characters) or an overly
permissive system (for example, all characters are allowed in
identifiers, including non-alphabetical ones). I'm not sure either of
these are better than the ASCII status quo.

[1] http://apps.icu-project.org/datacustom/