Re: Could Lua itself become UTF8-aware?

On Mon, May 1, 2017 at 12:08 PM, Ahmed Charles <acharles@outlook.com> wrote:

On 5/1/2017 6:00 AM, Roberto Ierusalimschy wrote:

> I want to be very clear that I am not against Unicode, quite the
> opposite. However, "accept everything above 127 as valid in identifiers"
> is quite different from supporting Unicode (again, it is quite the
> opposite). For data, the Lua approach is that it gives very basic
> support and leaves the rest to specialized libraries. Several programs
> have been using Lua with Unicode quite successfully (e.g., Lightroom and
> LuaTeX).
>
> However, the Lua lexer should not depend on libraries. It would be great
> if the lexer could handle Unicode (correctly!), but I don't know how to
> do it without at least doubling the size of Lua. To do it (very) broken,
> I prefer to keep it as it is.
>
> -- Roberto

Lua 5.3.4 seems to accept the following code when the source file is
encoded as UTF-8:

-- print á
print("á")

Looking at the manual, I can't tell whether this is intentional or not.
The program, when executed, does seem to function as expected. Can you
clarify if this program is intended to be valid Lua?
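
To make the observation concrete, here is a minimal sketch. It assumes a
stock PUC-Lua 5.3 or later build (one that does not enable non-ASCII
identifiers at compile time) and a source file saved as UTF-8. The variable
names are only illustrative. It checks that non-ASCII bytes are accepted in
comments and string literals, that the utf8 library sees one code point
where the length operator sees two bytes, and that the same character is
rejected in an identifier.

```lua
-- Assumption: stock PUC-Lua 5.3+, file saved as UTF-8.
local s = "á"                        -- U+00E1, encoded as bytes 0xC3 0xA1

local byte_len = #s                  -- length operator counts bytes
local char_len = utf8.len(s)         -- utf8 library counts code points
local code     = utf8.codepoint(s)   -- code point of the first character

-- Non-ASCII bytes inside a comment and a string literal compile fine.
local ok_chunk = load('-- print á\nreturn "á"')

-- The same character as an identifier is rejected by the default lexer,
-- so load returns nil plus an error message.
local bad_chunk, err = load("local á = 1 return á")

print(byte_len, char_len, code, ok_chunk ~= nil, bad_chunk == nil)
```

So the lexer appears to pass such bytes through untouched wherever it does
not need to classify them, which is all the comment and string cases need.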

If this code is intended to be valid Lua, then I'd think it has enough
Unicode support already. The only remaining support I could think of
would be for identifiers, but that would require implementing UAX31 [1]
or similar, which would require at least some Unicode mapping tables.
Lua currently doesn't include these tables and leaves this functionality
to external libraries. So, I fail to see how including them for
the significantly less useful feature of Unicode identifiers would be a
worthwhile trade-off.

Anecdotally:

Go [2] does something simpler than what UAX31 suggests: an identifier is
any Unicode letter followed by any number of Unicode letters or Unicode
digits, with the underscore counted as a letter.
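
A rough sketch of that letter-then-letters-or-digits rule in Lua 5.3 shows
where the cost comes from. The range tables below are hypothetical and
deliberately tiny (ASCII, part of Latin-1, and one non-ASCII digit block).
A real implementation would need the full Unicode category tables, which is
exactly the data Lua does not ship. The function and table names are mine,
not anything from Lua or Go.

```lua
-- Illustrative only: an incomplete stand-in for Unicode category tables.
local LETTER_RANGES = {
  {0x41, 0x5A}, {0x5F, 0x5F}, {0x61, 0x7A},  -- A-Z, underscore, a-z
  {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0xFF},  -- some Latin-1 letters
}
local DIGIT_RANGES = {
  {0x30, 0x39},                              -- 0-9
  {0x660, 0x669},                            -- Arabic-Indic digits
}

local function in_ranges(cp, ranges)
  for _, r in ipairs(ranges) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

-- True if s is a letter followed by letters or digits, per the tables above.
local function is_identifier(s)
  -- utf8.len returns nil for invalid UTF-8; utf8.codes would raise an error.
  if s == "" or not utf8.len(s) then return false end
  local first = true
  for _, cp in utf8.codes(s) do
    local letter = in_ranges(cp, LETTER_RANGES)
    local digit = in_ranges(cp, DIGIT_RANGES)
    if not letter and (first or not digit) then return false end
    first = false
  end
  return true
end

print(is_identifier("á1"), is_identifier("1á"), is_identifier("a-b"))
```

The logic is a few lines; the tables are the expensive part, and they would
have to live inside the lexer rather than in a library.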

Rust [3] seems to still be on the fence, though since minimalism isn't a
primary goal of the language and providing a larger set of Unicode
functionality would be beneficial, I expect a solution allowing
non-ASCII identifiers to be arrived at eventually.

[1]: http://unicode.org/reports/tr31/
[2]: https://golang.org/ref/spec
[3]: https://github.com/rust-lang/rust/issues/28979

Gé