Alysson, are you talking about UTF-8 in string literals? Well, fine; I believe Lua already supports that.

UTF-8 in variable names, or any UTF-8 sequence being valid throughout the whole program? No, please not.

My guess is you are severely underestimating the complexity of Unicode and think of it as ASCII with extras. It's not. There are so many (strange) features in Unicode that would go haywire when used in general programming. Unicode is designed as a typesetting tool, not as a programming tool.

For example, there are hair spaces and zero-width spaces: spaces that barely show or do not show at all. They would make understanding or debugging a program horrible. In typesetting, zero-width spaces exist to allow a line break at that point without adding a gap when the line does not break.

Then there are right-to-left control characters, for right-to-left languages.

Then there are homoglyphs. These would all be different variables: A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴
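
To make the homoglyph point concrete, here is a minimal sketch. It is written in Python purely for illustration (it is not Lua, and the name `describe` is made up for this example). It takes four of the look-alike letters and prints their code points and official Unicode names, showing that they are four different characters even though most fonts draw them the same.

```python
import unicodedata

# Four letters that render (nearly) identically in many fonts.
# Written as escapes so the code points are unambiguous.
LOOKALIKES = ["\u0041", "\u0391", "\u0410", "\U0001D400"]

def describe(chars):
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

for code_point, name in describe(LOOKALIKES):
    print(code_point, name)

# Every pair compares unequal, so as identifiers they would be distinct names.
distinct = len(set(LOOKALIKES)) == len(LOOKALIKES)
print("all distinct:", distinct)
```

A language that accepted all of these in identifiers would treat each one as a separate variable name.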

Then there are combining modifier characters. 

And I guess there are more features I haven't yet had the joy of encountering.

On Sat, Jul 7, 2018 at 4:54 PM, Alysson Cunha wrote:
I am raising the question: should the future Lua 5.5 have Unicode support? A lot (a lot, a lot) of encoding issues were solved with Unicode, and we are observing an international trend toward the use of UTF-8.

In my opinion, Unicode is the future (actually, Unicode has already been the present for years), and ASCII was developed in the 1960s. Today, it is an old and very limited character encoding.

I would love to see Lua keep up to date.

On Sat, Jul 7, 2018 at 11:16 AM Hugo Musso Gualandi wrote:
On Sat, 2018-07-07 at 09:44 -0300, Alysson Cunha wrote:
> Issue #1) ---- Character 160
> Lua 5.3 does not recognize character 160 / 0xA0 (the no-break
> space) as whitespace inside code.

The slippery slope of Unicode-supporting language syntax is that once
you allow some non-ASCII characters there is a temptation to allow all
of them (for instance there are many other whitespace characters[1] you
did not mention). This can be confusing (there are different characters
that look similar to each other) and also problematic to implement (the
large character-class tables would bloat the interpreter).


> When pasting text from some browsers/text editors, the following
> text comes into my code:
> "function(stream, contentType)"
> The space that separates "stream," and "contentType" is a character
> 160, not a character 32.

This sounds like an issue with the browser / text editor.
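
If the editor cannot be fixed, one workaround is to scan pasted source for the offending character before handing it to Lua. The following is only a sketch under that assumption, written in Python for illustration; `find_nbsp` is a hypothetical helper, not part of Lua or any library. It reports positions rather than silently replacing the character, since a no-break space inside a string literal may be intentional.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, character 160 / 0xA0

def find_nbsp(source):
    """Return 1-based (line, column) pairs for every NBSP in the source text."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((line_no, col))
    return hits

# The snippet from the report, with an NBSP after the comma.
pasted = "function(stream,\u00a0contentType)"
print(find_nbsp(pasted))  # [(1, 17)]
```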

> Issue #2) ----- UTF-8
> The same character #160 when encoded as UTF-8 becomes the 2 bytes
> 0xC2 0xA0.
> The 0xC2 character in the ISO 8859-1 (Latin-1) encoding is the
> character "Â". What?

That is just how UTF-8 and Latin-1 work. UTF-8 text can look all messed
up and full of "é" and so on if your misconfigured software mistakenly
tries to interpret it as Latin-1. (Like the web servers
do all the time. Aaargh!)
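
The effect is easy to reproduce. This is a minimal sketch in Python (chosen only for illustration, since the behaviour belongs to the encodings and not to any one language): it encodes a no-break space and an "é" as UTF-8 and then decodes the same bytes both ways.

```python
text = "\u00a0\u00e9"  # NO-BREAK SPACE followed by "é"

# UTF-8 needs two bytes for each of these characters.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))  # c2 a0 c3 a9

# Misreading those bytes as Latin-1 turns every byte into its own character.
wrong = utf8_bytes.decode("latin-1")
# Reading them as UTF-8 gives back the original two characters.
right = utf8_bytes.decode("utf-8")

print(len(wrong), len(right))  # 4 2
```

Decoded as Latin-1, the bytes come out as "Â", a no-break space, "Ã" and "©", which is exactly the "Â" and "é" garbage described above.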

Make sure that you configure things to properly display text as
UTF-8. For example, if you are making a webpage, make sure you add a
<meta charset="utf-8"> near the top of the HTML.

> In my app, I strongly advise users to encode their .lua files as
> UTF-8, because all of my system functions expect UTF-8 encoding for
> string parameters.

Current versions of Lua are perfectly content with non-ASCII
UTF-8-encoded characters, as long as they only appear inside strings or
comments. Non-ASCII characters in other parts of the program result in
syntax errors, as you found out.

> UTF-8 is a growing trend for internationalization, and lua_load
> should have a parameter that forces the engine to handle the Lua
> script content as UTF-8 encoded. Another solution is to create a
> lua_loadutf8 function.

This would effectively fork Lua into two versions of the language -- an
ASCII-only one and a fully Unicode-aware one. I'm not sure the
compatibility headache from that would be worth the hassle. IMO, if Lua
is ever to allow Unicode syntax it should be part of the default
language and not require a separate "load" function.

> or... lua_loadutf16 (since UTF-16 is used as standard in many
> programming languages)

UTF-16 is an awful standard, and is inferior to UTF-8 in pretty much
every way. Unfortunately we will need to live with it for a long time
due to how it is entrenched in Windows, Java and JavaScript.

Alysson Cunha / AlyssonRPG - Play the traditional tabletop RPG online