On 1 May 2017 at 15:05, Dirk Laurie <> wrote:
> 2017-04-30 23:11 GMT+02:00 Sean Conner <>:
>>   There was a long discussion about that a few years ago:
>>   It appears the consensus then was "maybe not a good idea."
>>   -spc
> I did not read that before posting, and Sean is careful not to imply that
> old issues are dead and buried.
> But UTF-8 has come closer to universal acceptance in the last three
> years.  What was "maybe not a good idea" back then might have
> become "maybe not a bad idea" now.
> -- Dirk

I don't think UTF-8's level of acceptance has changed at all in the last
~5-10 years: it has always been widely accepted except where code has to
interoperate with Windows native APIs. Around 2005-2006 was when I
started hearing C# and Java devs wish they had UTF-8 instead.

However, to reply to the issue at hand: are Unicode classes wanted?
i.e. should a Unicode space such as U+2001 count as whitespace for
token separation?
Furthermore, what should be considered valid characters for identifiers?
I guess we still want the rule "alpha followed by any number of alphanumeric"?
Which version of the Unicode standard do we want to pick? (You did
realise Unicode gets updated... right?)
We'd need a strategy to deal with updates (which rarely go well: see
how people are still dealing with fallout from IDNA2003 => IDNA2008).
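
To make these questions concrete, here is a minimal sketch in Python (not
Lua, simply because Python's standard library ships the Unicode character
database). The function is_identifier_naive is a hypothetical name for one
possible reading of "alpha followed by any number of alphanumeric"; it is
not a proposal for Lua's lexer, and which characters it accepts depends on
the Unicode version bundled with the interpreter.

```python
import unicodedata

EM_QUAD = "\u2001"

# U+2001 has general category Zs (space separator), so a lexer that
# follows Unicode classes would treat it as whitespace.
em_quad_category = unicodedata.category(EM_QUAD)

# Python's Unicode-aware check accepts it; an ASCII-only rule does not.
is_unicode_space = EM_QUAD.isspace()
is_ascii_space = EM_QUAD in " \t\n\r\f\v"


def is_identifier_naive(name):
    """Hypothetical rule: a letter or '_' followed by letters, digits or '_',
    where 'letter' and 'digit' are taken from Unicode classes."""
    if not name:
        return False
    if not (name[0].isalpha() or name[0] == "_"):
        return False
    return all(c.isalnum() or c == "_" for c in name[1:])


# The answers above are only as stable as the bundled Unicode data:
# this string differs between interpreter releases.
unicode_version = unicodedata.unidata_version
```

With this rule "\u00C5ngstr\u00F6m1" is accepted and "1abc" is rejected, but
the result for any given character can change whenever the underlying
Unicode data is updated, which is exactly the versioning problem.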

Which brings us to the next problem: normalisation of identifiers. It
would seem perplexing to many that the identifiers U+00C5 and U+0041
U+030A, which render identically, would refer to different variables.
Even if you think normalisation should not occur (as I do), you'll at
least have an easy mechanism for writing obfuscated code.
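
A small sketch of the normalisation issue, again in Python and only as an
illustration: raw_table and nfc_table are hypothetical stand-ins for a
symbol table keyed on the identifier exactly as written versus on its NFC
form.

```python
import unicodedata

precomposed = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# The two spellings look the same on screen but are different strings.
same_raw = precomposed == decomposed

# After NFC normalisation both collapse to the single code point U+00C5.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

# Without normalisation: two distinct variables.
raw_table = {precomposed: 1, decomposed: 2}

# With normalisation: the second assignment overwrites the first.
nfc_table = {}
for name, value in ((precomposed, 1), (decomposed, 2)):
    nfc_table[unicodedata.normalize("NFC", name)] = value
```

Here same_raw is False and same_nfc is True, so raw_table ends up with two
entries while nfc_table has one. Either behaviour surprises somebody: the
first lets visually identical names differ, the second silently merges
names the author typed differently.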