
  On Mon, 4 Nov 2019 at 10:51, bil til <flyer31@googlemail.com> wrote:
> but I would like to include Unicode ONLY for variable names (and of course
> for string contents, but therefore it is included already in lua). As I
> understand it, this usually would NOT touch the basic lua texting, nor the
> libraries if I understand this correctly.
>

Stop this discussion. The behavior needed for identifiers in Unicode is completely documented:

  https://unicode.org/reports/tr31/

This is used for example in JavaScript and SQL identifiers (note that optional characters in table 3A are generally not great for programming languages, but may be used for example in CSS identifiers, with the exception of the ASCII apostrophe-quote).
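
To make the scope of the question concrete, here is a minimal Lua sketch of a deliberately permissive identifier check. It is not the TR31 rule set: real conformance needs the XID_Start/XID_Continue property tables, which stock Lua does not ship. The function name is hypothetical, and accepting every byte >= 0x80 is an assumption made purely for illustration.

```lua
-- Permissive approximation only: NOT the TR31 XID_Start/XID_Continue rules.
-- Every byte >= 0x80 is accepted as an identifier character, so any
-- non-ASCII UTF-8 sequence passes, including ones TR31 would reject.
-- Reserved words are not checked either.
local function is_permissive_identifier(s)
  -- First byte: ASCII letter, underscore, or any non-ASCII byte.
  -- Remaining bytes: the same set plus ASCII digits.
  return s:find("^[A-Za-z_\128-\255][A-Za-z0-9_\128-\255]*$") ~= nil
end

print(is_permissive_identifier("caf\195\169"))  -- "café" in UTF-8: accepted
print(is_permissive_identifier("2fast"))        -- leading digit: rejected
```

A check like this says nothing about confusable characters or normalization, which is exactly what the two reports address.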

If you want more restrictions (to avoid more confusable characters), there is also the specification for labels in domain names:

  http://www.unicode.org/reports/tr46/

> (I assume if in a library a variable name is used in string form, it is just
> a zero-terminated string, but this keeps the same if UTF8 is allowed).

Zero-terminated strings are internal artefacts of legacy C/C++ strings. They may be used internally in implementations, but Lua strings don't have this limitation and use an explicit length property. Anyway, null termination is not a problem here: null bytes should not be part of any valid identifier and should never be present in any Lua script source, in any common encoding (all based on ASCII with extensions). Even if the source is in UTF-16 (BE/LE), the null bytes are not used in isolation but always paired with a non-null byte, so they will be invisible to the parser from the start, because of the charset-decoder layer.
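
A short illustration of the explicit-length point, as a sketch in plain Lua (the variable name is only for the example): a string value can hold a null byte in the middle, and its length still counts every byte.

```lua
-- A Lua string with an embedded null byte: the length is stored
-- explicitly, so the zero does not terminate the string.
local s = "a\0b"
print(#s)             -- 3
print(s:byte(1, -1))  -- the three byte values: 97, 0, 98
```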

> (sometimes you would possibly use tolower or toupper with such variable
> names, but this tolower and toupper then of course will operate only on the
> ASCII chars, these 2 functions leave the non-ascii bytes all untouched
> (Unicode-UTF8-Charpoints only have bytes in the range 0x80...0xFF, those
> bytes are NOT touched by toupper / tolower)).

(But note that in IDN, labels are not case-sensitive, and case conversion can create a few more confusable pairs that would not cause problems in Lua, whose identifiers are case-sensitive. Anyway, case conversions are more complex in general; see dotted/dotless letters, sharp s and Latin ligatures, Greek sigma and iota.)

However, the toupper/tolower functions MAY change non-ASCII bytes even in C/C++, depending on the locale settings used and the implementation of the C/C++ library (and possibly the support for locale data in the underlying OS). In Lua, however, the similar functions from the standard Lua string library should be independent of the encoding, and such locale-sensitive behavior should also be avoided in Lua source files. Anyway, I repeat, Lua identifiers are case-sensitive, so this complexity of case mappings is avoided from the start.
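
For what it is worth, the reference manual describes string.upper and string.lower as depending on the current locale, so the safe way to observe the ASCII-only behavior is to pin the "C" locale first. The sketch below does that; the sample word is only an illustration, written with decimal escapes so that the source file itself stays pure ASCII.

```lua
-- Pin the "C" locale so that only ASCII letters are case-mapped.
os.setlocale("C")

local word = "caf\195\169"        -- "café" encoded as UTF-8
local upper = word:upper()

-- In the "C" locale the two bytes of "é" (0xC3 0xA9) are left untouched.
print(upper == "CAF\195\169")     -- true in the "C" locale
```

Under another locale with a single-byte charset, the same call may rewrite bytes >= 0x80, which would corrupt UTF-8 text.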

You'll be more frequently affected by the case sensitivity of filenames in some filesystems, if you assume that Lua identifiers may map somewhere to filenames (e.g. in script loaders for libraries, when these files are in a case-insensitive filesystem like FAT16/32/exFAT or NTFS). These filesystems all have specific casing rules for each mounted volume. NTFS volumes contain their own case-mapping file, created when the volume is formatted, then made invisible and write-protected: these rules are static and independent of the Unicode version used later. Such a mapping works as if the volumes contained many aliases with hard links, and additional aliases are created for compatibility with CP/M and DOS-like 8.3 names, also used in FAT32/exFAT volumes, which are formatted without a static case-mapping file but with an internal case mapping implemented by the filesystem driver of the OS mounting these volumes.

This is not a problem of Lua itself but of development management, if you ever use distinct script names like "sample.lua" and "Sample.lua" in the same parent path, or if they collide in a path search for a runtime import of "sample". (Here you have other, more important things to consider to avoid collisions and security risks in the built-in dynamic resource loaders, if they are used at runtime inside your application; the same kind of risks also exist with shared library loaders, search paths for commands in shell scripts...)
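
As a rough development-time check for this kind of collision, the sketch below groups script names that differ only in letter case. The helper name and the sample file list are hypothetical, and string.lower only approximates what a real filesystem does, since each filesystem applies its own case-mapping table.

```lua
-- Hypothetical helper: group names that become identical after lowercasing.
-- string.lower is only an approximation of a filesystem's own case rules.
local function find_case_collisions(names)
  local groups = {}
  for _, name in ipairs(names) do
    local key = name:lower()
    groups[key] = groups[key] or {}
    table.insert(groups[key], name)
  end
  -- Keep only the keys shared by more than one original name.
  local collisions = {}
  for key, group in pairs(groups) do
    if #group > 1 then collisions[key] = group end
  end
  return collisions
end

local found = find_case_collisions({ "sample.lua", "Sample.lua", "other.lua" })
for key, group in pairs(found) do
  print(key, table.concat(group, ", "))
end
```

Running something like this over the directories of a module search path before deployment would flag "sample.lua" and "Sample.lua" as a pair that a case-insensitive volume cannot keep apart.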