lua-users home
lua-l archive


Not really: Unicode's TR31 is clear and gives examples, notably for IDNA. There are recommended scripts, and others that may be dropped (there used to be a category of "aspirational scripts", but they have not been moved to the same group as "limited use scripts" in all applications).
You may want to revisit the rules when IDNA is updated. But the list of codepoints allowed in IDNA is very stable and does not extend to many of the new additions (notably not any of the emoji characters and their complex sequences, and not any of the invisible controls like joiners and non-joiners, which are used in very limited cases for languages like Mongolian or Lao).
We can perfectly well write a valid program in Lua using only the subset described in the RFC for IDNA. And of course programs should use a normalized form (preferably NFC), and normalization is guaranteed to be stable in Unicode: you can upgrade your Unicode library (such as ICU) whenever you want, and it will NOT affect the result of normalization.
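As an illustration of why normalization matters for source code (a sketch in Python, whose stdlib `unicodedata` exposes the same standard normalization forms; the identifier `café` is just a hypothetical example):

```python
import unicodedata

# The same hypothetical identifier spelled two ways by two editors:
composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The raw code point sequences differ, so a naive lexer would see
# two distinct identifiers:
assert composed != decomposed

# After NFC normalization both spellings compare equal, and Unicode's
# normalization stability policy guarantees this result will not change
# when the underlying Unicode library is upgraded:
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```

This is why a parser that normalizes identifiers to NFC before comparing them stays stable across Unicode versions.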
Any decent program using Unicode should have an implementation of the standard Unicode normalization, and this is not a lot of code. Libraries like ICU are now stable and have been heavily optimized for performance, and I think that ICU4C, its implementation in C, is easy to integrate into a Lua parser without having to integrate all the features built into ICU; this has been done in most major OSes, which should all have native normalization support. It's just a shame that normalization is still not part of the modern C standards. The ICU normalization has even been prebuilt as a separate package from ICU, so that you can still use other internationalization libraries without depending on CLDR data.
So take TR31 as the base; it is very clear about what is really needed and which codepoints you can drop from support. (HTML and JavaScript/ECMAScript were much more relaxed and allow fairly arbitrary codepoints, restricting only a few in ASCII plus the very stable set of noncharacters, which is now frozen.)
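For a rough feel of what a TR31-based lexer accepts, Python's `str.isidentifier()` happens to implement the default identifier syntax of UAX #31 (XID_Start / XID_Continue), so it can serve as a quick stand-in (the sample identifiers are arbitrary):

```python
# Letters from recommended scripts are accepted:
assert "caf\u00e9".isidentifier()                  # Latin with 'é'
assert "\u043f\u0443\u0442\u044c".isidentifier()   # Cyrillic "путь"

# Punctuation, leading digits, and default-excluded invisible
# controls such as ZERO WIDTH JOINER are rejected:
assert not "a-b".isidentifier()        # '-' is not ID_Continue
assert not "1abc".isidentifier()       # cannot start with a digit
assert not "a\u200db".isidentifier()   # ZWJ (U+200D) excluded by default
```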
Note that it is still impossible to avoid all "confusable" characters, even in pure ASCII (e.g. "rn" vs. "m"; "0" vs. "o" or "O"; "1" vs. "l"), in ISO 8859-1 alone (e.g. "ij" vs. "ÿ"), or across common scripts ("o" in Latin, Greek, Cyrillic and in many Indic scripts, Hebrew, Arabic, as well as in ideographic forms; Latin "P" vs. Cyrillic ER vs. Greek RHO; Greek BETA vs. German sharp S). In my opinion it's not up to the language (in its lexical analysis) to make such checks. Even mixed scripts have valid uses (Unicode TR31 gives examples, which are also valid in IDN for brand or project names, such as translating "XML-file" into Russian as a single word where "XML" remains unchanged in Latin while the rest is in Cyrillic; see also the many brand names used in China and Japan that mix embedded Latin with suffixes commonly written in kanas or ideographs...)
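To make the cross-script point concrete (a Python sketch): normalization deliberately does not merge confusables, since they are distinct characters from distinct scripts.

```python
import unicodedata

latin_o = "o"          # U+006F LATIN SMALL LETTER O
cyrillic_o = "\u043e"  # U+043E CYRILLIC SMALL LETTER O

# Visually identical in most fonts, but distinct code points, and NFC
# (correctly) keeps them distinct -- so confusability survives any
# normalization the lexer could apply:
assert latin_o != cyrillic_o
assert unicodedata.normalize("NFC", latin_o) != unicodedata.normalize("NFC", cyrillic_o)
assert unicodedata.name(cyrillic_o) == "CYRILLIC SMALL LETTER O"
```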
If you have confusable identifiers, it's up to the editor to provide an alternate view (e.g. by showing their "Punycode" conversion to ASCII to exhibit the difference): an IDE can do that on the fly, and a linter can detect such confusable uses.
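A sketch of that alternate view in Python, using the stdlib "idna" codec (the spoofed "paypal" string is just an illustrative example):

```python
# "paypal" spelled with two CYRILLIC SMALL LETTER A (U+0430) vs. pure ASCII:
spoofed = "p\u0430yp\u0430l"
genuine = "paypal"

# The IDNA/Punycode ASCII form makes the difference visible at a glance,
# which is exactly what an IDE or linter could surface to the user:
assert genuine.encode("idna") == b"paypal"           # ASCII passes through
assert spoofed.encode("idna").startswith(b"xn--")    # spoof gets the ACE prefix
assert spoofed.encode("idna") != genuine.encode("idna")
```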

On Wed, Nov 6, 2019 at 15:43, bil til <> wrote:
Hi Marcus,
thank you for the clarification, but yes, this is clear to me.

Just to avoid making things too complicated for Lua, I would in
principle leave all this "validation checking" to the responsibility of
the programmer / "Lua end user".

So the programmer should write his Lua program in some nice UTF-8 editor
which includes some sort of "UTF-8 validator / checker"... such a
validator/checker should then somehow mark all "strange/ambiguous UTF characters"...

I think such a validation / checking process requires huge tables and needs
to be updated continuously, as new Unicode characters also seem to be added
quite "continuously", to my impression...
