I think there are several issues being discussed here simultaneously, so it
might be helpful to clarify what is being proposed.

Firstly, there is the internal format Lua should use to encode Unicode
strings. In most programming languages this could be any format; however,
since Lua exposes its internals, there would be advantages if it were a
standard format. There are plenty of standards in active use, each with its
own tradeoffs, and these should all be considered. Since Lua allows embedded
nulls in strings, it can support UTF-16/UTF-32 if needed as well as UTF-8.
The changes required to support Unicode would not be huge: mainly ensuring
that string length is properly calculated and that characters are properly
iterated over (since there is no longer a one-to-one correspondence between
octets and characters). An important consideration is whether all strings
are to be Unicode or whether a new Unicode type is to be added (as is done
in Python).
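
To illustrate the length and iteration point, here is a minimal sketch in
Python, assuming UTF-8 is the internal format and that the input is already
well formed. The function names utf8_char_count and utf8_chars are
hypothetical and are not part of Lua or of any proposal in this thread.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in well-formed UTF-8 (assumed valid input).

    Every character has exactly one lead byte; continuation bytes
    have the bit pattern 10xxxxxx and are skipped.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_chars(data: bytes):
    """Yield the byte sequence of each character in well-formed UTF-8."""
    i = 0
    while i < len(data):
        j = i + 1
        # Extend the slice over any continuation bytes that follow.
        while j < len(data) and (data[j] & 0xC0) == 0x80:
            j += 1
        yield data[i:j]
        i = j


sample = "naïve €".encode("utf-8")
print(len(sample))              # octet count
print(utf8_char_count(sample))  # character count
```

The octet count and the character count differ for this sample, which is
exactly the case a byte-oriented length function gets wrong.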

Then there is the input and output of Unicode characters. This would require
changes to the I/O libraries to support the various encodings and convert
them to/from the Lua internal format. An important consideration is that, in
every encoding, some byte patterns in a file do not represent a valid Unicode
character. It is strongly recommended that any such byte pattern be treated
as an error, with no attempt to work around it. It is essential that such
byte patterns never reach the internal encoding, since this opens several
security issues. For this reason it is sensible to use an existing parser
which is well tested against such issues. This would also allow all the
commonly used Unicode encodings to be supported.
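
As a sketch of the "treat invalid input as an error" policy, the following
Python uses the language's built-in strict decoder in place of a hand-written
parser. The helper names decode_strict and is_valid are hypothetical. The
overlong sequence C0 AF is a well-known example of the security problem: it
is an ill-formed encoding of "/" that a lenient decoder might accept after a
byte-level check for 0x2F has already passed.

```python
def decode_strict(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw input, raising UnicodeDecodeError on any ill-formed sequence."""
    return raw.decode(encoding, errors="strict")


def is_valid(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True only if every byte pattern in raw is valid in the encoding."""
    try:
        decode_strict(raw, encoding)
        return True
    except UnicodeDecodeError:
        return False


print(is_valid(b"caf\xc3\xa9"))   # well-formed UTF-8
print(is_valid(b"\xc0\xaf"))      # overlong encoding of "/"
print(is_valid(b"\xed\xa0\x80"))  # surrogate code point encoded as UTF-8
```

The first call reports valid input; the other two are rejected rather than
repaired.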

Also, there is the representation of Unicode character literals in Lua
programs. Most, if not all, languages have done this with an escape code
system (for example \uxxxx and \Uxxxxxxxx in Python[1]) rather than by
accepting UTF-8 or other Unicode input files. This technique has the
advantage of keeping the parser small and allows any text editor to be used
to edit Lua programs. Here again, checks must be made to ensure that no
invalid Unicode characters can be stored.
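
A rough sketch of how such escapes might be expanded and checked, written in
Python and assuming a UTF-8 internal format. The function name
expand_unicode_escapes is hypothetical, and the sketch is simplified: it does
not handle an escaped backslash in front of the "u", which a real lexer would
have to.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(literal: str) -> bytes:
    """Expand Unicode escapes in an ASCII literal and return UTF-8 bytes.

    Raises ValueError for code points that are not valid Unicode
    characters: surrogates (D800..DFFF) and values above 10FFFF.
    """
    def replace(match):
        code = int(match.group(1) or match.group(2), 16)
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            raise ValueError("invalid Unicode code point: %X" % code)
        return chr(code)

    return _ESCAPE.sub(replace, literal).encode("utf-8")


print(expand_unicode_escapes(r"price: \u20ac5"))
```

The source text stays pure ASCII, so any editor can handle it, and the
validity check happens once, at the point where the literal is converted.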

Finally, there is the issue of whether to allow Unicode identifiers. This
would require many changes to the parser and would require that Lua programs
be edited in a Unicode-aware text editor. I consider the disadvantages of
doing this to far outweigh any advantages, and most, if not all, other
programming languages do not permit multibyte characters in programs, either
as literals or identifiers. I do not know the current encoding used for Lua
programs: is it ASCII, Latin-1, or is it defined by the platform?

[1] http://www.python.org/doc/current/ref/strings.html

Hope this helps,
Steven Murdoch.