- Subject: Re: lua for unicode
- From: lua+Steven.Murdoch@...
- Date: Tue, 03 Dec 2002 12:26:19 +0000
I think there are several issues being discussed here simultaneously, so it
might be helpful to clarify what is being proposed.
Firstly, there is the internal format Lua should use to encode Unicode strings.
In most programming languages this could be any format; however, since Lua
exposes its internals, there would be advantages if it were a standard format.
There are plenty of standards in active use, each with its own tradeoffs, and
these should all be considered. Since Lua allows embedded nulls in strings, it
can support UTF-16/UTF-32 if needed as well as UTF-8. The changes required to
support Unicode would not be huge: mainly ensuring that string length is
properly calculated and that characters are properly iterated over (since
there is no longer a direct octet-character correspondence). An important
decision to be made is whether all strings are Unicode or whether a new
Unicode type is to be added (as is done in Python).
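
To make the length and iteration point concrete, here is a minimal sketch in
Python (chosen only because it is easy to run; it is not a proposal for how Lua
should do it). The helper names utf8_length and utf8_chars are made up for
illustration, and both assume the bytes have already been checked to be
well-formed UTF-8.

```python
# Sketch: byte length vs. character (code point) length for one string.
text = "na\u00efve \u20ac5"  # "naïve €5": 8 code points

utf8 = text.encode("utf-8")       # 11 bytes, no zero bytes
utf16 = text.encode("utf-16-le")  # 16 bytes, contains zero bytes
utf32 = text.encode("utf-32-le")  # 32 bytes, contains zero bytes


def utf8_length(data):
    """Count code points by skipping continuation bytes (10xxxxxx).

    Hypothetical helper; assumes `data` is already valid UTF-8.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_chars(data):
    """Yield the byte sequence of each code point in turn.

    Hypothetical helper; assumes `data` is already valid UTF-8.
    """
    start = 0
    for i in range(1, len(data) + 1):
        # A new character starts at the end of input or at any
        # byte that is not a continuation byte.
        if i == len(data) or (data[i] & 0xC0) != 0x80:
            yield data[start:i]
            start = i
```

The octet count (len(utf8)) and the character count (utf8_length(utf8)) differ
as soon as a non-ASCII character appears, and the UTF-16/UTF-32 forms of even
plain ASCII text contain zero bytes, which is why support for embedded nulls
matters for those encodings.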
Then there is the input and output of Unicode characters. This would require
changes to the I/O libraries to support the various encodings and convert them
to/from the Lua internal format. An important consideration is that, in every
encoding, not every byte pattern in a file is a valid Unicode character. It is
strongly recommended to treat any such byte pattern as an error and not try
to work around it. It is essential that such byte patterns do not exist in the
internal encoding, since this opens several security issues. For this reason it
is sensible to use an existing parser which is well tested against such
issues. This would also allow all the commonly used Unicode encodings to be
supported.
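
As an illustration of the "treat it as an error" policy, the sketch below leans
on an existing, well-tested decoder (Python's built-in UTF-8 codec in strict
mode) rather than a hand-written one. The decode_strict wrapper and the sample
names are hypothetical; the byte sequences are standard examples of ill-formed
UTF-8.

```python
def decode_strict(data, encoding="utf-8"):
    """Decode external bytes; raise UnicodeDecodeError on ill-formed input.

    Hypothetical wrapper. errors="strict" is already the default and is
    spelled out only to make the policy explicit.
    """
    return data.decode(encoding, errors="strict")


samples = {
    "valid": b"caf\xc3\xa9",
    "overlong slash": b"\xc0\xaf",         # 2-byte form of U+002F
    "lone continuation": b"\x80",
    "encoded surrogate": b"\xed\xa0\x80",  # would be U+D800
    "truncated": b"\xe2\x82",              # first 2 bytes of a 3-byte form
}

results = {}
for name, raw in samples.items():
    try:
        decode_strict(raw)
        results[name] = "accepted"
    except UnicodeDecodeError:
        results[name] = "rejected"
```

The overlong form is the classic security case: a decoder that quietly accepts
it lets a "/" slip past any check that only looks for the byte 0x2F.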
Also, there is the representation of Unicode character literals in Lua
programs. Most, if not all, languages have done this with an escape-code system
(for example \uxxxx and \Uxxxxxxxx in Python) rather than having UTF-8 or
other Unicode input files. This technique has the advantage of keeping the
parser small and allows any text editor to be used to edit Lua programs. Again
here checks must be made to ensure that no invalid Unicode characters can be
introduced.
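
A rough sketch of what such a check could look like, again in Python and using
the Python-style \uxxxx and \Uxxxxxxxx forms mentioned above. The function name
expand_escapes is made up, and the sketch deliberately ignores other escapes
(including an escaped backslash), so it shows the validation step only and is
not a complete literal parser.

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
# Malformed escapes such as \u12 do not match and are left untouched here;
# a real lexer would report them as errors.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_escapes(literal):
    """Expand Unicode escapes, rejecting values that are not valid characters.

    Hypothetical helper for illustration only.
    """

    def replace(match):
        code = int(match.group(1) or match.group(2), 16)
        if code > 0x10FFFF:
            raise ValueError("escape beyond U+10FFFF: " + match.group(0))
        if 0xD800 <= code <= 0xDFFF:
            raise ValueError("surrogate code point: " + match.group(0))
        return chr(code)

    return _ESCAPE.sub(replace, literal)
```

For example, expand_escapes(r"price: \u20ac5") gives the string with a euro
sign in it, while r"\ud800" and r"\U00110000" both raise ValueError instead of
producing a value that could not be encoded later.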
Finally, there is the issue of whether to allow Unicode identifiers. This would
require many changes to the parser and would require that Lua programs be
edited in a Unicode-aware text editor. I would consider the disadvantages of
doing this to far outweigh any advantages, and most, if not all, other
programming languages do not permit multibyte characters in programs, either
as literals or identifiers. I do not know the current encoding used for Lua
programs: is it ASCII, Latin-1, or is it defined by the platform?
Hope this helps,