> -------- Original Message --------
> Date: Thu, 06 Jan 2011 00:05:41 +0100
> From: Lorenzo Donati <lorenzodonatibz@interfree.it>
> Subject: Lua interpreter and Lua files encoding
> To: Lua List <lua-l@lists.lua.org>
> Message-ID: <4D24F945.1070200@interfree.it>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hi List!
> 
> For some weeks I've been playing with Lua and LaTeX. More specifically I 
> use Lua to generate LaTeX files.
> 
> Since I need to generate LaTeX files encoded in utf8, the Lua files used 
> to generate them are utf8 too, so I can embed fragments of utf8 text in 
> string literals, which I then assemble to build the final LaTeX file.
> 
> I have never had any problem with this approach so far, but having
> struggled for a while with the LaTeX input/font encoding mess, I was
> struck by a doubt: does the Lua interpreter really support utf8? And to
> what extent? Or was I only lucky because my string literals didn't
> contain fancy Unicode chars, but only accented Latin letters? I also did
> some testing embedding Arabic chars in Lua literals and they showed up
> correctly in LaTeX files, so it seems to work.
> 
> I know that Lua in itself isn't Unicode compliant, but does the 
> interpreter behave well if the only non-ASCII Unicode chars are in 
> string literals (and in comments sometimes)? Is it a guaranteed behaviour?
> 
> I didn't find anything in the manual in this respect (besides the fact
> that Lua is 8-bit clean). I also searched the mailing list archive, but
> didn't find a definitive answer.
> 
> I'm beginning to fear that I'm relying on undefined behaviour!
> 
> Thanks in advance for any explanation.
> 
> -- 
> Lorenzo

According to a test I ran, I believe UTF-8 encoding works correctly with
the interpreter (Lua 5.1.2). First off, here's the test code, with
comments that explain most of it.

----- BEGIN TEST -----

--Fill an array with the test values: all 8 bit chars
--except 0, 13, and 26.
t={}
for i=1,255 do
    if i~= 13 and i~= 26 then t[#t+1] = i end
end

--Write a Lua file which returns a string
--with all characters in the above array.
file = io.open("chartest.lua","wb")
file:write("return [[")
for _,c in ipairs(t) do file:write(string.char(c)) end
file:write("]]\n")
file:close()

--Run the just-written Lua file.
s = dofile("chartest.lua")

--Verify that all chars survived.
for i,c in ipairs(t) do
    assert(string.byte(s,i)==c)
end

print "Passed."
----- END TEST -----

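The test skips char 13 because it comes back as 10. I put this down to
CR/LF translation on my Windows system, but the Lua lexer also converts
line ends inside long strings to a plain newline, so it may happen on any
platform. Either way, this should be inconsequential to the test.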

Char 26 is skipped because it causes an EOF error in the interpreter
when present, most likely because Windows text-mode file reads treat
Ctrl-Z as end of file. All other 8-bit chars are fine.
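
To separate the lexer from file I/O, the same chunk can be built in memory
and compiled directly, so no text-mode translation is involved. This is
only a sketch: it assumes `loadstring` (Lua 5.1) or a `load` that accepts
strings (5.2 and later), and the names `compile`, `bytes` and `changed` are
made up for illustration.

```lua
-- Compile a chunk from a string: loadstring in 5.1, load in 5.2 and later.
local compile = loadstring or load

-- Build the bytes 1..255 (0 is left out, as in the file-based test).
local bytes = {}
for i = 1, 255 do bytes[#bytes + 1] = string.char(i) end

-- Same long-string chunk as the file test, but never written to disk.
local chunk = assert(compile("return [[" .. table.concat(bytes) .. "]]"))
local s = chunk()

-- Collect every position where the byte read back differs from the one written.
local changed = {}
for i = 1, 255 do
    local got = string.byte(s, i)
    if got ~= i then changed[#changed + 1] = i .. "->" .. tostring(got) end
end
print(#s, table.concat(changed, " "))
```

Any byte that changes here was changed by the lexer itself, not by the
file layer, which helps tell the two causes apart for 13 and 26.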

Although this is an 8-bit char test, UTF-8 should work fine, for two
reasons:

(1) UTF-8 maps all values 0 thru 127 to 0 thru 127 respectively. This
means you will not have embedded zeros or char 26 due to an encoding,
unless they are actually ASCII 0 or 26 (which I believe would not be
present in the situation you describe).

(2) All Unicode values from 128 up (not just the 16-bit range 128 thru
65535) are encoded using only bytes that have the high bit set. Since the
above test verifies that all bytes with the high bit set (128 thru 255)
are read correctly, this should not be a problem, either.
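
Both points can be checked mechanically. The following is a minimal sketch,
not a validated UTF-8 library: `utf8_encode` and `all_high` are hypothetical
helpers written with plain arithmetic so they run on Lua 5.1, and the loop
also walks the surrogate range, which is not valid text but follows the
same byte pattern.

```lua
-- Encode one code point as UTF-8 using plain arithmetic (no bit library).
local function utf8_encode(cp)
    if cp < 0x80 then
        return string.char(cp)
    elseif cp < 0x800 then
        return string.char(0xC0 + math.floor(cp / 0x40),
                           0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    else
        return string.char(0xF0 + math.floor(cp / 0x40000),
                           0x80 + math.floor(cp / 0x1000) % 0x40,
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    end
end

-- True when every byte of s is >= 128, i.e. has the high bit set.
local function all_high(s)
    for i = 1, #s do
        if string.byte(s, i) < 0x80 then return false end
    end
    return true
end

-- Check every code point from 128 up to the Unicode maximum.
local ok = true
for cp = 0x80, 0x10FFFF do
    if not all_high(utf8_encode(cp)) then ok = false break end
end
print(ok)
```

If `ok` comes out true, no code point above 127 can ever produce a byte in
the 0 thru 127 range, so it cannot produce a stray 0, 13 or 26.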

BTW, I think it would be a good idea to use the long string [[...]]
format instead of quotes, though it isn't strictly necessary unless some
strings are known to contain quotes.
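
A short illustration of the difference, using made-up LaTeX fragments
(ASCII only, so the example does not depend on the file's encoding):

```lua
-- Quoted form: backslashes and embedded double quotes must be escaped.
local quoted = "\\textbf{\"quoted\"} and 'single'"

-- Long form: backslashes and quotes are taken literally.
local long = [[\textbf{"quoted"} and 'single']]

-- If the text itself contains ]], add equals signs to the delimiters.
local tricky = [==[a[b[1]] = "x"]==]

print(quoted == long)
print(tricky)
```

One thing to keep in mind: escapes such as \n are not interpreted inside
long strings, which is convenient for LaTeX commands but means a line
break has to be typed as an actual line break.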

Reference:
http://en.wikipedia.org/wiki/Utf-8

Caveats:
(a) Windows build
(b) Lua version 5.1.2
(c) Wikipedia
(d) Verify my test code and reasoning  :)

John Giors
Independent Programmer
Three Eyes Software
jgiors@ThreeEyesSoftware.com
http://ThreeEyesSoftware.com