lua-users home
lua-l archive


jgiors@threeeyessoftware.com wrote:
[snip]

Caveats:
(a) Windows build
(b) Lua version 5.1.2
(c) Wikipedia
(d) Verify my test code and reasoning  :)


I tested your code and it works for me (WinXP + Lua 5.1.4/Lua 5.2.0-alpha).

As for the reasoning, I find no fault in it. I'm no expert on Unicode/UTF-8, though.

It seems that if one sticks to literals with no octet in the range 0-31 (to be safe), utf8 Lua files should be safe.

The only problem may be the normalization algorithm cited by David in a branch of this thread:

David Manura wrote:

According to [1], the lexer does not guarantee reliable preservation
of arbitrary octets in string literals, so you may need to encode
these octets with escape sequences.  This is particularly due to ASCII
newlines ([\r\n]+) being normalized to '\n' (so that string literals
have the same meaning regardless of the newline encoding of the source
file).  There's a lexer change in 5.2.0-alpha eliminating dependence
on locales [2], but that doesn't alter the newline normalization--see
the `inclinenumber` in `read_long_string` in llex.c.
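
A minimal sketch of the normalization described above, assuming the chunk is compiled from a string rather than read from a file (the names `compile`, `source`, and `value` are only illustrative). It embeds raw CR LF and lone CR octets in a long string and checks what the lexer hands back:

```lua
-- loadstring exists in Lua 5.1; in 5.2 and later, load accepts a string.
local compile = loadstring or load

-- Source text whose long string holds the raw octets: a CR LF b CR c
local source = "return [[a\r\nb\rc]]"

local chunk = assert(compile(source))
local value = chunk()

-- Both the CR LF pair and the lone CR come back as a single LF,
-- so the literal no longer matches the octets that were in the source.
assert(value == "a\nb\nc")
assert(#value == 5)
```

The source literal carries six octets and the resulting string carries five, which is the reason arbitrary binary data needs escape sequences.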

I'm no C expert, so I cannot comment on the Lua internals cited. But...


This indeed is sometimes unfortunate.  It means that Lua syntax is not
an ideal binary encoding format.


...even if a general binary stream cannot be encoded as a Lua file, can we at least depend on the fact that a stream of utf-8 octets (trusting what Wikipedia said) can be safely embedded in a string literal, as John's test seems to prove?
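
A rough sketch of the kind of check this question suggests, not a proof. It assumes the text contains only multi-byte UTF-8 characters, whose octets are all 128 or above and so can never be CR or LF; any ASCII newline, quote, or backslash in the text would still need the usual care. The names `compile`, `utf8_text`, and `roundtrip` are illustrative:

```lua
-- loadstring exists in Lua 5.1; in 5.2 and later, load accepts a string.
local compile = loadstring or load

-- UTF-8 octets for U+00E9 (C3 A9) followed by U+20AC (E2 82 AC),
-- written as decimal escapes so this example itself stays pure ASCII.
local utf8_text = "\195\169\226\130\172"

-- Every octet of a multi-byte UTF-8 sequence is >= 128.
for i = 1, #utf8_text do
  assert(utf8_text:byte(i) >= 128)
end

-- Embed the raw octets, unescaped, inside a quoted string literal.
local source = 'return "' .. utf8_text .. '"'
local roundtrip = assert(compile(source))()

-- The lexer returns the same octets it was given.
assert(roundtrip == utf8_text)
```

This only exercises two characters on one build, so it supports the observation rather than guaranteeing it for every Lua version.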

Anyway, is this only an implementation artifact? Or is it something that will last? In the latter case a mention in the reference manual could be useful, since utf8 is very common nowadays, and generating utf8 files using Lua, _without specialized libraries_ and without the hassle of encoding literals with escape sequences, is really useful!

Thanks.

--
Lorenzo