lua-users home
lua-l archive


> Anyway, is this only an implementation artifact? Or is it something that
> will last? In the latter case a mention in the reference manual could
> be useful, since UTF-8 is very common nowadays and generating UTF-8 files
> using Lua, _without specialized libraries_ and without the hassle of
> encoding literals with escape sequences, is really useful!

The Lua 5.1 reference manual states that "strings in Lua can contain any 8-bit value", but it doesn't guarantee the same for literal strings embedded in Lua source code. So if you really want to guarantee compatibility with different Lua implementations (e.g. LuaJIT, Kahlua, luaj, Jill, LuaCLR, LuaToCee, the list* goes on for quite a while...) then it might be wise to encode UTF-8 string literals using escape sequences. On the other hand, the end-of-line normalization mentioned earlier should never corrupt valid UTF-8 sequences, because UTF-8 was specifically designed to be compatible with ASCII: every byte of a multi-byte sequence has its high bit set (0x80 or above), so a carriage return (0x0D) or line feed (0x0A) can never appear inside one.
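To make the point about end-of-line normalization concrete, here is a minimal sketch. It assumes the normalization in question simply rewrites CR LF as LF; the sample strings and the gsub-based normalization are my own illustration, not what any particular Lua implementation does internally. The samples are written with decimal escapes so that the check does not depend on the encoding of the file it is pasted into.

```lua
-- UTF-8 encodings of three non-ASCII characters, written as byte escapes.
local samples = {
  '\195\133',          -- U+00C5 (two bytes)
  '\226\130\172',      -- U+20AC (three bytes)
  '\240\159\152\128',  -- U+1F600 (four bytes)
}

-- Every byte of a multi-byte sequence is 0x80 or above, so none of them
-- can be mistaken for CR (0x0D) or LF (0x0A).
for _, s in ipairs(samples) do
  for i = 1, #s do
    assert(s:byte(i) >= 0x80)
  end
end

-- Assumed normalization: CR LF becomes LF. The UTF-8 bytes pass through.
local text = '\195\133ngstr\195\182m\r\n\226\130\172\r\n'
local normalized = text:gsub('\r\n', '\n')
assert(normalized == '\195\133ngstr\195\182m\n\226\130\172\n')
```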

I don't think the Lua reference manual should mention UTF-8 unless it is also going to guarantee that string literals with UTF-8 contents are passed through unharmed. The wording of the reference manual is generally quite conservative, though, and I think one of the reasons for this is to ease the implementation of Lua on a range of platforms with different characteristics.

 - Peter Odding


PS. Given the above I don't see how you would need "specialized libraries" to generate Lua source code containing literal strings with UTF-8 using escape sequences, i.e. the following should suffice to output such string literals:

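function encode_literal(s)
  -- Escape each byte outside [A-Za-z0-9 ]; zero-padding to three digits keeps
  -- a following literal digit from being absorbed into the escape.
  return '"' .. s:gsub('[^A-Za-z0-9 ]', function(c)
    return ('\\%03d'):format(c:byte())
  end) .. '"'
end

print(encode_literal 'Ångström')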
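A quick way to convince yourself that such escaped literals survive the trip through the Lua parser is to compile the generated literal and compare the result with the original string. The following is only a sketch: escape_bytes and roundtrip are names I made up for it, and escape_bytes pads each escape to three digits, which matters when an escaped byte is directly followed by a literal digit (as in the sample, where a newline is followed by "1").

```lua
-- Hypothetical helper for this sketch: escape every byte outside
-- [A-Za-z0-9 ] as a zero-padded, three-digit decimal escape.
local function escape_bytes(s)
  local body = s:gsub('[^A-Za-z0-9 ]', function(c)
    return ('\\%03d'):format(c:byte())
  end)
  return '"' .. body .. '"'
end

-- loadstring exists in Lua 5.1 and LuaJIT; load accepts strings from 5.2 on.
local compile = loadstring or load

-- Compile the generated literal as a chunk and return the string it denotes.
local function roundtrip(s)
  local chunk = assert(compile('return ' .. escape_bytes(s)))
  return chunk()
end

-- The sample is written with escapes, so it does not depend on how this
-- file itself is encoded.
local sample = 'line\n1 \195\133ngstr\195\182m'
assert(roundtrip(sample) == sample)
print(escape_bytes(sample))  --> "line\0101 \195\133ngstr\195\182m"
```

Without the padding, the newline followed by "1" would be written as \101, which the parser reads as a single byte (the letter "e") rather than a newline and a digit.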