Thanks for the prompt reply!

Drake Wilson wrote:
Quoth Lorenzo Donati <lorenzodonatibz@interfree.it>, on 2011-01-06 00:05:41 +0100:
I know that Lua in itself isn't Unicode compliant, but does the
interpreter behave well if the only non-ASCII Unicode chars are in
string literals (and in comments sometimes)? Is it a guaranteed
behaviour?

"Unicode compliant" doesn't mean a whole lot here.  As far as I know,
arbitrary octets can be embedded in string literals and they'll just
be passed through transparently.  This means if the source encoding is
UTF-8 then non-ASCII UTF-8 sequences will show up as the same octet
sequences.  I interpret « Strings in Lua can contain any 8-bit value,
including embedded zeros, which can be specified as '\0'. » from the
Lua 5.1 manual (section 2.1) to imply that this is true for source as
well, but I didn't write the manual, so...
Yes, I read that too, and I'm aware that one can encode arbitrary octets using escape sequences like \043, \123, etc. My doubt concerns true non-ASCII characters. If in SciTE (set to UTF-8 encoding mode) I enter, say, an accented 'e' like this: è, which should map to a two-byte UTF-8 sequence (if I remember correctly from some testing I've done), can I be sure that those two bytes, embedded in a literal, will be correctly interpreted by the interpreter (sorry for the pun)? Are they always equivalent to the corresponding pair of escape sequences? Are there Unicode chars that, when inserted in a literal, will be encoded as a multi-octet sequence which is illegal in a literal (but would be legal if typed using escape sequences)?

I know Lua can _store_ any octet sequence in a string. My doubt concerns the interpreter executable: can it always read and parse a UTF-8 file with non-ASCII chars in some literals/comments?
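
To make the question concrete, here is a minimal sketch of the kind of test I have in mind. It is not a guarantee of anything, just an experiment: it builds small chunks that contain raw bytes above 127 inside a string literal or a comment and hands them to the compiler. The names `compile` and `high_bytes_pass_through` are mine, made up for this example. The reasoning behind it is that UTF-8 multi-byte sequences consist only of bytes in the range 128-255, so they can never contain a quote, a backslash or a newline.

```lua
-- Lua 5.1 provides loadstring; later versions accept a string in load.
local compile = loadstring or load

-- Raw UTF-8 bytes of U+00E8 (e with grave accent): 0xC3 0xA8.
local raw = string.char(195, 168)

-- The same literal, once with the raw bytes and once with decimal escapes
-- (the doubled backslashes put real escape sequences in the inner chunk).
local from_raw = assert(compile('return "' .. raw .. '"'))()
local from_escapes = assert(compile('return "\\195\\168"'))()
local same_string = (from_raw == from_escapes)

-- Raw high bytes inside a comment: the chunk should still compile and run.
local comment_ok = assert(compile("-- " .. raw .. " in a comment\nreturn 1"))() == 1

-- Hypothetical helper: checks that every single byte from 128 to 255
-- survives unchanged when placed raw inside a double-quoted literal.
local function high_bytes_pass_through()
  for b = 128, 255 do
    local ch = string.char(b)
    local f = compile('return "' .. ch .. '"')
    if not f or f() ~= ch then
      return false
    end
  end
  return true
end

print(same_string, comment_ok, high_bytes_pass_through())
```

If all three values come out true on a given interpreter, then at least on that build the raw bytes and the escape sequences are interchangeable in literals.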



This does mean that if your source files are ever recoded into some
other charset, your literals will break because the execution coding
will have implicitly changed as well.  If this is important, you can
test the octets of a known string early on and raise an error if they
don't look correct.

OK, thanks for the suggestion, but I doubt it will happen (except through my own mistakes), since I have full control of the files: I write them in SciTE with a UTF-8 cookie near the top, so they always open in the correct mode.
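
For the record, the early check suggested above could look roughly like the following. This is only a sketch under my own assumptions: the helper name `check_source_encoding` is hypothetical, and the probe character (è, which is the two bytes 195 168 in UTF-8) is just the one I happen to use in this thread.

```lua
-- Hypothetical helper: raises an error unless `probe` holds exactly the
-- UTF-8 encoding of U+00E8, i.e. the two bytes 195 and 168.
local function check_source_encoding(probe)
  local expected = "\195\168"
  if probe ~= expected then
    -- List the bytes actually found, to make the failure easier to diagnose.
    local found = table.concat({ probe:byte(1, -1) }, " ")
    error("source file does not look like UTF-8 (probe bytes: " .. found .. ")", 2)
  end
  return true
end

-- The literal below is typed as a real non-ASCII character, so it only
-- passes if this file is still saved as UTF-8.
check_source_encoding("è")
```

If the file were ever recoded to Latin-1, the literal would become the single byte 232 and the check would fail at load time instead of producing wrong strings later.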

Things like the length operator and stock Lua string operations will
neither respect nor choke on UTF-8 sequences; they will blindly treat
them as their component octets, with all the blessings and curses that
entails.
Yes, I knew that, but I don't need those facilities to work on Unicode strings. That is all the terse and probably inexact expression "Unicode compliant" in my post was meant to convey, or at least that was my intention :-)
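
Just to illustrate the octet-wise behaviour described above, a small sketch (the function name `utf8_len` is my own, and the code-point count assumes the string is valid UTF-8):

```lua
-- "hèllo" written with escapes so the example does not depend on the
-- encoding of this file: 'h', 0xC3 0xA8, 'l', 'l', 'o'.
local s = "h\195\168llo"

-- The length operator counts octets, not characters.
local byte_len = #s

-- string.reverse works on octets too, so it splits the two-byte sequence
-- and leaves the bytes in an order that is no longer valid UTF-8.
local reversed = ("\195\168"):reverse()

-- Counting characters by hand: count every byte that is not a UTF-8
-- continuation byte (continuation bytes lie in the range 128-191).
local function utf8_len(str)
  local _, count = str:gsub("[^\128-\191]", "")
  return count
end

print(byte_len, utf8_len(s))
```

Here the string has five characters but six octets, which is exactly the "blessings and curses" situation: nothing chokes, but nothing is aware of the encoding either.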


Does that answer your question?
Not completely, but it gives some clues, thanks anyway.