- Subject: Re: Lua interpreter and Lua files encoding
- From: Lorenzo Donati <lorenzodonatibz@...>
- Date: Thu, 06 Jan 2011 00:57:44 +0100
Thanks for the prompt reply!
Drake Wilson wrote:
Yes, I read that too, and I'm aware that one can code arbitrary octets
using escape sequences like \043 \123 etc.
My doubt is about true non-ASCII characters: if in SciTE (set to UTF-8
encoding mode) I enter, say, an accented 'e' like this: è, which should
map to a two-byte UTF-8 sequence (if I remember some testing I've done),
can I be sure that those two bytes, embedded in a literal, will be
correctly interpreted by the interpreter (sorry for the pun)? Are they
always equivalent to the corresponding pair of escape sequences? Are
there Unicode chars that, when inserted in a literal, will be encoded as a
multi-octet sequence which is illegal in a literal (but would be legal if
typed using escape sequences)?
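A small check one could run to see this for a given setup. It is only a sketch: it assumes the file is saved as UTF-8, and `octets` is a hypothetical helper, not a standard function. The bytes 195 and 168 are the UTF-8 encoding of è (0xC3 0xA8). As far as I can tell from the reference lexer, bytes above 127 inside a literal are copied unchanged, and every byte of a non-ASCII UTF-8 sequence is above 127, so none of them can collide with a quote, a backslash or a newline.

```lua
-- Assumes this file is saved as UTF-8.
local literal = "è"          -- typed directly in the editor
local escaped = "\195\168"   -- decimal escapes for bytes 0xC3 0xA8

-- Hypothetical helper: byte values of a string as a space-separated list.
local function octets(s)
  local t = {}
  for i = 1, #s do
    t[#t + 1] = string.byte(s, i)
  end
  return table.concat(t, " ")
end

print(octets(literal), octets(escaped))
print(literal == escaped)
```

If the two lists differ, the editor did not save the file as UTF-8 (or something recoded it afterwards).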
> Quoth Lorenzo Donati <email@example.com>, on 2011-01-06 00:05:41 +0100:
>> I know that Lua in itself isn't Unicode compliant, but does the
>> interpreter behave well if the only non-ASCII Unicode chars are in
>> string literals (and in comments sometimes)? Is it a guaranteed
>
> "Unicode compliant" doesn't mean a whole lot here. As far as I know,
> arbitrary octets can be embedded in string literals and they'll just
> be passed through transparently. This means if the source encoding is
> UTF-8 then non-ASCII UTF-8 sequences will show up as the same octet
> sequences. I interpret « Strings in Lua can contain any 8-bit value,
> including embedded zeros, which can be specified as '\0'. » from the
> Lua 5.1 manual (section 2.1) to imply that this is true for source as
> well, but I didn't write the manual, so...
I know Lua can _store_ any octet sequence in a string. My doubt is about
the interpreter executable: can it always read and parse a UTF-8 file
with non-ASCII chars in some literals/comments?
Ok, thanks for the suggestion, but I doubt it will happen (except
because of my mistakes), since I have full control of the files (I write
them in SciTE with a UTF-8 cookie near the top, so they always show
up in the correct mode).
> This does mean that if your source files are ever recoded into some
> other charset, your literals will break because the execution coding
> will have implicitly changed as well. If this is important, you can
> test the octets of a known string early on and raise an error if they
> don't look correct.
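The check suggested here might look roughly like the following. It is a sketch, not a tested recipe: `is_utf8_e_grave` and `probe` are hypothetical names, and è is just one convenient probe character. In Latin-1 the same character would be the single byte 232, so a recoded file fails the test.

```lua
-- Hypothetical helper: true if `s` is exactly the UTF-8 encoding of U+00E8.
local function is_utf8_e_grave(s)
  local b1, b2 = string.byte(s, 1, 2)
  return #s == 2 and b1 == 195 and b2 == 168
end

-- Typed directly, so its bytes depend on how this file was saved.
local probe = "è"

if not is_utf8_e_grave(probe) then
  error("source file does not look like UTF-8; string literals may be wrong")
end
```

Placed near the top of the main script, this stops execution before any mis-encoded literal is used.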
Yes, I did know that, but I don't need those facilities to work on
Unicode strings (that terse and probably inexact "Unicode compliant"
expression in my post meant just that, at least it was my intention :-) )
> Things like the length operator and stock Lua string operations will
> neither respect nor choke on UTF-8 sequences; they will blindly treat
> them as their component octets, with all the blessings and curses that
>
> Does that answer your question?
Not completely, but it gives some clues, thanks anyway.
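
For the record, the point about the length operator and the stock string functions can be seen with a few lines. This is a sketch: the è is written with escapes so the result does not depend on the file encoding, and `utf8_len` is a hypothetical helper that assumes its argument is valid UTF-8.

```lua
local s = "caff\195\168"   -- "caffè": 4 ASCII bytes plus a 2-byte sequence

print(#s)                  -- counts octets, not characters
print(#s:sub(1, 5))        -- cuts the two-byte sequence in half

-- Hypothetical helper: counts characters by skipping UTF-8 continuation
-- bytes (128..191). Assumes `s` is valid UTF-8.
local function utf8_len(str)
  local n = 0
  for i = 1, #str do
    local b = string.byte(str, i)
    if b < 128 or b >= 192 then
      n = n + 1
    end
  end
  return n
end

print(utf8_len(s))
```

Here `#s` is 6 while `utf8_len(s)` is 5, and `s:sub(1, 5)` ends in a lone lead byte, which is no longer valid UTF-8.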