Re: Lua interpreter and Lua files encoding

lua-l archive


Subject: Re: Lua interpreter and Lua files encoding
From: Lorenzo Donati <lorenzodonatibz@...>
Date: Thu, 06 Jan 2011 00:57:44 +0100

Thanks for the prompt reply!

Drake Wilson wrote:

Quoth Lorenzo Donati <lorenzodonatibz@interfree.it>, on 2011-01-06 00:05:41 +0100:

I know that Lua in itself isn't Unicode compliant, but does the
interpreter behave well if the only non-ASCII Unicode chars are in
string literals (and in comments sometimes)? Is it a guaranteed
behaviour?


"Unicode compliant" doesn't mean a whole lot here.  As far as I know,
arbitrary octets can be embedded in string literals and they'll just
be passed through transparently.  This means if the source encoding is
UTF-8 then non-ASCII UTF-8 sequences will show up as the same octet
sequences.  I interpret « Strings in Lua can contain any 8-bit value,
including embedded zeros, which can be specified as '\0'. » from the
Lua 5.1 manual (section 2.1) to imply that this is true for source as
well, but I didn't write the manual, so...

Yes, I read that too, and I'm aware that one can code arbitrary octets using escape sequences like \043 \123 etc.

My doubt is with true non-ASCII characters: if in SciTE (set to utf8 encoding mode) I enter, say, an accented 'e' like this: è, which should map to a two-byte utf8 sequence (if I remember some testing I've done), can I be sure that those two bytes, embedded in a literal, will be correctly interpreted by the interpreter (sorry for the pun)? Are they always equivalent to the corresponding pair of escape sequences? Are there Unicode chars that, when inserted in a literal, will be encoded as a multi-octet sequence which is illegal in a literal (but would be legal if typed using escape sequences)?

I know Lua can _store_ any octet sequence in a string. The doubt is with the interpreter executable: can it read and always parse a utf8 file with non-ASCII chars in some literals/comments?
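
The equivalence asked about here can be checked empirically. The following is a minimal sketch, assuming the script itself is saved as UTF-8 without a byte order mark; the variable names are illustrative. It compares a literal typed directly in the editor with the same two octets written as decimal escapes. One property of UTF-8 is relevant: every octet of a multi-octet sequence has a value of 128 or above, so none of them can coincide with a quote, a backslash, or a newline.

```lua
-- Assumption: this file is saved as UTF-8 (no byte order mark).
local literal = "è"          -- typed directly in the editor
local escaped = "\195\168"   -- octets 0xC3 0xA8, the UTF-8 encoding of U+00E8

print(#literal)              -- number of octets in the literal, not characters
print(literal:byte(1, -1))   -- the individual octet values
print(literal == escaped)    -- whether both spellings give the same string
```

If the comparison prints false, the file was most likely saved in a different encoding than assumed.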


This does mean that if your source files are ever recoded into some
other charset, your literals will break because the execution coding
will have implicitly changed as well.  If this is important, you can
test the octets of a known string early on and raise an error if they
don't look correct.
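
A sketch of the early check suggested above, assuming the source is meant to be UTF-8. The function name `assert_utf8_source` and the error message are hypothetical, chosen only for illustration.

```lua
-- Hypothetical guard: raises an error unless the probe string consists of
-- exactly the two octets that UTF-8 uses for "è" (195, 168).
local function assert_utf8_source(probe)
  local b1, b2 = probe:byte(1, 2)
  if #probe ~= 2 or b1 ~= 195 or b2 ~= 168 then
    error("source file does not appear to be UTF-8 encoded", 2)
  end
  return true
end

-- Run once near the top of the script, with a literal typed in the editor.
assert_utf8_source("è")
```

If the file were recoded to Latin-1, for example, the same literal would become the single octet 232 and the check would fail immediately rather than producing mangled output later.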

Ok, thanks for the suggestion, but I doubt it will happen (except because of my mistakes), since I have full control of the files (I write them using SciTE using a utf8 cookie near the top, so they always show up in the correct mode).

Things like the length operator and stock Lua string operations will
neither respect nor choke on UTF-8 sequences; they will blindly treat
them as their component octets, with all the blessings and curses that
entails.
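
To illustrate the octet-wise behaviour described above, here is a small sketch, assuming a UTF-8 encoded source file and valid UTF-8 input. The helper `utf8_len` is not part of the stock library; it is a hypothetical function that counts characters by subtracting continuation octets (values 128 to 191) from the octet count.

```lua
-- Assumption: this file is saved as UTF-8.
local s = "perché"   -- six characters, but "é" occupies two octets

print(#s)            -- the length operator counts octets

-- Hypothetical helper: counts code points in a valid UTF-8 string by
-- ignoring continuation octets (bit pattern 10xxxxxx, values 128-191).
local function utf8_len(str)
  local _, continuation = str:gsub("[\128-\191]", "")
  return #str - continuation
end

print(utf8_len(s))   -- counts characters instead of octets
```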

Yes, I did know that, but I don't need those facilities to work on Unicode strings (that terse and probably inexact "Unicode compliant" expression in my post meant just that, at least it was my intention :-) )


Does that answer your question?

Not completely, but it gives some clues, thanks anyway.

