- Subject: Re: luaL_loadfile doesn't like the UTF-8 BOM
- From: Philippe Lhoste <PhiLho@...>
- Date: Mon, 09 Jul 2012 12:54:33 +0200
On 09/07/2012 11:14, Simon Orde wrote:
Hi - for reasons discussed earlier (see "Future Plans for Lua and Unicode"), I want to
allow Lua scripts to be encoded in either ANSI or UTF-8. Legacy scripts are
currently ANSI, but with future ones, script authors will be able to choose between ANSI
or UTF-8 (for interest, I plan to say that the encoding of strings passed to/returned from
my app's API must match/will match the encoding of the script itself). I use a UTF-8 BOM
to allow me to distinguish between ANSI and UTF-8 scripts. This seems to work fine for a
simple script. I use lua_load with my own supplied reader to load each script, check the
BOM (which I need to do anyway, so I know how to handle strings), and then jump past the
BOM if there is one.
Script authors can also write Lua modules, however, and by default these are loaded by
luaL_loadfile, which doesn't like the UTF-8 BOM and throws an error. This isn't actually
that serious a problem, because one simple solution would be for me to simply specify that
modules must always be encoded in ANSI. However, if possible, I would prefer to allow
modules to be encoded in either UTF-8 or ANSI too (the rule about string encoding matching
script encoding would not apply to modules). Can anyone suggest a way that I can do this,
while still retaining the UTF-8 BOM? I had a look at the package.loaders section in the
manual, but this seems to only provide a way to have a module-specific loader, whereas my
requirement applies to all modules.
From what I have read in this very mailing list from people more knowledgeable than me,
there is no real, official UTF-8 BOM...
It is an invention of Microsoft, written, for example, by Notepad when saving in UTF-8.
Might I ask why you maintain two sets of encodings? It would be simpler to make everything
UTF-8. If a script is pure ASCII, no problem: it is identical to its UTF-8 version. If
a script uses the "ANSI" encoding (hasn't somebody mentioned that it is also not a real
encoding, just a Microsoft shorthand for ISO-8859-1 or -15 or something like that?), it
should be trivial to convert it.
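To illustrate how small that conversion can be, here is a minimal sketch in C. It assumes the "ANSI" scripts really are ISO-8859-1, where every byte maps directly to the Unicode code point of the same value; a Windows code page such as 1252 differs in the 0x80-0x9F range and would need a lookup table instead. The function name latin1_to_utf8 is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
 * ASSUMPTION: the input is ISO-8859-1, not a Windows code page.
 * `out` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: the byte is already valid UTF-8. */
            out[n++] = c;
        } else {
            /* U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

For instance, the four bytes "caf" followed by 0xE9 (é) come out as five bytes ending in 0xC3 0xA9. Pure ASCII input comes out unchanged.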
Possible alternatives to this pseudo-BOM (Lua is probably not the only software barfing on
it) would be to indicate the encoding in the name of the file, or to use a special first
line, as XML files do with the encoding="UTF-8" attribute.
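The "special first line" idea could be checked on the host side with something like the sketch below. The marker text "encoding: UTF-8" (which a script would carry in a leading Lua comment such as `-- encoding: UTF-8`) and the function name are both invented for this example; nothing in Lua defines such a convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check: does the first line of the script declare UTF-8?
 * ASSUMPTION: scripts use a made-up convention, a first line containing
 * the text "encoding: UTF-8", e.g. the Lua comment "-- encoding: UTF-8".
 * Returns 1 if the marker appears on the first line, 0 otherwise. */
int first_line_declares_utf8(const char *buf, size_t len)
{
    static const char marker[] = "encoding: UTF-8";
    const size_t mlen = sizeof marker - 1;

    /* The first line ends at the first '\n', or at the end of the buffer. */
    const char *eol = memchr(buf, '\n', len);
    size_t line_len = eol ? (size_t)(eol - buf) : len;

    /* Plain substring search restricted to the first line. */
    for (size_t i = 0; i + mlen <= line_len; i++) {
        if (memcmp(buf + i, marker, mlen) == 0)
            return 1;
    }
    return 0;
}
```

Since the marker sits inside an ordinary comment, the file stays loadable by luaL_loadfile as-is, and the host application only needs to peek at the first line to decide how to treat strings.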
-- (near) Paris -- France
-- -- -- -- -- -- -- -- -- -- -- -- -- --