Many thanks for the replies to this - all helpful.
Philippe - I have to think about legacy scripts written in ANSI, and the
possibility that people may want to write new scripts to be run against
old versions of my software, which expects ANSI. Yes I can easily convert
old ANSI scripts if I know that that's what they are, but users can
import scripts written by others, and without a BOM it's tricky to
distinguish the encoding. Also, internally my app will use UTF-16 so I
have to do a lot of string conversion anyway. It's not significantly
harder for me to add support for ANSI as a conversion target, as well as
UTF-8. Also, I want plugin authors to be able to work in ANSI anyway if
they want to, because it's easier. I agree that there are ways of doing
it which avoid the need for a BOM. I'm not sure if they are better
though.
Simon
Note that even if you have an "ANSI" source file, you have no idea what
encoding that actually is. For example, for a Russian user the system
"ANSI" encoding is Windows-1251, while for a speaker of a Western
European language (such as English) it is Windows-1252. And then there
are some languages added more recently for which there isn't even an
ANSI code page! So, in other words, ANSI is a mess.
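To make the ambiguity concrete, here is a small Python sketch (Python is
used only for illustration; the sample string is my own choice). It
takes the bytes a Russian-locale editor would write and decodes them
under both code pages:

```python
# The same bytes mean different things under two "ANSI" code pages.
# The sample word is an arbitrary illustration, not from any real script.
raw = "Привет".encode("cp1251")   # bytes as saved on a Windows-1251 system

as_cp1251 = raw.decode("cp1251")  # what the author intended
as_cp1252 = raw.decode("cp1252")  # what a Windows-1252 system would show

print(as_cp1251)
print(as_cp1252)

# The bytes are also not valid UTF-8, so a strict UTF-8 decode rejects them.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

Nothing in the file itself says which of the two readings is right,
which is why the ANSI case can only ever be a guess.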
My recommended strategy for this sort of transition:
- Read the first bytes of the file. If they're a UTF-8 BOM, assume
UTF-8. If they're a UTF-16 BOM, assume UTF-16. If they're neither, then
proceed...
- Try to decode the text as UTF-8. UTF-8 without a BOM is not
uncommon, and UTF-8's encoding has the nice property that text in
another encoding is unlikely to be misidentified as UTF-8. It is
therefore reasonable to try UTF-8 on unknown text, and if it decodes,
it's very probably right.
- Failing that, decode in the system ANSI locale.
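The steps above can be sketched roughly as follows. This is a minimal
Python illustration rather than anything you'd drop into a Lua host
as-is: the function name decode_script and its ansi_encoding parameter
are made up for the example, and on Windows the real fallback would be
whatever the active ANSI code page is. UTF-32 BOMs are deliberately
ignored here.

```python
import codecs
import locale
from typing import Optional, Tuple


def decode_script(data: bytes,
                  ansi_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, detected_encoding) using BOM -> UTF-8 -> ANSI fallback.

    decode_script and ansi_encoding are hypothetical names for this sketch.
    """
    # Step 1: trust a BOM if there is one.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8-bom"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16"), "utf-16"

    # Step 2: no BOM, so try strict UTF-8. Text in a legacy code page that
    # contains non-ASCII bytes rarely happens to be well-formed UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: fall back to the "ANSI" encoding. This is only a guess;
    # the default here is whatever the current system locale reports.
    enc = ansi_encoding or locale.getpreferredencoding(False)
    return data.decode(enc, errors="replace"), enc
```

For example, decode_script(b"\xef\xbb\xbfprint('hi')") strips the BOM
and reports "utf-8-bom", while Windows-1251 bytes passed with
ansi_encoding="cp1251" fall through to the last step. Note that pure
ASCII scripts come out of step 2 as UTF-8, which is harmless since the
bytes are the same either way.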
Whatever your result, I would suggest just converting your scripting
interface to run on UTF-8 internally. You may break some scripts, but
you'll also avoid the case (which, frankly, is going to be very broken)
where you have, say, a UTF-8 script using an "ANSI" module.
With regard to luaL_loadfile: the only built-in Lua functions which load
a file are
- loadfile
- dofile
- package.searchers[2], as used by require (<-- you might want to
check that index. A glance at the documentation suggests it is probably
right)
It should be relatively easy to re-implement these in a fully
Unicode-aware manner (i.e. following the rules suggested above).