

09 July 2012 13:28
Many thanks for the replies to this - all helpful.

Philippe - I have to think about legacy scripts written in ANSI, and the possibility that people may want to write new scripts to be run against old versions of my software, which expects ANSI.  Yes I can easily convert old ANSI scripts if I know that that's what they are, but users can import scripts written by others, and without a BOM it's tricky to distinguish the encoding.  Also, internally my app will use UTF-16 so I have to do a lot of string conversion anyway.  It's not significantly harder for me to add support for ANSI as a conversion target, as well as UTF-8.  Also, I want plugin authors to be able to work in ANSI anyway if they want to, because it's easier.  I agree that there are ways of doing it which avoid the need for a BOM.  I'm not sure if they are better though.


Note that even when you have an "ANSI" source file, you have no idea what encoding that actually is. For example, for a Russian user the system "ANSI" encoding is Windows-1251, while for a speaker of a Western European language (such as English) it is Windows-1252. And then there are some languages added more recently... for which there isn't even an ANSI locale! So, in other words... ANSI is a mess.

My recommended strategy for this sort of transition:
  • Read the first bytes of the file. If they're a UTF-8 BOM, assume UTF-8. If they're a UTF-16 BOM, assume UTF-16. If they're neither, then proceed...
  • Try to decode the text as UTF-8. UTF-8 without a BOM is not uncommon, and the encoding has the nice property that text in another encoding is unlikely to be misidentified as UTF-8. It is therefore reasonable to try UTF-8 on unknown text, and if it decodes, it's very probably right.
  • Failing that, decode in the system ANSI locale.
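
The three steps above can be sketched in C roughly as follows. This is only a minimal illustration, assuming the whole file has already been read into memory; the names script_encoding, is_valid_utf8 and detect_encoding are made up for this example and do not come from Lua or any other library. The ANSI result only means "fall back to the system code page", and the actual conversion is left out.

```c
#include <stddef.h>

/* Hypothetical result type; these names are illustrative only. */
typedef enum { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_ANSI } script_encoding;

/* Returns 1 if buf[0..len) is well-formed UTF-8, else 0.
   Rejects overlong forms, surrogates (U+D800..U+DFFF) and
   code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */
        size_t k;

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return 0;                /* truncated sequence */
        for (k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;           /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return 0;         /* overlong or too large */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
        i += need + 1;
    }
    return 1;
}

/* Step 1: look for a BOM. Step 2: try UTF-8. Step 3: assume system ANSI. */
static script_encoding detect_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;
    if (is_valid_utf8(buf, len))
        return ENC_UTF8;
    return ENC_ANSI;
}
```

Note that a pure ASCII file passes the UTF-8 check, which is harmless since ASCII bytes mean the same thing in UTF-8 and in the Windows-125x code pages. A caller would still need to skip the BOM bytes, if any, before handing the text on.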

Whatever your result, I would suggest just converting your scripting interface to run on UTF-8 internally. You may break some scripts... but you'll also avoid the case (which, frankly, is going to be very broken) where you have, say, a UTF-8 script using an "ANSI" module.

With regard to luaL_loadfile: the only built-in Lua functions which load a file are:
  • loadfile
  • dofile
  • package.searchers[2], as used by require (<-- you might want to check that index. A glance at the documentation suggests it is probably right)
It should be relatively easy to re-implement these in a fully Unicode-aware manner (i.e. following the rules suggested above).