Re: luaL_loadfile doesn't like the UTF-8 BOM

lua-l archive

Subject: Re: luaL_loadfile doesn't like the UTF-8 BOM
From: Owen Shepherd <owen.shepherd@...>
Date: Mon, 09 Jul 2012 14:18:40 +0100

Simon Orde wrote (09 July 2012 13:28):

Many thanks for the replies to this - all helpful.

Philippe - I have to think about legacy scripts written in ANSI, and the possibility that people may want to write new scripts to be run against old versions of my software, which expect ANSI. Yes, I can easily convert old ANSI scripts if I know that that's what they are, but users can import scripts written by others, and without a BOM it's tricky to distinguish the encoding. Also, internally my app will use UTF-16, so I have to do a lot of string conversion anyway. It's not significantly harder for me to add support for ANSI as a conversion target, as well as UTF-8. I also want plugin authors to be able to work in ANSI if they want to, because it's easier. I agree that there are ways of doing it which avoid the need for a BOM. I'm not sure if they are better, though.

Simon

Note that even if you have an "ANSI" source file, you have no idea which encoding it actually uses. For a Russian user the system "ANSI" encoding is Windows-1251, while for a speaker of a Western European language (such as English) it is Windows-1252. And then there are some more recently added languages for which there isn't an ANSI encoding at all! In other words, ANSI is a mess.
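To make the ambiguity concrete, here is a small sketch in C. The two function names are made up for illustration, and each one only handles bytes in the range 0xC0..0xFF, where both code pages have a simple linear mapping to Unicode.

```c
/* Hypothetical helpers, valid only for bytes in 0xC0..0xFF. */

/* Windows-1252: 0xC0..0xFF coincide with Latin-1, so the byte value
   is the Unicode code point. */
unsigned int cp1252_high_to_unicode(unsigned char b)
{
    return b;
}

/* Windows-1251: 0xC0..0xFF map onto the Cyrillic block starting at
   U+0410 (CYRILLIC CAPITAL LETTER A). */
unsigned int cp1251_high_to_unicode(unsigned char b)
{
    return 0x0410u + (unsigned int)(b - 0xC0u);
}
```

The single byte 0xE9 comes out as U+00E9 (é) under Windows-1252 and as U+0439 (й) under Windows-1251, and nothing in the file itself says which reading was intended.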

My recommended strategy for this sort of transition:

Read the first bytes of the file. If they're a UTF-8 BOM, assume UTF-8. If they're a UTF-16 BOM, assume UTF-16. If they're neither, proceed to the next step.
Try to decode the text as UTF-8. UTF-8 without a BOM is not uncommon, and UTF-8 has the nice property that text in another encoding is unlikely to be misidentified as UTF-8. It is therefore reasonable to try UTF-8 on unknown text, and if it decodes, it's very probably right.
Failing that, decode in the system ANSI locale.
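The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the enum and function names are invented here, the whole file is assumed to be in memory already, and the UTF-8 check is a strict one (it rejects overlong forms, surrogates and values above U+10FFFF) so that text in another encoding is less likely to slip through.

```c
#include <stddef.h>

/* Hypothetical result type; the names are illustrative only. */
typedef enum { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_ANSI } script_encoding;

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;           /* continuation bytes expected */
        unsigned long cp, min;  /* code point, smallest legal value */
        size_t k;

        if (c < 0x80) { i++; continue; }   /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (len - i <= extra) return 0;    /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;           /* overlong, out of range, or surrogate */
        i += extra + 1;
    }
    return 1;
}

/* Step 1: look for a BOM. Step 2: try UTF-8. Step 3: fall back to ANSI. */
script_encoding detect_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16BE;
    if (is_valid_utf8(buf, len))
        return ENC_UTF8;
    return ENC_ANSI;
}
```

Pure ASCII input is reported as UTF-8 here, which is harmless since ASCII is a subset of UTF-8. What "decode in the system ANSI locale" means in practice is left to the caller.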

Whatever the result, I would suggest just converting your scripting interface to run on UTF-8 internally. You may break some scripts, but you'll also avoid the (frankly, very broken) case where you have, say, a UTF-8 script using an "ANSI" module.

With regard to luaL_loadfile: the only built-in Lua functions which load a file are

loadfile
dofile
package.searchers[2], as used by require (you might want to check that index; a glance at the documentation suggests it is probably right)

It should be relatively easy to re-implement these in a fully Unicode-aware manner (i.e. following the rules suggested above).
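As a rough idea of the smallest building block for such a replacement, the sketch below skips a leading UTF-8 BOM in a buffer that has already been read from disk. The function name is made up for this example. The assumption is that the host then passes the returned pointer and length to luaL_loadbuffer instead of calling luaL_loadfile on the path; any conversion from UTF-16 or ANSI to UTF-8 would have to happen before this step.

```c
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the chunk text after an
   optional UTF-8 BOM and stores the remaining length in *out_len.
   The caller keeps ownership of buf. */
const char *skip_utf8_bom(const char *buf, size_t len, size_t *out_len)
{
    const unsigned char *u = (const unsigned char *)buf;

    if (len >= 3 && u[0] == 0xEF && u[1] == 0xBB && u[2] == 0xBF) {
        buf += 3;   /* step over EF BB BF */
        len -= 3;
    }
    *out_len = len;
    return buf;
}
```

A replacement for loadfile or dofile would wrap this with its own file reading and error reporting, and a replacement searcher would do the same after locating the module file.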
