On Thu, Dec 07, 2006 at 03:44:05PM -0500, Brian Weed wrote:
> Asko Kauppi wrote:
> >But there may be some identifier "stamp" that can be used to know a 
> >file is UTF-8, no?
> There are two that I know of.  I don't know how "standard" they are.  
> One is called a BOM Header, which is some binary code in the first 2 
> bytes of the "text" file.

Three: 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF).  Don't use that
unless you're writing Windows-specific stuff and you really need to be
compatible with other Windows applications that expect it.  It's not
"binary" any more than any other UTF-8 character, but text file
encodings do not have headers!  (And if you--the reader, not Brian
Weed--do use this, make it a save-time option and disable it by default
if possible.)
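
For anyone who has to read files that may carry those three bytes, a
minimal sketch in Python of tolerating them on input could look like
the following.  The function name strip_utf8_bom is made up for
illustration; codecs.BOM_UTF8 is the standard library's constant for
0xEF 0xBB 0xBF.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper: finding the BOM is a hint that the file is UTF-8,
    but not finding it says nothing either way.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

This only handles reading; whether to write the bytes at all is the
separate save-time choice.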

> The other is the occurrence of this text 
> "charset=utf-8", anywhere in the file (at least according to the editor 
> I use: UltraEdit).

What if a Japanese writer is explaining, in a Shift-JIS document, how to
use this feature?  "charset=utf-8" can legitimately appear in text files
of any
encoding.  This email is not UTF-8, but it contains that string.  :)
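
To make that concrete, here is a rough Python sketch of the
marker-string check misfiring.  It assumes the check is a plain
substring search (I don't know what UltraEdit actually does), and
has_utf8_marker is a hypothetical name.

```python
MARKER = b"charset=utf-8"

def has_utf8_marker(data: bytes) -> bool:
    """Naive check (an assumption, not any editor's real logic)."""
    return MARKER in data

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A Shift-JIS document that mentions the marker while explaining it.
sjis_doc = "charset=utf-8 の説明".encode("shift_jis")

marker_found = has_utf8_marker(sjis_doc)   # the marker is there...
actually_utf8 = is_valid_utf8(sjis_doc)    # ...but the bytes are not UTF-8
```

The ASCII part of the file is byte-for-byte identical in both encodings,
so the marker is found even though the Japanese text is not valid UTF-8.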

There is no portable way to tell for sure whether a file is UTF-8.  If
you don't know the encoding of a file, you can only guess, but every
guessing mechanism can guess wrong.
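
As an illustration of a guess going wrong, the sketch below uses one
common heuristic: treat the file as UTF-8 if it decodes cleanly.  This
is just one possible guessing mechanism, written in Python, and
looks_like_utf8 is a made-up name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess: True if the bytes decode as strict UTF-8.

    This only shows the bytes *could* be UTF-8, not that they were
    written as UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# The Latin-1 text "Ã©" is stored as the bytes C3 A9, which also happen
# to be the valid UTF-8 encoding of "é".
latin1_bytes = "Ã©".encode("latin-1")
guess = looks_like_utf8(latin1_bytes)      # True, yet the file was Latin-1
misread = latin1_bytes.decode("utf-8")     # decodes to "é", not "Ã©"
```

Pure ASCII files pass the same check too, and they are equally valid in
most other encodings.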

-- 
Glenn Maynard