Re: question about Unicode

lua-l archive


Subject: Re: question about Unicode
From: Glenn Maynard <glenn@...>
Date: Thu, 7 Dec 2006 17:25:22 -0500

On Thu, Dec 07, 2006 at 03:44:05PM -0500, Brian Weed wrote:
> Asko Kauppi wrote:
> >But there may be some identifier "stamp" that can be used to know a 
> >file is UTF-8, no?
> There are two that I know of.  I don't know how "standard" they are.  
> One is called a BOM Header, which is some binary code in the first 2 
> bytes of the "text" file.

Three bytes: 0xEF 0xBB 0xBF.  Don't use that unless you're writing
Windows-specific stuff and you really need to be compatible with
other Windows applications that expect it--it's not "binary" any
more than any other UTF-8 character, but text file encodings do not
have headers!  (And if you--the reader, not Brian Weed--do use this,
make it a save-time option and disable it by default if possible.)
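
A minimal Python sketch of the point above, for illustration only: the
three bytes are just U+FEFF encoded as UTF-8, a reader can tolerate and
strip them, and a writer can leave them off unless asked.  The names
split_bom and encode_text are hypothetical helpers, not part of any
library; codecs.BOM_UTF8 is the standard-library constant.

```python
import codecs
from typing import Tuple

UTF8_BOM = codecs.BOM_UTF8  # the three bytes 0xEF 0xBB 0xBF


def split_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, payload) with a leading UTF-8 BOM removed if present."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data


def encode_text(text: str, write_bom: bool = False) -> bytes:
    """Encode text as UTF-8; the BOM is a save-time option, off by default."""
    payload = text.encode("utf-8")
    return UTF8_BOM + payload if write_bom else payload


# The "header" is an ordinary encoded character, U+FEFF.
assert "\ufeff".encode("utf-8") == UTF8_BOM
```

Note that the absence of a BOM says nothing about the encoding, so
split_bom only answers "was the marker there", not "is this UTF-8".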

> The other is the occurrence of this text 
> "charset=utf-8", anywhere in the file (at least according to the editor 
> I use: UltraEdit).

What if a Japanese writer is explaining, in a Shift-JIS file, how to use this
feature?  "charset=utf-8" can legitimately appear in text files of any
encoding.  This email is not UTF-8, but it contains that string.  :)
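
To make the objection concrete, here is a small Python sketch (an
illustration under the assumption that the editor's check is a plain
substring search, which is a guess at its behavior).  It builds a
Shift-JIS byte string that contains the marker and is not valid UTF-8.
The helper names has_marker and is_valid_utf8 are made up for this
example.

```python
MARKER = b"charset=utf-8"


def has_marker(data: bytes) -> bool:
    """Naive check: does the marker text appear anywhere in the file?"""
    return MARKER in data


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "\u65e5\u672c\u8a9e" is the Japanese word for "Japanese" (nihongo).
# The marker itself is pure ASCII, so it survives Shift-JIS encoding
# byte for byte.
sjis_bytes = "\u65e5\u672c\u8a9e charset=utf-8".encode("shift_jis")

print(has_marker(sjis_bytes))     # the marker is found
print(is_valid_utf8(sjis_bytes))  # yet the file is not UTF-8
```

The marker check succeeds while strict UTF-8 decoding fails, so a tool
that trusts the marker would misread this file.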

There is no portable way to tell for sure whether a file is UTF-8.  If
you don't know the encoding of a file, you can only guess, but every
guessing mechanism can guess wrong.
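
As one example of a guess that can go wrong, consider the common
heuristic "if it decodes as strict UTF-8, call it UTF-8".  The Python
sketch below is only an illustration of that heuristic, not a
recommended detector; looks_like_utf8 is a hypothetical name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic guess: the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A Latin-1 file containing the two characters U+00C3 U+00A9 is stored
# as the bytes 0xC3 0xA9, which also happen to be the valid UTF-8
# encoding of the single character U+00E9.
latin1_bytes = "\u00c3\u00a9".encode("latin-1")

guess = looks_like_utf8(latin1_bytes)      # the heuristic says "UTF-8"
misread = latin1_bytes.decode("utf-8")     # one character, not the two written

# Pure ASCII is valid in many encodings at once, so the guess is
# uninformative there as well.
ascii_guess = looks_like_utf8(b"plain ASCII text")
```

The heuristic is useful in practice because random non-UTF-8 text
rarely validates, but as the example shows it is still a guess.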

-- 
Glenn Maynard

Follow-Ups:
- Re: question about Unicode, Russ Cox
- Re: question about Unicode, Robert Raschke

References:
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Jones
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Given
- Re: question about Unicode, Rici Lake
- Re: question about Unicode, Roberto Ierusalimschy
- Re: Re: question about Unicode, Ken Smith
- Re: question about Unicode, Adrian Perez
- Re: question about Unicode, Asko Kauppi
- Re: question about Unicode, Brian Weed
