Re: utf8.len and BOM

From the Unicode standard:

>> The serialized order of the bytes must not depart from the order defined by the UTF-8
>> encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may
>> be encountered in contexts where UTF-8 data is converted from other encoding forms that
>> use a BOM or where the BOM is used as a UTF-8 signature.

But, in a nutshell: having a BOM breaks Unix utilities, while not having one might break Windows ones.
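
As for the utf8.len question quoted below: the BOM is just the code point U+FEFF encoded as three bytes, so counting it as a character is consistent with the library counting code points. If you want a length that ignores a leading BOM, one option is to strip it yourself before counting. Here is a minimal sketch, assuming Lua 5.3 or later; strip_bom and len_without_bom are hypothetical helper names, not part of the utf8 library:

```lua
-- UTF-8 encoding of U+FEFF, the byte order mark.
local BOM = "\xEF\xBB\xBF"

-- Hypothetical helper: return s without a leading UTF-8 BOM, if one is present.
local function strip_bom(s)
  if s:sub(1, 3) == BOM then
    return s:sub(4)
  end
  return s
end

-- Hypothetical helper: count code points, ignoring a leading BOM.
local function len_without_bom(s)
  return utf8.len(strip_bom(s))
end

print(utf8.len("\xEF\xBB\xBFa"))        --> 2
print(len_without_bom("\xEF\xBB\xBFa")) --> 1
```

Only a BOM at the very start of the string is removed; a U+FEFF anywhere else is left alone and still counted.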

On Fri, Jan 16, 2015 at 6:17 PM, Coda Highland <chighland@gmail.com> wrote:

On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <rjek@rjek.com> wrote:
> On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
>> Is it by design that utf8.len counts the BOM toward the length?
>>
>> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
>
> Given UTF-8 has only one valid "byte order", it makes no sense to ever
> include a byte order mark in a UTF-8 document.
>

Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
Microsoft) as a magic number to identify the contents of the file as
UTF-8 text. The XML spec even explicitly supports this (although many
XML parsers do not).

/s/ Adam