- Subject: Re: utf8.len and BOM
- From: Coda Highland <chighland@...>
- Date: Fri, 16 Jan 2015 09:29:16 -0800
On Fri, Jan 16, 2015 at 9:21 AM, Rob Kendrick <firstname.lastname@example.org> wrote:
> On Fri, Jan 16, 2015 at 09:17:08AM -0800, Coda Highland wrote:
>> On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <email@example.com> wrote:
>> > On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
>> >> Is it by design that utf8.len counts the BOM toward the length?
>> >> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
>> > Given UTF8 has only one valid "byte order", it makes no sense to ever
>> > include a byte order marker in a UTF8 document.
>> Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
>> Microsoft) as a magic number to identify the contents of the file as
>> UTF-8 text.
> Lots of things aggressively promoted by Microsoft are mistakes.
> No BOM -> content is UTF8-encoded.
I didn't say it was a good idea. ;) Only that there are circumstances
under which it could, theoretically, make sense.
That said, it still pisses me off how many common XML parsers bomb out
on something supported by the spec.
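
For reference, a minimal sketch of the behaviour asked about at the top of
the thread, assuming Lua 5.3 or later (where the utf8 library exists). The
BOM is just the UTF-8 encoding of U+FEFF, so utf8.len counts it like any
other character. The strip_bom helper is hypothetical and not part of the
standard library; it only shows one way a caller might drop a leading BOM
before measuring.

```lua
-- Assumes Lua 5.3+ (utf8 library and \x escapes in string literals).
local BOM = "\xEF\xBB\xBF"  -- UTF-8 encoding of U+FEFF

-- Hypothetical helper: return s without a leading UTF-8 BOM, if present.
local function strip_bom(s)
  if s:sub(1, #BOM) == BOM then
    return s:sub(#BOM + 1)
  end
  return s
end

local s = BOM .. "a"
print(utf8.len(s))             -- BOM is counted as one character
print(utf8.len(strip_bom(s)))  -- only "a" remains
```

Whether to strip the BOM is left to the caller here, since utf8.len itself
has no notion of file-level markers.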