Re: utf8.len and BOM

lua-l archive


Subject: Re: utf8.len and BOM
From: Coda Highland <chighland@...>
Date: Fri, 16 Jan 2015 09:29:16 -0800

On Fri, Jan 16, 2015 at 9:21 AM, Rob Kendrick <rjek@rjek.com> wrote:
> On Fri, Jan 16, 2015 at 09:17:08AM -0800, Coda Highland wrote:
>> On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <rjek@rjek.com> wrote:
>> > On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
>> >> Is it by design that utf8.len counts the BOM toward the length?
>> >>
>> >> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
>> >
>> > Given UTF8 has only one valid "byte order", it makes no sense to ever
>> > include a byte order marker in a UTF8 document.
>> >
>>
>> Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
>> Microsoft) as a magic number to identify the contents of the file as
>> UTF-8 text.
>
> Lots of things aggressively promoted by Microsoft are mistakes.
>
> No BOM -> content is UTF8-encoded.
>

I didn't say it was a good idea. ;) Only that there are circumstances
under which it could, theoretically, make sense.

That said, it still pisses me off how many common XML parsers bomb out
on something supported by the spec.
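
For the original question, here is a minimal sketch of counting code
points while ignoring a leading BOM. It assumes Lua 5.3's utf8 library;
strip_bom and len_without_bom are made-up helper names for illustration,
not part of the standard library.

```lua
-- Requires Lua 5.3+ for the utf8 library.
local BOM = "\xEF\xBB\xBF"  -- U+FEFF encoded as UTF-8

-- Hypothetical helper: drop a single leading BOM, if present.
local function strip_bom(s)
  if s:sub(1, #BOM) == BOM then
    return s:sub(#BOM + 1)
  end
  return s
end

-- Hypothetical helper: count code points, ignoring a leading BOM.
local function len_without_bom(s)
  return utf8.len(strip_bom(s))
end

print(utf8.len("\xEF\xBB\xBFa"))        --> 2 (the BOM is a valid code point)
print(len_without_bom("\xEF\xBB\xBFa")) --> 1
```

Only a BOM at the very start is removed; a U+FEFF elsewhere in the
string is left alone and still counted.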

/s/ Adam