expat aborts on a UTF-8 BOM, which puts it in violation of the XML spec
(the spec allows a BOM at the start of a UTF-8-encoded document).
Drives me bonkers; I can't let anyone on Windows edit my XML files.
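
A minimal Lua sketch of the obvious workaround, stripping the BOM myself
before the text ever reaches expat (the helper name is just for
illustration; nothing binding-specific):

  local function strip_utf8_bom(s)
    -- U+FEFF encoded as UTF-8 is the byte sequence EF BB BF
    if s:sub(1, 3) == "\239\187\191" then
      return s:sub(4)
    end
    return s
  end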

/s/ Adam

On Mon, May 11, 2015 at 6:21 PM, Jay Carlson <nop@nop.com> wrote:
> Note that conforming XML applications MUST (in the formal sense) immediately
> stop processing at non-Chars. This is stricter than UTF-8 or Unicode: there
> is no way to represent the codepoint zero (U+0000) in XML.
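>
> To make that concrete, here is the XML 1.0 Char production as a rough
> Lua check (a sketch; note that U+0000 simply is not in the production):
>
>   local function is_xml_char(cp)
>     -- Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>     --        | [#x10000-#x10FFFF]
>     return cp == 0x9 or cp == 0xA or cp == 0xD
>         or (cp >= 0x20 and cp <= 0xD7FF)
>         or (cp >= 0xE000 and cp <= 0xFFFD)
>         or (cp >= 0x10000 and cp <= 0x10FFFF)
>   end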
>
> In 2015, nearly everybody is subsetting XML, so the full standards are
> getting weaker. Internal DOCTYPEs are just asking for trouble. Nobody wants
> to implement the whole thing.
>
> What the well-formedness constraints like charset mean in the real world:
> you can't complain if somebody downstream from you *does* strictly abort on
> your bad output, and there are many applications which will do so
> automatically because of tooling. Anything using expat, for example.
>
> If you don't abort, perhaps on the grounds of "be liberal in what you
> accept," you can get really nailed on "be conservative in what you send."
>
> Jay
>
> (who is stuck using a phone, because his second-oldest SSD is now dying a
> very weird death)
>
> On May 11, 2015 5:59 PM, "Coda Highland" <chighland@gmail.com> wrote:
>>
>> On Mon, May 11, 2015 at 2:31 PM, Tim Hill <drtimhill@gmail.com> wrote:
>> >
>> > On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:
>> >
>> > Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course
>> > encodes to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a
>> > big-endian UTF-16 sequence into UTF-8 and emit a byte-swapped encoding
>> > for the BOM?
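>> >
>> > Concretely (a quick Lua 5.3 sketch; utf8.char encodes whatever
>> > codepoint you hand it, without validating):
>> >
>> >   local u = utf8.char
>> >   print(("%02X %02X %02X"):format(u(0xFEFF):byte(1, 3)))  --> EF BB BF
>> >   print(("%02X %02X %02X"):format(u(0xFFFE):byte(1, 3)))  --> EF BF BE
>> >
>> > So a byte-swapped BOM would show up in the UTF-8 as EF BF BE rather
>> > than EF BB BF.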
>> >
>> >
>> > Are you saying that the encoder actually emitted U+FFFE instead of
>> > U+FEFF? Ugh.
>> >
>> >
>> > The problem is that in the early days of Unicode, a 16-bit codepoint
>> > space was assumed and UCS2 was the assumed encoding, where a single
>> > UCS2 16-bit code value was taken to be a single codepoint. That made
>> > string length computation easy, etc. Then Unicode overflowed 16 bits
>> > for codepoints and UTF-16 with surrogates was invented. This means a
>> > lot of old code simply assumed (and still does) that a UCS2 encoding
>> > *is* just an array of codepoints, and that encoding to UTF-8 is just
>> > encoding UCS2 code values one at a time .. urgh. So surrogates slip
>> > through into the UTF-8 stream, and so can a BOM, byte-swapped if the
>> > source happened to be big-endian.
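>> >
>> > A sketch of exactly that failure mode in Lua (illustrative names; the
>> > per-code-value encode is utf8.char, which does not reject surrogates):
>> >
>> >   local function naive_ucs2_to_utf8(units)
>> >     local out = {}
>> >     for _, u in ipairs(units) do
>> >       out[#out + 1] = utf8.char(u)  -- each 16-bit value as a codepoint
>> >     end
>> >     return table.concat(out)
>> >   end
>> >
>> >   -- U+1F600 as UTF-16 code units: D83D DE00
>> >   print(#naive_ucs2_to_utf8{0xD83D, 0xDE00})  --> 6 bytes, not 4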
>> >
>> > —Tim
>> >
>>
>> Oh, no, I get THAT much. That's the easy part to understand. The hard
>> part to understand is how the data got byte-swapped in the first
>> place. It implies that the data isn't even being treated as an array
>> of codepoints, just as an array of uint16s. It further implies that
>> the UTF-8 was generated by a system that was looking at what appeared
>> to be garbage data to begin with.
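>>
>> Something along these lines must be happening upstream, presumably (a
>> sketch; Lua 5.3's string.unpack reading the big-endian BOM as a native
>> little-endian uint16):
>>
>>   local unit = string.unpack("<I2", string.char(0xFE, 0xFF))
>>   print(("%04X"):format(unit))  --> FFFE, a byte-swapped U+FEFF
>>
>> Feed that 0xFFFE to a UCS2-as-codepoints encoder and out comes the
>> swapped BOM.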
>>
>> /s/ Adam
>>
>