... at least I think it's expat... maybe it's libxml2... It's some
common XML library that barfs on me.

/s/ Adam

On Mon, May 11, 2015 at 8:49 PM, Coda Highland <chighland@gmail.com> wrote:
> expat aborts on a UTF-8 BOM, which puts it in violation of the XML spec.
> Drives me bonkers; I can't let anyone on Windows edit my XML files.
>
> /s/ Adam
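
For anyone who wants to sidestep this: a minimal Lua sketch of the usual
workaround is to strip a leading UTF-8 BOM (the bytes EF BB BF) before the
buffer ever reaches the parser. Plain Lua string methods only; read_file is
a hypothetical helper here, not part of expat's or LuaExpat's API.

local function strip_utf8_bom(s)
  -- a UTF-8 BOM is the three bytes EF BB BF at the very start
  if s:byte(1) == 0xEF and s:byte(2) == 0xBB and s:byte(3) == 0xBF then
    return s:sub(4)
  end
  return s
end

-- usage: local doc = strip_utf8_bom(read_file("thing.xml"))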
>
> On Mon, May 11, 2015 at 6:21 PM, Jay Carlson <nop@nop.com> wrote:
>> Note that conforming XML applications MUST (in the formal sense) immediately
>> stop processing at non-Chars. This is stricter than UTF-8 or Unicode: there
>> is no way to represent codepoint zero in XML.
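
For concreteness, the Char production being referred to boils down to a
handful of ranges; a rough Lua 5.3 sketch of the check (the stock utf8
library is assumed) looks like this:

-- XML 1.0 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
--                | [#x10000-#x10FFFF]; codepoint zero is simply excluded.
local function is_xml_char(cp)
  return cp == 0x9 or cp == 0xA or cp == 0xD
      or (cp >= 0x20    and cp <= 0xD7FF)
      or (cp >= 0xE000  and cp <= 0xFFFD)
      or (cp >= 0x10000 and cp <= 0x10FFFF)
end

local function check_chars(doc)
  -- a conforming processor must stop at the first offender
  for _, cp in utf8.codes(doc) do
    if not is_xml_char(cp) then error("not an XML Char: stop processing") end
  end
end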
>>
>> In 2015, nearly everybody is subsetting XML, so the full standards are
>> getting weaker. Internal DOCTYPEs are just asking for trouble. Nobody wants
>> to implement the whole thing.
>>
>> What well-formedness constraints like the charset rules mean in the real
>> world: you can't complain if somebody downstream from you *does* strictly
>> abort on your bad output, and there are many applications which will do so
>> automatically because of tooling. Anything using expat, for example.
>>
>> If you don't abort, perhaps on the grounds of "be liberal in what you
>> accept," you can get really nailed on "be conservative in what you send."
>>
>> Jay
>>
>> (who is stuck using a phone, because his second-oldest SSD is now dying a
>> very weird death)
>>
>> On May 11, 2015 5:59 PM, "Coda Highland" <chighland@gmail.com> wrote:
>>>
>>> On Mon, May 11, 2015 at 2:31 PM, Tim Hill <drtimhill@gmail.com> wrote:
>>> >
>>> > On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:
>>> >
>>> > Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course
>>> > encodes
>>> > to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a big-endian
>>> > UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM?
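
(For reference, the two sequences in question are easy to tell apart by eye.
Lua 5.3's utf8.char will happily produce both, since it only range-checks its
argument and doesn't reject noncharacters, so a quick sketch:

local bom      = utf8.char(0xFEFF)  -- what a correct encoder emits
local anti_bom = utf8.char(0xFFFE)  -- what a byte-swapped read turns into
print(("%02X %02X %02X"):format(bom:byte(1, 3)))       --> EF BB BF
print(("%02X %02X %02X"):format(anti_bom:byte(1, 3)))  --> EF BF BE

So EF BF BE at the top of a "UTF-8" file is the fingerprint of exactly this
kind of encoder.)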
>>> >
>>> >
>>> > Are you saying that the encoder actually emitted U+FFFE instead of
>>> > U+FEFF?
>>> > Ugh.
>>> >
>>> >
>>> > The problem is that in the early days of Unicode, a 16-bit codepoint
>>> > space was assumed and UCS2 was the de facto encoding, where a single
>>> > UCS2 16-bit code value was taken to be a single codepoint. Made string
>>> > length computation easy etc. Then Unicode overflowed 16 bits for
>>> > codepoints and UTF-16 with surrogates was invented. This means a lot of
>>> > old code simply assumed (and still does) that a UCS2 encoding *is* just
>>> > an array of codepoints. And so encoding to UTF-8 is assumed to just be
>>> > encoding each UCS2 value independently... urgh. So surrogates slip
>>> > through into the UTF-8 stream, and so can a BOM, even a byte-swapped
>>> > one from a big-endian stream.
>>> >
>>> > —Tim
>>> >
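
A sketch of the mistake Tim describes, in Lua 5.3 terms: convert one 16-bit
code unit at a time straight to UTF-8, with no surrogate pairing and no
validation. (Here units is assumed to be an array of 16-bit values already
pulled out of a UTF-16 buffer; utf8.char only range-checks, so it won't
object to surrogate halves or to FFFE.)

local function naive_ucs2_to_utf8(units)
  local out = {}
  for _, u in ipairs(units) do
    out[#out + 1] = utf8.char(u)  -- one code unit -> one "character", no pairing
  end
  return table.concat(out)
end

-- U+1F600 should be F0 9F 98 80, but it arrives as the surrogate pair
-- D83D DE00 and comes out as ED A0 BD ED B8 80; a byte-swapped BOM (FFFE)
-- sails through as EF BF BE the same way.
print(naive_ucs2_to_utf8{0xD83D, 0xDE00})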
>>>
>>> Oh, no, I get THAT much. That's the easy part to understand. The hard
>>> part to understand is how the data got byte-swapped in the first
>>> place. It implies that it isn't even being treated as an array of
>>> codepoints, but just an array of uint16s. It further implies that the
>>> UTF-8 was generated by a system that would have been looking at what
>>> appeared to be garbage data in the first place.
>>>
>>> /s/ Adam
>>>
>>