Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)

Note that conforming XML applications MUST (in the formal sense) immediately stop processing at non-Chars. This is stricter than UTF-8 or Unicode: there is no way to represent a codepoint zero in XML.
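
To make "non-Char" concrete, here is a rough Python sketch of the Char production from XML 1.0 (Fifth Edition), section 2.2. The names `is_xml_char` and `first_xml_char_error` are made up for illustration and do not come from rapidxml, expat, or any other library. It only looks at the raw bytes of a document and does not parse markup or character references.

```python
# XML 1.0 (Fifth Edition), section 2.2, Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = (
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(cp):
    """True if the codepoint matches the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


def first_xml_char_error(data):
    """Hypothetical helper: describe the first problem in `data`, or return None."""
    try:
        text = data.decode("utf-8")  # strict: rejects encoded surrogates too
    except UnicodeDecodeError as exc:
        return "invalid UTF-8 at byte %d" % exc.start
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return "non-Char U+%04X at character %d" % (ord(ch), index)
    return None
```

With this, `first_xml_char_error(b"a\x00b")` reports U+0000 even though the bytes are perfectly valid UTF-8, which is the "stricter than UTF-8" part.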

In 2015, nearly everybody is subsetting XML, so the full standard carries less and less weight in practice. Internal DOCTYPEs are just asking for trouble. Nobody wants to implement the whole thing.

What well-formedness constraints like the character set mean in the real world: you can't complain if somebody downstream from you *does* strictly abort on your bad output, and many applications will do so automatically because of their tooling. Anything using expat, for example.

If you don't abort, perhaps on the grounds of "be liberal in what you accept," you can get really nailed on "be conservative in what you send."

(who is stuck using a phone, because his second-oldest SSD is now dying a very weird death)

On May 11, 2015 5:59 PM, "Coda Highland" <chighland@gmail.com> wrote:

On Mon, May 11, 2015 at 2:31 PM, Tim Hill <drtimhill@gmail.com> wrote:
>
> On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:
>
> Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course encodes
> to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a big-endian
> UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM?
>
>
> Are you saying that the encoder actually emitted U+FFFE instead of U+FEFF?
> Ugh.
>
>
> The problem is that in the early days of Unicode, a 16-bit codepoint space
> was assumed and UCS2 was the assumed encoding, where a single UCS2 16-bit
> code value was assumed to be a single codepoint. Made string length
> computation easy etc. Then Unicode overflowed 16-bits for codepoints and
> UTF-16 with surrogates was invented. This means a lot of old code simply
> assumed (and still does) that a UCS2 encoding *is* just an array of
> codepoints. And so encoding to UTF-8 is assumed to just be encoded UCS2 ..
> urgh. So surrogates slip through into the UTF-8 stream, and so can a BOM
> even if it’s encoded big-endian.
>
> —Tim
>
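
A small Python sketch of the failure Tim describes, assuming an encoder that treats every 16-bit code unit as a whole codepoint. The function names are hypothetical and are not taken from any real encoder.

```python
def naive_ucs2_to_utf8(units):
    """Encode each 16-bit unit as if it were a complete codepoint (the bug)."""
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
        else:
            # Surrogates D800..DFFF land here and are encoded like any other unit.
            out += bytes((0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)))
    return bytes(out)


def is_strict_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+1F600 is the surrogate pair D83D DE00 in UTF-16.
leaked = naive_ucs2_to_utf8([0xD83D, 0xDE00])   # two 3-byte sequences
proper = "\U0001F600".encode("utf-8")           # one 4-byte sequence
```

Here `leaked` is ED A0 BD ED B8 80, which a strict decoder rejects, while `proper` is F0 9F 98 80. A BOM passed through the same function comes out as the usual EF BB BF.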

Oh, no, I get THAT much. That's the easy part to understand. The hard
part to understand is how the data got byte-swapped in the first
place. It implies that it isn't even being treated as an array of
codepoints, but just an array of uint16s. It further implies that the
UTF-8 was generated by a system that would have been looking at what
appeared to be garbage data in the first place.
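
As a rough Python illustration of that point (the helper name `misread_utf16` is invented for this sketch): big-endian UTF-16 read as little-endian uint16s turns the BOM into U+FFFE and everything after it into unrelated codepoints, and only then does the UTF-8 step happen.

```python
def misread_utf16(text, written="big", read="little"):
    """Encode text as UTF-16 in one byte order, read the uint16s back in another."""
    codec = "utf-16-be" if written == "big" else "utf-16-le"
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + 2], read) for i in range(0, len(data), 2)]


# BOM followed by "A", written big-endian, read as little-endian uint16s.
units = misread_utf16("\ufeffA")

# The swapped BOM is the noncharacter U+FFFE; "A" (0x0041) becomes 0x4100.
swapped_bom_utf8 = chr(units[0]).encode("utf-8")
garbage_utf8 = chr(units[1]).encode("utf-8")
```

`units` comes out as `[0xFFFE, 0x4100]`, and `swapped_bom_utf8` is EF BF BE rather than EF BB BF, so the text following the "BOM" would already have looked wrong to whatever produced the UTF-8.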

/s/ Adam