Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)

lua-l archive

Subject: Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)
From: Tim Hill <drtimhill@...>
Date: Mon, 11 May 2015 14:31:23 -0700

On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:

>> Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course encodes
>> to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a big-endian
>> UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM?
>
> Are you saying that the encoder actually emitted U+FFFE instead of U+FEFF? Ugh.

The problem is that in the early days of Unicode, a 16-bit codepoint space was assumed and UCS-2 was the assumed encoding, where a single 16-bit UCS-2 code value was assumed to be a single codepoint. That made string length computation easy, etc. Then Unicode overflowed 16 bits for codepoints, and UTF-16 with surrogates was invented. This means a lot of old code simply assumed (and still does) that a UCS-2 encoding *is* just an array of codepoints. And so encoding to UTF-8 is assumed to be just a matter of encoding each UCS-2 value in turn .. urgh. So surrogates slip through into the UTF-8 stream, and so can a BOM, even if it was encoded big-endian.
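
To make the failure mode concrete, here is a minimal Python sketch of such a "dumb" converter. The function names (encode_unit, naive_ucs2_to_utf8, is_strict_utf8) are made up for illustration and are not taken from rapidxml, the xml module, or any real encoder. It treats every 16-bit value as a codepoint, so surrogate halves come out as three-byte sequences that a strict decoder rejects, and a BOM read with the wrong byte order comes out as U+FFFE. Note that U+FFFE is a noncharacter rather than an ill-formed sequence, so a strict UTF-8 decoder (Python's, at least) accepts it; it has to be checked for explicitly.

```python
def encode_unit(cp):
    """Encode one integer below 0x10000 as UTF-8 bytes, with no validity checks."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_ucs2_to_utf8(data, byteorder="little"):
    """The UCS-2 assumption: every 16-bit unit is taken to be a codepoint."""
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        out += encode_unit(unit)  # surrogates and U+FFFE pass straight through
    return bytes(out)


def is_strict_utf8(b):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The real BOM, U+FEFF, encodes to EF BB BF.
    print(encode_unit(0xFEFF).hex())

    # A big-endian BOM (FE FF) read as little-endian becomes U+FFFE.
    swapped = naive_ucs2_to_utf8(b"\xfe\xff", "little")
    print(swapped.hex(), is_strict_utf8(swapped))

    # U+1F600 is a surrogate pair in UTF-16; each half gets encoded on its own.
    pair = "\U0001F600".encode("utf-16-be")
    leaked = naive_ucs2_to_utf8(pair, "big")
    print(leaked.hex(), is_strict_utf8(leaked))
```

A consumer that wants to catch both cases needs a strict decode for the surrogates plus a separate test for a leading EF BF BE.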

—Tim

