On Mon, May 11, 2015 at 2:31 PM, Tim Hill <drtimhill@gmail.com> wrote:
>
> On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:
>
> Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course encodes
> to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a big-endian
> UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM?
>
>
> Are you saying that the encoder actually emitted U+FFFE instead of U+FEFF?
> Ugh.
>
>
> The problem is that in the early days of Unicode a 16-bit codepoint space
> was assumed, with UCS2 as the assumed encoding, where a single UCS2 16-bit
> code value was taken to be a single codepoint. Made string length
> computation easy etc. Then Unicode overflowed 16 bits for codepoints and
> UTF-16 with surrogates was invented. This means a lot of old code simply
> assumed (and still does) that a UCS2 encoding *is* just an array of
> codepoints, and that encoding to UTF-8 just means encoding each UCS2 code
> value .. urgh. So surrogates slip through into the UTF-8 stream, and so
> can a BOM even if it’s encoded big-endian.
>
> —Tim
>
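
In code, the failure mode Tim describes looks roughly like the sketch
below (a hypothetical encoder, assuming Lua 5.3+ for utf8.char; it is
not any particular library's code): every 16-bit code unit is treated
as a codepoint and pushed straight through the UTF-8 encoder, with no
surrogate pairing and no special handling of U+FEFF/U+FFFE.

-- Hypothetical "dumb encoder": UCS2-era conversion of 16-bit code units
-- straight to UTF-8, one unit at a time.
local function naive_ucs2_to_utf8(units)
  local out = {}
  for _, unit in ipairs(units) do
    -- utf8.char encodes whatever integer it is given, so surrogates and
    -- BOM/U+FFFE code units land in the output verbatim.
    out[#out + 1] = utf8.char(unit)
  end
  return table.concat(out)
end

-- Helper to show the resulting bytes in hex.
local function hex(s)
  return (s:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end

-- UTF-16 code units for a BOM followed by U+1F600 as a surrogate pair.
print(hex(naive_ucs2_to_utf8{0xFEFF, 0xD83D, 0xDE00}))
--> EF BB BF ED A0 BD ED B8 80
-- The BOM and both surrogates come out as three 3-byte sequences
-- (CESU-8 style) instead of EF BB BF followed by the single 4-byte
-- sequence F0 9F 98 80 for U+1F600.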

Oh, no, I get THAT much. That's the easy part to understand. The hard
part to understand is how the data got byte-swapped in the first
place. It implies that it isn't even being treated as an array of
codepoints, but just as an array of uint16s. It further implies that
the UTF-8 was generated by a system that was looking at what appeared
to be garbage data.
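
To illustrate that last step, here is a sketch under the same
assumptions (hypothetical code, Lua 5.3+ for string.unpack): reading a
big-endian UTF-16 buffer as bare little-endian uint16s turns the BOM
U+FEFF into U+FFFE and every following unit into nonsense, and feeding
those "codepoints" to an encoder like the one above puts EF BF BE into
the UTF-8 stream.

-- U+FEFF, 'H', 'i' encoded as UTF-16BE.
local utf16be = "\254\255\0H\0i"

-- Split a byte string into 16-bit units using a string.unpack format
-- ("<I2" = little-endian, ">I2" = big-endian).
local function units(s, fmt)
  local t, pos = {}, 1
  while pos <= #s do
    local u
    u, pos = string.unpack(fmt, s, pos)
    t[#t + 1] = u
  end
  return t
end

-- Deliberately read the big-endian stream with the wrong byte order.
for _, u in ipairs(units(utf16be, "<I2")) do
  io.write(string.format("U+%04X ", u))
end
print()
--> U+FFFE U+4800 U+6900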

/s/ Adam