Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)


Subject: Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)
From: Tim Hill <drtimhill@...>
Date: Mon, 11 May 2015 13:29:48 -0700

On May 11, 2015, at 11:54 AM, Coda Highland <chighland@gmail.com> wrote:

>> - UTF-8 is sometimes used to encode UTF-16 values (such as BOM), some of which are now accepted. Reject/accept?
>
> ZWNBSP (that is, BOM) is a perfectly legit character in UTF-8. Accept
> it. I get pissed at decoders that choke on it.

Well, that’s true of the ZWNBSP *codepoint* U+FEFF, which of course encodes to 0xEF/0xBB/0xBF. But what about dumb encoders that transcode a big-endian UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM (that is, U+FFFE, which comes out as 0xEF/0xBF/0xBE)?
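
To make the byte-level difference concrete, here is a minimal Python sketch. The helper name `utf8_bytes` is made up for illustration, and the "dumb encoder" is only simulated by reading the big-endian BOM bytes with the wrong byte order; it is not modelled on any particular library.

```python
def utf8_bytes(codepoint: int) -> bytes:
    """Encode a single codepoint as UTF-8 (hypothetical helper)."""
    return chr(codepoint).encode("utf-8")

# The big-endian UTF-16 BOM is the byte pair FE FF.
utf16_be_bom = b"\xfe\xff"

# Read with the correct byte order, the BOM is U+FEFF (ZWNBSP).
correct = int.from_bytes(utf16_be_bom, "big")

# Read with the wrong byte order, it becomes the noncharacter U+FFFE.
swapped = int.from_bytes(utf16_be_bom, "little")

bom_utf8 = utf8_bytes(correct)      # bytes EF BB BF
swapped_utf8 = utf8_bytes(swapped)  # bytes EF BF BE

print(hex(correct), bom_utf8.hex(" "))
print(hex(swapped), swapped_utf8.hex(" "))
```

Both byte sequences are structurally well-formed UTF-8, so a validator that wants to catch the second case has to look at the decoded codepoint, not just at the byte patterns.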

The problem is that UTF-8 *should* be decoded directly: UTF-8 -> codepoint array. Instead it’s (shudder) often decoded as UTF-8 -> UTF-16 -> (byte-swap based on BOM) -> codepoint array.
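
As a rough sketch of the direct path, the following Python goes straight from UTF-8 bytes to a list of codepoints with no UTF-16 stage and no byte-swapping. The function names `decode_codepoints` and `has_swapped_bom` are assumptions for illustration; the sketch leans on Python's built-in strict UTF-8 decoder, which rejects UTF-8-encoded surrogates and overlong forms.

```python
def decode_codepoints(data: bytes) -> list[int]:
    """Decode UTF-8 bytes directly into a list of codepoints.

    Raises UnicodeDecodeError on malformed input, including
    surrogate halves that were pushed through a UTF-8 encoder.
    """
    return [ord(ch) for ch in data.decode("utf-8", errors="strict")]

def has_swapped_bom(codepoints: list[int]) -> bool:
    """True if the stream starts with U+FFFE, the byte-swapped BOM."""
    return bool(codepoints) and codepoints[0] == 0xFFFE

# A leading ZWNBSP/BOM is accepted and shows up as an ordinary codepoint.
good = decode_codepoints(b"\xef\xbb\xbfhi")

# A byte-swapped BOM decodes fine as UTF-8 but can be flagged afterwards.
suspect = decode_codepoints(b"\xef\xbf\xbehi")

# A UTF-16 surrogate half encoded as three UTF-8-style bytes is rejected.
try:
    decode_codepoints(b"\xed\xa0\x80")
    surrogate_rejected = False
except UnicodeDecodeError:
    surrogate_rejected = True
```

Whether a leading U+FFFE should be rejected or merely reported is a policy choice; the sketch only shows that the check is easy once the input is a plain codepoint array.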

It’s one reason I detest Unicode.

—Tim

