Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)

I understand that it sounds good to have some UTF-8 validity check on input (I'm not sure whether it belongs on the encode or the decode end, BTW). But different tools have different goals. Quoting rapidxml:

RapidXml is an attempt to create the fastest XML parser possible, while retaining useability, portability and reasonable W3C compatibility.

If some environment needs the extra security check (probably on XML decoding), it seems easy to write such a validator as a separate Lua library that runs after the parser. Or, if doing it separately really turns out to be a big performance penalty, I could include optional validation in the rapidxml binding, provided I understand enough about UTF-8 validity not to give a feeling of security without actually ensuring it!

http://codereview.stackexchange.com/questions/8406/improving-a-utf-8-validator

A simple Lua library "utf8validator" seems more appropriate, and it would make it easier for people to contribute to making the code really safe, instead of patching every decoding library... If the validator runs before decoding, that's a very small overhead. It's then very easy to "patch" xml in environments that need the validator with:

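    -- assumes the xml binding is loaded as 'xml'; 'utf8validator' does not exist yet
    local xml      = require 'xml'
    local validate = require 'utf8validator'

    -- keep a reference to the original loader
    local orig_load = xml.load

    -- validate the input before handing it to the parser
    -- (the parameter is named 'str' to avoid shadowing the 'string' library)
    function xml.load(str)
      return orig_load(validate(str))
    end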

The "validate" function could take an optional "invalid char handler" function as an argument, letting end users decide what to do with invalid characters instead of blowing up.
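To make the idea concrete, here is a minimal sketch of what such a validator could look like in plain Lua. Everything in it is an assumption for illustration: the names `check` and `validate`, the handler signature `on_invalid(s, pos)`, and the choice to raise an error when no handler is given. It rejects stray or truncated continuation bytes, overlong encodings, encoded surrogates (U+D800 to U+DFFF) and code points above U+10FFFF. It does not check the stricter XML Char range mentioned further down the thread, so a NUL byte still passes.

```lua
-- Hypothetical sketch of a 'utf8validator' module (not an existing library).

-- Returns true if `s` is well-formed UTF-8, otherwise false plus the
-- byte position where the first bad sequence starts.
local function check(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, min, cp
    if b < 0x80 then
      len, min, cp = 1, 0, b                 -- ASCII
    elseif b >= 0xC2 and b <= 0xDF then
      len, min, cp = 2, 0x80, b - 0xC0       -- 0xC0/0xC1 would be overlong
    elseif b >= 0xE0 and b <= 0xEF then
      len, min, cp = 3, 0x800, b - 0xE0
    elseif b >= 0xF0 and b <= 0xF4 then
      len, min, cp = 4, 0x10000, b - 0xF0
    else
      return false, i                        -- stray continuation or invalid lead byte
    end
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)                    -- nil when the string is truncated
      if not c or c < 0x80 or c > 0xBF then return false, i end
      cp = cp * 64 + (c - 0x80)
    end
    -- reject overlong forms, surrogates and values beyond U+10FFFF
    if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      return false, i
    end
    i = i + len
  end
  return true
end

-- Returns `s` unchanged when it is valid. Otherwise returns whatever the
-- optional handler returns, or raises an error if no handler was given.
local function validate(s, on_invalid)
  local ok, pos = check(s)
  if ok then return s end
  if on_invalid then return on_invalid(s, pos) end
  error("invalid UTF-8 at byte " .. pos, 2)
end
```

With this shape, a lenient application could pass a handler that returns a repaired or replacement string, while a strict one would pass nothing and let the error propagate.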

my 2c...

Gaspard

Gaspard Bucher

teti sàrl

On Tue, May 12, 2015 at 5:50 AM, Coda Highland <chighland@gmail.com> wrote:

... at least I think it's expat... maybe it's libxml2... It's some
common XML library that barfs on me.

/s/ Adam

On Mon, May 11, 2015 at 8:49 PM, Coda Highland <chighland@gmail.com> wrote:
> expat aborts on a UTF-8 BOM, which is in violation of the XML spec.
> Drives me bonkers; I can't let anyone on Windows edit my XML files.
>
> /s/ Adam
>
> On Mon, May 11, 2015 at 6:21 PM, Jay Carlson <nop@nop.com> wrote:
>> Note that conforming XML applications MUST (in the formal sense) immediately
>> stop processing at non-Chars. This is stricter than UTF-8 or Unicode: there
>> is no way to represent a codepoint zero in XML.
>>
>> In 2015, nearly everybody is subsetting XML, so the full standards are
>> getting weaker. Internal DOCTYPEs are just asking for trouble. Nobody wants
>> to implement the whole thing.
>>
>> What the well-formedness constraints like charset mean in the real world:
>> you can't complain if somebody downstream from you *does* strictly abort on
>> your bad output, and there are many applications which will do so
>> automatically because of tooling. Anything using expat, for example.
>>
>> If you don't abort, perhaps on the grounds of "be liberal in what you
>> accept," you can get really nailed on "be conservative in what you send."
>>
>> Jay
>>
>> (who is stuck using a phone, because his second-oldest SSD is now dying a
>> very weird death)
>>
>> On May 11, 2015 5:59 PM, "Coda Highland" <chighland@gmail.com> wrote:
>>>
>>> On Mon, May 11, 2015 at 2:31 PM, Tim Hill <drtimhill@gmail.com> wrote:
>>> >
>>> > On May 11, 2015, at 1:47 PM, Coda Highland <chighland@gmail.com> wrote:
>>> >
>>> > Well that’s true of the ZWNBSP *codepoint* U+FEFF, which of course encodes
>>> > to 0xEF/0xBB/0xBF. But what about dumb encoders that encode a big-endian
>>> > UTF-16 sequence into UTF-8 and emit a byte-swapped encoding for the BOM?
>>> >
>>> >
>>> > Are you saying that the encoder actually emitted U+FFFE instead of U+FEFF?
>>> > Ugh.
>>> >
>>> >
>>> > The problem is that in the early days of Unicode, a 16-bit codepoint space
>>> > was assumed and UCS2 was the assumed encoding, where a single UCS2 16-bit
>>> > code value was assumed to be a single codepoint. Made string length
>>> > computation easy etc. Then Unicode overflowed 16-bits for codepoints and
>>> > UTF-16 with surrogates was invented. This means a lot of old code simply
>>> > assumed (and still does) that a UCS2 encoding *is* just an array of
>>> > codepoints. And so encoding to UTF-8 is assumed to just be encoded UCS2 ..
>>> > urgh. So surrogates slip through into the UTF-8 stream, and so can a BOM
>>> > even if it’s encoded big-endian.
>>> >
>>> > —Tim
>>> >
>>>
>>> Oh, no, I get THAT much. That's the easy part to understand. The hard
>>> part to understand is how the data got byte-swapped in the first
>>> place. It implies that it isn't even being treated as an array of
>>> codepoints, but just an array of uint16s. It further implies that the
>>> UTF-8 was generated by a system that would have been looking at what
>>> appeared to be garbage data in the first place.
>>>
>>> /s/ Adam
>>>
>>