On Jul 11, 2012, at 10:36 AM, Laurent FAILLIE wrote:

> I found a problem as well with luaexpat (1.2.0-1). It treats carriage returns as strings rather than as whitespace, which creates noisy entries in the resulting array.
> For example:
> ---
> <SOAP-ENV:Body>[CR]
> <ser-root:CheckEntryPointsResponse xmlns:ser-root=""><ErrorMessages>[CR]
> <ArrayOfstringItem>Port FilePollingListener:/home/wmadm/usr/laurent/tst is disabled</ArrayOfstringItem>[CR]
> <ArrayOfstringItem>Task STAdmin.purge:purgeISData is suspended</ArrayOfstringItem>[CR]
> ---
> creates something like:
> ---
> {
>   "
> "
>   {

This is not a bug in luaexpat; in fact this behavior is specifically documented. The last sentence of :

|   Note that even the new-line and tab characters are stored on the table.

This is proper behavior for a general XML processor. See XML 1.0 (Fifth Edition), Section 2.10: White Space Handling:

> An XML processor MUST always pass all characters in a document that are not markup through to the application.

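(expat is not a validating XML processor, so the next sentence of the specification, which requires a validating processor to also tell the application which of these characters are white space in element content, does not apply.)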

If you do not want mixed content, the right place to handle it is in a LOM tree builder: once you see the first start-tag event for a child element inside an element, you can delete any preceding whitespace and start ignoring whitespace character data until the element closes. Better, assert() that any character data is whitespace before ignoring it--this will catch violations of the no-mixed-content assumption.
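
A minimal sketch of such a builder follows. It assumes the usual lxp callback names (StartElement, EndElement, CharacterData) and a LOM-like node layout of { tag = ..., attr = ..., children... }. new_builder and the has_child flag are hypothetical names for illustration, not part of LuaExpat.

```lua
-- Hypothetical helper: returns a callbacks table plus a function that
-- yields the finished tree. Not part of LuaExpat itself.
local function new_builder()
  -- Synthetic holder so the document element has somewhere to go.
  local root = { has_child = false, node = {} }
  local stack = { root }
  local callbacks = {}

  local function check_whitespace(text, node)
    assert(text:find("^%s*$"),
           "mixed content in <" .. tostring(node.tag) .. ">")
  end

  function callbacks.StartElement(_, name, attrs)
    local top = stack[#stack]
    if not top.has_child then
      -- First child element: everything stored so far is character
      -- data, which must be whitespace. Check it, then delete it.
      for i = #top.node, 1, -1 do
        check_whitespace(top.node[i], top.node)
        top.node[i] = nil
      end
      top.has_child = true
    end
    local node = { tag = name, attr = attrs }
    top.node[#top.node + 1] = node
    stack[#stack + 1] = { has_child = false, node = node }
  end

  function callbacks.CharacterData(_, text)
    local top = stack[#stack]
    if top.has_child then
      -- Element content: only whitespace is allowed, and it is dropped.
      check_whitespace(text, top.node)
      return
    end
    -- No child element seen yet: keep the text, merging adjacent chunks.
    local n = #top.node
    if type(top.node[n]) == "string" then
      top.node[n] = top.node[n] .. text
    else
      top.node[n + 1] = text
    end
  end

  function callbacks.EndElement()
    stack[#stack] = nil
  end

  return callbacks, function() return root.node[1] end
end
```

With LuaExpat installed, the callbacks table would be handed to lxp.new(callbacks) and the document fed in through parser:parse(). An element that contains only text keeps that text untouched; only elements with child elements have their whitespace removed.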

The big misfeature (or bug) in LuaExpat is reporting attributes by both name and order. Order of attributes cannot be significant in XML, and having those numeric values there means pairs() can't be used to iterate over attributes. The first thing I do when receiving start tag events is nuke the damn array-part.
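
For illustration, clearing the array part could look like the sketch below. It assumes the attribute table has the shape described above: attribute names at integer keys 1..n, plus name = value pairs. strip_attr_order is a hypothetical helper name.

```lua
-- Hypothetical helper: drop the positional entries so that pairs()
-- yields only name = value attribute pairs.
local function strip_attr_order(attrs)
  for i = #attrs, 1, -1 do
    attrs[i] = nil
  end
  return attrs
end

-- Example table in the assumed shape (names by position, values by name).
local attrs = { "id", "class", id = "x1", class = "note" }
strip_attr_order(attrs)
for name, value in pairs(attrs) do
  print(name, value)
end
```

Calling this at the top of a StartElement callback leaves the table holding nothing but name = value entries.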