- Subject: Re: XML parser with DOM-like API
- From: Jay Carlson <nop@...>
- Date: Tue, 11 Oct 2011 13:31:22 -0400
On Mon, Oct 10, 2011 at 4:30 PM, Florian Weimer <email@example.com> wrote:
> * Matthew Wild:
>> http://matthewwild.co.uk/projects/luaexpat/lom.html - I'm not sure it
>> can get much simpler (except something like XPath maybe...).
LOM is as close as there is to consensus/convention for tree-like
representations of XML documents in Lua. IMO "tag" is a bad name, as
<a b='c'> is a tag. On the other hand the people who most care are
ones who enjoy the HyTime spec as bedtime reading, and personally I'm
willing to let that choice of nomenclature go in order to have a
metalua uses "tag" as its table pattern matching key, so maybe my
taste is off anyway. Speaking of pattern matching, working with trees
may make you wish for better control structure syntax.
http://lua-users.org/wiki/XmlIter 's xmliter.switch is about the best
I could do for pattern match control structures in standard Lua
syntax. I still think providing "view" objects as proxies to LOM nodes
has merit as in http://lua-users.org/wiki/XmlView .[#]
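For anyone who hasn't looked at LOM, here is a minimal sketch of the table shape. It is built by hand rather than by calling lxp.lom.parse, so it runs without LuaExpat installed; child_elements is a hypothetical helper of mine, not part of any library.

```lua
-- Hand-built LOM-style node for: <a b='c'>hello<d/></a>
-- (a table of the shape lxp.lom.parse produces; no parser needed here)
local doc = {
  tag = "a",
  attr = { "b", b = "c" },   -- array part: names in order; hash part: values
  "hello",                   -- text children are plain strings
  { tag = "d", attr = {} },  -- element children are nested nodes
}

-- Hypothetical helper: collect the element children with a given tag.
local function child_elements(node, tag)
  local found = {}
  for _, child in ipairs(node) do
    if type(child) == "table" and child.tag == tag then
      found[#found + 1] = child
    end
  end
  return found
end

print(doc.tag, doc.attr.b, #child_elements(doc, "d"))
```

Every loop over children has to do the type(child) == "table" check by hand, which is the sort of thing that makes you wish for pattern matching.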
> I would use a different encoding: node[0] for the tag, node[1] to
> node[#node] for the children,
I argue this with myself in
> and node[attr] for the value of the
> attribute attr (a string).
This keeps node:f() from ever working, and makes life more difficult
for annotation/decoration of nodes.
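To make the collision concrete, here is a sketch that assumes the proposed encoding is node[0] for the tag, node[1]..node[#node] for the children, and node[attr] for attribute values. The Node method table and its name method are hypothetical, purely for illustration.

```lua
-- Hypothetical method table shared by all nodes.
local Node = {}
Node.__index = Node
function Node:name() return self[0] end

-- <item name='x'>text</item> in the proposed encoding.
local item = setmetatable({ [0] = "item", "text", name = "x" }, Node)

-- The attribute lives directly on the node, so it shadows the method:
-- __index is only consulted when the key is absent from the table itself.
print(type(item.name))   -- the attribute value, not the function

-- Calling item:name() tries to call that attribute value and raises an error.
local ok = pcall(function() return item:name() end)
print(ok)

-- A node with no 'name' attribute still reaches the method.
local other = setmetatable({ [0] = "other" }, Node)
print(other:name())
```

Since attribute names come from the document, any method name can be shadowed by input you don't control. The same goes for any field a decorator wants to hang off the node.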
> This works because the order of attributes
> does not matter.
Perhaps including ordering in the attribute table serves a useful
teaching purpose in the Lua book's lxp library, but it is a
misdesign for people writing code to process XML. Attribute order
isn't part of the infoset, meaning programs processing XML can't
derive any meaning from the order of attributes. It only serves to
make loops iterating over pairs(node.attr) more complicated.
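A sketch of the complication, using a hand-built attr table in the LOM layout (names in the array part, values under string keys):

```lua
-- attr table laid out LOM-style for <a x='1' y='2'>
local attr = { "x", "y", x = "1", y = "2" }

-- Naive loop: pairs also visits the positional entries.
local naive = 0
for _ in pairs(attr) do naive = naive + 1 end

-- Filtered loop: keep only the name -> value pairs.
local values = {}
for k, v in pairs(attr) do
  if type(k) == "string" then values[k] = v end
end

print(naive)               -- counts 4 entries for 2 attributes
print(values.x, values.y)
```

Without the ordering entries the type(k) filter would be unnecessary.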
> However, that is quite distinct from the DOM API. 8-)
The DOM was a horrible, horrible disaster for XML. I would say it
set back XML and the Web by a year, but a full accounting of its damage
is neither obvious nor complete.
[1]: Code that cares about attribute order is pretty much restricted
to things like editors, and at that point you need to a) know to care
about things like comments and CDATA and PIs and b) know not to care
about them when extracting meaning from the document. If you don't
understand that CDATA is syntactic sugar for <> quoting you have
no business asking for it, and you especially have no business
writing, oh, RSS specs and implementations.
[2]: See [1], and the overwhelming bulk of document-manipulating code
has no clue at all about points "a" and "b" yet often bears complexity
costs. The DOM by charter is not idiomatic code in any language;
amounts of time, money, and blood pressure medication, it perhaps was
unthinkable as it would not map directly to the COM ecosystem and
would have made VBScript look even worse. If the DOM had a motto it
would be "to level the playing field and make everything suck just as
much as other COM manipulation." Not that Java had its Collection
house in order either.
[3]: James Clark did an enormous service by writing a free,
world-class parser that handled errors strictly and made the path of
least resistance working with normalized documents--this kept a lot of
other people honest and helped herd (most of) the XML ecosystem away
from wandering back into tag soup.
[#]: An example of a view:
v = xmlview.string(parse[[<l>
print(v.m) => INFO
print(v.dup) => error "duplicate content"
print(v.mixed) => error "mixed content"