Xml Tree


xmltree: a mid-level structure for XML

(This document was originally part of LazyKit.)

This document describes a mid-level representation of XML as trees in Lua. It is intended to be saved into the lua-users wiki as a place to remember discussion on this subject.

The representation is intended to describe data from the XML Infoset, yet remain simple to work with in idiomatic Lua code. The representation is an interface, allowing fancy implementations to use metatables to provide the same interface as bare tables.

LxpTree and LazyTree implement this.

Spec by example

<paragraph justify='centered'>first child<b>bold</b>second child</paragraph> 
lz = {name="paragraph", attr={justify="centered"}, 
  "first child",
  {name="b", "bold", n=1},
  "second child",
  n=3
} 

Strawman spec

A tree is a Lua table representation of an element and its contents. The table must have a name key, giving the element name.

The tree may have an attr key, which gives a table of all of the attributes of the element. Only string keys are relevant. (LuaExpat uses numeric keys to mark attributes that were defaulted from the DTD.) A convenience iterator like xattrpairs(tree) should be provided.
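As a rough illustration of the iterator mentioned above, here is one way xattrpairs could be written. This is only a sketch, not the LazyKit implementation: it assumes attr is a plain table that next() can traverse, and it simply skips every key that is not a string.

```lua
-- Hypothetical sketch of xattrpairs: visit only the string-keyed
-- entries of tree.attr, skipping numeric bookkeeping keys.
-- Assumes attr is a plain table (no metatable tricks).
function xattrpairs(tree)
  local attr = tree.attr or {}   -- elements without attributes yield nothing
  local function iter(t, k)
    local v
    repeat
      k, v = next(t, k)
    until k == nil or type(k) == "string"
    return k, v
  end
  return iter, attr, nil
end

-- Usage: the numeric entry in attr is never visited.
for name, value in xattrpairs{name="p", attr={"justify", justify="centered"}} do
  print(name, value)
end
```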

If the element is not empty, each child node is contained in tree[1], tree[2], etc. Child nodes may be either strings, denoting character data content, or other trees.
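To show how the two kinds of child node are told apart in practice, the following sketch gathers all character data beneath a tree in document order. The name textcontent is made up for this example, and the code assumes Lua 5.1 or later (it uses the # operator) and a plain-table tree.

```lua
-- Hypothetical helper: concatenate all character data under a node.
-- A string child is character data; a table child is a subtree.
function textcontent(node)
  if type(node) == "string" then return node end
  local parts = {}
  local count = node.n or #node   -- n is optional in tree literals
  for i = 1, count do
    parts[#parts + 1] = textcontent(node[i])
  end
  return table.concat(parts)
end
```

Applied to the paragraph tree from the example section, this returns "first childboldsecond child".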

Parsers should try to merge adjacent character data content. That is, they should avoid producing something like:

{name="p", "Hello w", "orld"} 
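A parser or post-processing step could repair such a tree with something like the sketch below. The function name mergetext is hypothetical; it works in place on one element's children only (it does not recurse) and assumes Lua 5.1 or later and a plain-table tree.

```lua
-- Hypothetical sketch: merge runs of adjacent string children in place.
function mergetext(tree)
  local count = tree.n or #tree
  local out = 0                          -- index of the last child kept
  for i = 1, count do
    local child = tree[i]
    if type(child) == "string" and out > 0 and type(tree[out]) == "string" then
      tree[out] = tree[out] .. child     -- extend the previous text node
    else
      out = out + 1
      tree[out] = child
    end
  end
  for i = out + 1, count do tree[i] = nil end   -- drop leftover slots
  if tree.n then tree.n = out end               -- keep n accurate if present
  return tree
end
```

With this, mergetext{name="p", "Hello w", "orld"} leaves a single child, "Hello world".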

Parsers should include an n key, giving the number of child nodes. However, to be tolerant of tree literals in code, general-purpose processing code should use code like

tree.n or table.getn(tree) 

(found as xmliter.getn(tree)), in the same way they would use table.getn(list) on normal lists instead of list.n.

(Why a separate getn? This is necessary because table.getn(tree) does not do an ordinary lookup of tree.n; it uses rawget(tree, "n"), which bypasses metatables. Fancy tree implementations may need a metatable call to find the number of children.)
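A minimal sketch of such a getn follows. It is not the actual xmliter source: it assumes Lua 5.1 or later, where the # operator takes the place of table.getn, and it relies on ordinary indexing so that an __index metamethod can supply n.

```lua
-- Hypothetical sketch of xmliter.getn.
xmliter = xmliter or {}

function xmliter.getn(tree)
  local n = tree.n        -- ordinary lookup, so __index metamethods are honoured
  if n ~= nil then return n end
  return #tree            -- fallback for tree literals written without n
end

-- Usage: a lazy tree that reports its child count through a metatable.
local lazy = setmetatable({name="p"}, {
  __index = function(t, k) if k == "n" then return 3 end end
})
print(xmliter.getn(lazy))
```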

Things explicitly not modeled

Syntactic details of XML source files are out of scope. To wit:

The order of attributes on elements is unimportant.

The presence of a CDATA section is not interesting; it is just another way to write character data.

Comments are not interesting.

The source of attributes, whether explicit or defaulted from a DTD, is not interesting.

Things explicitly modeled

All elements, regardless of duplicates.

All character data. That includes mixed content.

The order of the above.

Things that should be modeled

DTD. This could go in root.dtd.

Encoding. However, declaring everything to be in UTF-8 might not be so bad, especially for US-ASCII users.

Namespaces. I don't have enough experience with them to propose a design.


Last edited February 28, 2004 11:28 pm GMT (diff)