Lazy Tree

lua-users home
wiki

lazytree: lazily construct XML trees

(This is part of LazyKit.)

LazyTree constructs an conventional-seeming XmlTree as its contents are referenced, parsing more source document as necessary to fill out the contents on demand.

Why is this interesting? Trees are a natural data model to process XML documents. A simple tree implementation reads the entire document into memory at once. For large documents, this can be too expensive. Although callback and event APIs are memory-efficient, they are painful to program to.

Given stylized iterators, memory consumption can be limited to particular subtrees. Consider:

for i,child in xnpairs_c(tree) do
  if child.attr.href then 
    print(child.name)
    table.insert(references, child)
  end
end 

where the _c family of iterators nils out nodes from their parent before returning them. If the body of the loop does not retain a reference to the child elsewhere, it will become eligible for garbage collection as soon as the next iteration begins. See ConsumingXml.

Although not currently implemented, other consuming forms may interact with the XML parser for greater savings:

<document>
  <firstname>Jay</firstname>
  <lastname>Carlson</lastname>
  <bodytext>Spending too much time listening to <ref>In Utero</ref> can be [...]
  <title>I Think I'm DOM</title> 
lastname, title = xmlextract.strings_consume(tree, "lastname", "title") 
The strings_consume filter can potentially turn off character data and other events inside any node it knows it doesn't need (like bodytext), as references to them cannot possibly affect the rest of the program.

Dependencies

lazytree depends on lxpevent to generate event queues, which depends on LuaExpat?.

Cautions

Calling the normal ipairs iterator on a lazytree that has not been completely loaded will not work, as ipairs(lz) does not directly reference lz.n. Use the XmlIter iterators xpairs{_c} and xnpairs{_c}; the second is more convenient anyway. See XmlIter.

Usage

lazytree.parsestring(s)

Returns a tree lazily parsed from the string s.

lazytree.parsefile(file)

Returns a tree lazily parsed from file. If file is a string, it is interpreted as a filename and opened; otherwise, file is treated as an io library file object.

lazytree.parseevents(event_source)

Returns a tree lazily parsed from the lxpevent event_source.

lazytree.load(tree)

Force the entire contents of tree to be read in. It is safe to call this on non-lazy trees.

lazytree.consume(tree)

Indicate tree is no longer needed and may be destroyed. tree may be either a lazytree or a conventional tree, and should be the last reference to it.

Although not currently implemented, calling consume on the part of the lazytree currently being built could tell the lazy parser not to bother populating that portion of the tree. This is not intended as a general user tool; rather, it is a primitive that can be used by consuming filtering iterators such as xmliter.switch when they notice that a tree they have encountered will be skipped and have no visibility in the application.

lazytree.lazyprint(tree)

Prints the current contents of a lazytree without further parsing. Useful for demonstration purposes.

Implementation details

Given the following XML:

<paragraph justify='centered'>first child<b>bold</b>second child</paragraph> 
A lazytree will appear to have the following contents:

lz = {name="paragraph", attr={justify="centered"}, 
  "first child", 
  {name="b", "bold", n=1},
  "second child",
  n=3
} 
However, on the start of parsing, the actual underlying table will contain:

lz = {name="paragraph", attr={justify="centered"}, 
  _read_so_far=0
} 
After a reference to lz[1], it will contain:

lz = {name="paragraph", attr={justify="centered"}, 
  "first child",
  _read_so_far=1
} 
And after a reference to lz[2]:

lz = {name="paragraph", attr={justify="centered"}, 
  "first child",
  {name="b", _read_so_far=0}
} 
Note that the child is read lazily as well. However, a reference to lz[3] will force all of lz[2] to be completed:

lz = {name="paragraph", attr={justify="centered"}, 
  "first child",
  {name="b", "bold", n=1}
  "second child",
  _read_so_far=3
} 
Reading either lz[4] (which is nil) or lz.n will force the completion of the tree.

Note that reading from lz.n will force the remainder of the tree to be read, as we don't know how long it's going to be until it closes.


RecentChanges · preferences
edit · history
Last edited February 29, 2004 12:33 am GMT (diff)