lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Title: RE: lazy xml teaser

Very nice, and would handle some jabber-xml stuff I'm already doing quite nicely.

Need any help with this?

-joe

-----Original Message-----
From: Jay Carlson [mailto:nop@nop.com]
Sent: Sunday, February 22, 2004 10:41 PM
To: 'Lua list'
Subject: lazy xml teaser


Here's the start of the docs for something I've been working on.  Lemme
know what you think.

lazynode: lazily construct XML trees

Why is this interesting?  Trees are a natural data model to process
XML documents.  A simple tree implementation reads the entire document
into memory at once.  For large documents, this can be too expensive.
Although callback and event APIs are memory-efficient, they are
painful to program to.

Lazynode constructs an conventional-seeming XML tree as its contents
are referenced, parsing more source document as necessary to fill out
the contents on demand.

Given stylized iterators, memory consumption can be limited to
particular subtrees.  Consider:

for i,child in xpairs_consume(lz) do
   if child.attr.externalref then
     print(child.name)
     table.insert(references, child)
   end
end

where the _consume family of iterators nils out elements from their
parent before returning them.  If the body of the loop does not retain
a reference to the child elsewhere, it will become eligible for
garbage collection as soon as the next iteration begins.

Although not currently implemented, other consuming forms may interact
with the XML parser for greater savings:

<document>
   <firstname>Jay</firstname>
   <lastname>Carlson</lastname>
   <bodytext>Spending too much time listening to <ref>In Utero</ref> can
be [...]
   <title>I Think I'm DOM</title>
</document>

lastname, title = xmlextract.strings_consume(tree, "lastname", "title")

The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.

Implementation details:

Given the following XML:

   <paragraph justify='centered'>first child<b>bold</b>second
child</paragraph>

A lazynode will appear to have the following contents:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", "bold", n=1},
   "second child",
   n=3
}

However, on the start of parsing, the actual underlying table will contain:

lz = {name="paragraph", attr={justify="centered"},
   _read_so_far=0
}

After a reference to lz[1], it will contain:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   _read_so_far=1
}

   And after a reference to lz[2]:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", _read_so_far=0}
}

Note that the child is read lazily as well.  However, a reference to lz[3]
will force all of lz[2] to be completed:


lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", "bold", n=1}
   "second child",
   _read_so_far=3
}

Reading either lz[4] (which is nil) or lz.n will force the completion
of the node.

Note that reading from lz.n will force the remainder of the node to be
read, as we don't know how long it's going to be until it closes.

In general, calling normal iterators on a lazy node is a bad idea, as
all kinds of goo may be present.

Under the surface, this is implemented by pulling events from a queue
generated from expat.



- - - - - - - Appended by Scientific-Atlanta, Inc. - - - - - - -
This e-mail and any attachments may contain information which is confidential, proprietary, privileged or otherwise protected by law. The information is solely intended for the named addressee (or a person responsible for delivering it to the addressee). If you are not the intended recipient of this message, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete it from your computer.