- Subject: [ANN] SLAXML - pure Lua, robust-ish, SAX-like streaming XML processor
- From: Gavin Kistner <phrogz@...>
- Date: Sun, 17 Feb 2013 15:07:26 -0700
For a [separate project][1] I needed a pure-Lua XML parser. Unsatisfied with my previous [quick-n-dirty pattern-based XML parser][2], I've created a streaming parser that is far more robust. I give you:
SLAXML - https://github.com/Phrogz/SLAXML
(It's pronounced "Slacks-Em-Ell")
Copy/pasting key sections from the README:
## Features
* Pure Lua in a single file (two files if you use the DOM parser).
* Streaming parser does a single pass through the input and reports what it sees along the way.
* Supports processing instructions (`<?foo bar?>`).
* Supports comments (`<!-- hello world -->`).
* Supports CDATA sections (`<![CDATA[ whoa <xml> & other content as text ]]>`).
* Supports namespaces, resolving prefixes to the proper namespace URI (`<foo xmlns="bar">` and `<wrap xmlns:bar="bar"><bar:kittens/></wrap>`).
* Supports unescaped greater-than symbols in attribute content (a common failing for simpler pattern-based parsers).
* Unescapes named XML entities (`&lt; &gt; &amp; &quot; &apos;`) and numeric entities (e.g. `&#10;`) in attributes and text nodes (but, properly, not in comments or CDATA). Properly handles edge cases like `&amp;amp;`.
* Optionally ignore whitespace-only text nodes (as appear when indenting XML markup).
* Includes a DOM parser that is both a convenient way to pull in XML for use and a nice example of using the streaming parser.
* Adds only a single `SLAXML` key to the environment; there is no spam of utility functions polluting the global namespace.
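
To illustrate the entity handling described in the list above, here is a minimal sketch of single-pass unescaping. It is not SLAXML's actual code: the `unescape` function and the `named` table are hypothetical names invented for this example, and the restriction to code points below 128 is an assumption that mirrors the "numeric ASCII entities" limitation noted later.

```lua
-- Hypothetical helper for illustration; not taken from SLAXML.
local named = { lt = '<', gt = '>', amp = '&', quot = '"', apos = "'" }

local function unescape(str)
  -- One gsub pass: each "&...;" is replaced at most once and the replacement
  -- text is never rescanned, so a doubly escaped ampersand is unescaped
  -- exactly one level.
  return (str:gsub('&(#?)(%w+);', function(hash, body)
    if hash == '#' then
      local code
      if body:match('^x%x+$') then
        code = tonumber(body:sub(2), 16)   -- hexadecimal form, e.g. &#x41;
      elseif body:match('^%d+$') then
        code = tonumber(body, 10)          -- decimal form, e.g. &#10;
      end
      -- Assumption: only ASCII code points are converted.
      if code and code < 128 then return string.char(code) end
    else
      return named[body]                   -- nil for unknown names
    end
    -- Falling through returns nil, which makes gsub keep the original text.
  end))
end
```

Because unknown entities such as `&nbsp;` return nil from the callback, they pass through unchanged rather than being dropped.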
## Usage
    require 'slaxml'
    -- Read the whole file; a bare :read() would return only the first line
    local myxml = io.open('my.xml'):read('*all')

    -- Specify as many/few of these as you like
    local parser = SLAXML:parser{
      startElement = function(name,nsURI)       end, -- When "<foo" or "<x:foo" is seen
      attribute    = function(name,value,nsURI) end, -- attribute found on current element
      closeElement = function(name,nsURI)       end, -- When "</foo>" or "</x:foo>" or "/>" is seen
      text         = function(text)             end, -- text and CDATA nodes
      comment      = function(content)          end, -- comments
      pi           = function(target,content)   end, -- processing instructions e.g. "<?yes mon?>"
      namespace    = function(nsURI)            end, -- when xmlns="..." is seen (after startElement)
    }

    -- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
    -- (does not strip leading/trailing whitespace from CDATA)
    parser:parse(myxml,{stripWhitespace=true})

If you just want to see if it will parse your document correctly, you can simply do:

    require 'slaxml'
    SLAXML:parse(myxml)

…which will cause SLAXML to use its built-in callbacks that print the results as seen.
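
As a rough idea of how the streaming callbacks can be combined into something DOM-like, here is a minimal sketch that builds a nested table from `startElement`, `attribute`, `closeElement` and `text`. It is not the DOM parser that ships with SLAXML: `treeBuilder` and the `name`/`attr`/`kids` field names are made up for this example, and it assumes that attributes are reported after the `startElement` of the element they belong to.

```lua
-- Hypothetical helper: returns a callback table plus the root of the tree
-- that those callbacks fill in.
local function treeBuilder()
  local root  = { name = '#document', attr = {}, kids = {} }
  local stack = { root }   -- stack[#stack] is the element currently open

  local callbacks = {
    startElement = function(name, nsURI)
      local el = { name = name, nsURI = nsURI, attr = {}, kids = {} }
      local parent = stack[#stack]
      parent.kids[#parent.kids + 1] = el
      stack[#stack + 1] = el             -- descend into the new element
    end,
    attribute = function(name, value)
      stack[#stack].attr[name] = value   -- belongs to the open element
    end,
    closeElement = function()
      stack[#stack] = nil                -- pop back to the parent
    end,
    text = function(text)
      local parent = stack[#stack]
      parent.kids[#parent.kids + 1] = text
    end,
  }
  return callbacks, root
end

-- Intended use with the API shown above (needs slaxml.lua on the path):
--   local callbacks, root = treeBuilder()
--   SLAXML:parser(callbacks):parse(myxml, {stripWhitespace=true})
```

The callbacks are plain functions, so they can also be driven by hand (for example `callbacks.startElement('root')`) to test the builder without any XML input.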
## Known Limitations / TODO
- Does not require or enforce well-formed XML (silently ignores and consumes certain syntax errors)
- No support for entity expansion other than
`&lt; &gt; &quot; &apos; &amp;` and numeric ASCII entities like `&#10;`
- XML Declarations (`<?xml version="1.x"?>`) are incorrectly reported
as Processing Instructions
- No support for DTDs
- No support for extended characters in element/attribute names
- No support for [XInclude](http://www.w3.org/TR/xinclude/)
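
Until the XML declaration limitation listed above is addressed, calling code could filter it out itself. The following is a minimal sketch, not part of SLAXML: `skipXMLDeclaration` is a hypothetical wrapper, and it assumes the declaration reaches the `pi` callback with the target `xml`.

```lua
-- Hypothetical wrapper: forwards real processing instructions to onPI and
-- drops the XML declaration.
local function skipXMLDeclaration(onPI)
  return function(target, content)
    -- Assumption: the declaration is reported with target "xml". The XML
    -- spec reserves that name in every letter case, so no genuine PI is
    -- lost by comparing case-insensitively.
    if target:lower() == 'xml' then return end
    return onPI(target, content)
  end
end

-- Intended use with the API shown above (needs slaxml.lua on the path):
--   SLAXML:parser{
--     pi = skipXMLDeclaration(function(target, content) print(target, content) end)
--   }
```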
[1]: https://github.com/Phrogz/LXSC
[2]: http://phrogz.net/lua/AKLOMParser.lua