For a [separate project][1] I needed a pure-Lua XML parser. Unsatisfied with my previous [quick-n-dirty pattern-based XML parser][2], I've created a streaming parser that is far more robust. I give you:

SLAXML - https://github.com/Phrogz/SLAXML
(It's pronounced "Slacks-Em-Ell")

Copy/pasting key sections from the README:

## Features

* Pure Lua in a single file (two files if you use the DOM parser).
* Streaming parser does a single pass through the input and reports what it sees along the way.
* Supports processing instructions (`<?foo bar?>`).
* Supports comments (`<!-- hello world -->`).
* Supports CDATA sections (`<![CDATA[ whoa <xml> & other content as text ]]>`).
* Supports namespaces, resolving prefixes to the proper namespace URI (`<foo xmlns="bar">` and `<wrap xmlns:bar="bar"><bar:kittens/></wrap>`).
* Supports unescaped greater-than symbols in attribute content (a common failing for simpler pattern-based parsers).
* Unescapes named XML entities (`&lt; &gt; &amp; &quot; &apos;`) and numeric entities (e.g. `&#10;`) in attributes and text nodes (but—properly—not in comments or CDATA). Properly handles edge cases like `&#38;amp;`.
* Optionally ignore whitespace-only text nodes (as appear when indenting XML markup).
* Includes a DOM parser that is both a convenient way to pull in XML to use and a nice example of using the streaming parser.
* Adds only a single `SLAXML` key to the environment; there is no spam of utility functions polluting the global namespace.
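
To make the `&#38;amp;` edge case concrete, here is a minimal sketch of single-pass entity unescaping. It is not SLAXML's actual code; the names `named` and `unescape` are made up for illustration, and numeric entities are limited to ASCII to match the limitation noted in the README.

```lua
-- Illustrative sketch only; not taken from SLAXML.
local named = { lt = '<', gt = '>', amp = '&', quot = '"', apos = "'" }

local function unescape(str)
  -- One gsub pass: each entity is replaced once, and the replacement text
  -- is never rescanned, so "&#38;amp;" becomes "&amp;" and stops there.
  return (str:gsub('&(#?)(%w+);', function(hash, body)
    if hash == '' then
      return named[body]              -- nil leaves unknown entities untouched
    end
    local hex = body:match('^x(%x+)$')  -- &#x41; style
    local dec = body:match('^%d+$')     -- &#65; style
    local code = (hex and tonumber(hex, 16)) or (dec and tonumber(dec))
    if code and code < 128 then
      return string.char(code)
    end
    -- anything else falls through and is left as written
  end))
end

print(unescape('&#38;amp;'))
print(unescape('a &lt; b &amp;&amp; c &gt; d'))
```

A two-pass approach that expands numeric entities first and named entities second would turn `&#38;amp;` into a bare `&`, which is why the sketch does everything in one `gsub`.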

## Usage
    require 'slaxml'

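    -- read('*all') returns the whole file; a bare read() returns only the first line
    local f = assert(io.open('my.xml'))
    local myxml = f:read('*all')
    f:close()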

    -- Specify as many/few of these as you like
    local parser = SLAXML:parser{
      startElement = function(name,nsURI)       end, -- when "<foo" or "<x:foo" is seen
      attribute    = function(name,value,nsURI) end, -- attribute found on current element
      closeElement = function(name,nsURI)       end, -- when "</foo>", "</x:foo>", or "/>" is seen
      text         = function(text)             end, -- text and CDATA nodes
      comment      = function(content)          end, -- comments
      pi           = function(target,content)   end, -- processing instructions, e.g. "<?yes mon?>"
      namespace    = function(nsURI)            end, -- when xmlns="..." is seen (after startElement)
    }
    }

    -- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
    -- (does not strip leading/trailing whitespace from CDATA)
    parser:parse(myxml,{stripWhitespace=true})
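
Because the parser only reports events, any state you care about lives in your callbacks. As a rough sketch of that pattern, the handlers below keep a stack of open elements and record the path of each element seen. `makePathCollector` is a hypothetical helper, not part of SLAXML, and the events are fed in by hand so the snippet runs without the library installed.

```lua
-- Hypothetical helper: returns a handler table plus the list it fills in.
local function makePathCollector()
  local stack, paths = {}, {}
  local handlers = {
    startElement = function(name, nsURI)
      stack[#stack + 1] = name                      -- push the open element
      paths[#paths + 1] = table.concat(stack, '/')  -- record its full path
    end,
    closeElement = function(name, nsURI)
      stack[#stack] = nil                           -- pop on "</foo>" or "/>"
    end,
  }
  return handlers, paths
end

local handlers, paths = makePathCollector()

-- With the library loaded, the table would be wired up as in the usage above:
--   SLAXML:parser(handlers):parse(myxml)
-- Here the events for <a><b/><c><d/></c></a> are simulated by hand.
handlers.startElement('a')
handlers.startElement('b'); handlers.closeElement('b')
handlers.startElement('c')
handlers.startElement('d'); handlers.closeElement('d')
handlers.closeElement('c')
handlers.closeElement('a')

print(table.concat(paths, '\n'))
```

The same stack could also be used to compare the name passed to `closeElement` against the element on top, if you want to notice mismatched tags yourself.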

If you just want to see if it will parse your document correctly, you can simply do:

    require 'slaxml'
    SLAXML:parse(myxml)

…which will cause SLAXML to use its built-in callbacks that print the results as seen.

## Known Limitations / TODO
- Does not require or enforce well-formed XML (silently ignores and consumes certain syntax errors)
- No support for entity expansion other than
  `&lt; &gt; &quot; &apos; &amp;` and numeric ASCII entities like `&#10;`
- XML Declarations (`<?xml version="1.x"?>`) are incorrectly reported
  as Processing Instructions
- No support for DTDs
- No support for extended characters in element/attribute names
- No support for [XInclude](http://www.w3.org/TR/xinclude/)
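
Until the XML declaration is handled separately, one possible workaround is to filter it out in your own `pi` callback. This is only a sketch: `skipXmlDeclaration` is a made-up name, and it assumes the declaration arrives with the target `"xml"`, which is worth verifying against your own input.

```lua
-- Hypothetical wrapper: forwards every processing instruction except the
-- XML declaration (assumed here to be reported with target "xml").
local function skipXmlDeclaration(onPI)
  return function(target, content)
    if target == 'xml' then return end
    return onPI(target, content)
  end
end

local seen = {}
local pi = skipXmlDeclaration(function(target, content)
  seen[#seen + 1] = target
end)

-- Simulated events; in real use the wrapped function would be passed as
-- the `pi` callback when building the parser.
pi('xml', 'version="1.0"')
pi('xml-stylesheet', 'href="style.css"')

print(#seen, seen[1])
```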



[1]: https://github.com/Phrogz/LXSC
[2]: http://phrogz.net/lua/AKLOMParser.lua