- Subject: Re: Simple XHTML (XML) parser/printer
- From: Edgar Toernig <froese@...>
- Date: Sun, 22 Mar 2009 23:39:45 +0100
Tuomo Valkonen wrote:
>
> I'm looking for a simple XML parser/printer for Lua, that
> would let me manipulate an (incomplete) XHTML document
> with little effort.

I once wrote a simple XML lexer (about 200 lines) for my version
of Lua (Sol). It adds a new pattern ("*x") to the read method
that returns the next "XML token". It's not a validating lexer,
but it should handle well-formed XML and HTML.

Usage:

    token, name, value = file_handle:read"*x"

The return values are:

  token=nil
      End of file.
  token="TAG" name="TAGNAME" value={ATTR="value1", ...} or nil
      A tag with optional attributes (<tagname attr="value1">).
  token="ENDTAG" name="TAGNAME"
      An end tag (</tagname>).
  token="EMPTYTAG" name="TAGNAME" value={ATTR="value1", ...} or nil
      Same as TAG but with a trailing "/>".
  token="DATA" name="all the text"
      All the literal text between two tags etc.
  token="SPACE" name="some white space"
      Same as DATA but the text contains only white space.
  token="PI" name="PINAME" value="all the data"
      Processing instruction (<?piname all the data?>).
  token="COMMENT" name="the comment text"
      A comment (<!--the comment text-->).
  token="BRDECL" name="NAME" value="all the data"
      A bracket declaration (<![name [all the data]]>).
  token="DECL" name="NAME" value="all the data"
      A declaration (<!name all the data>).

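Since the original question also asked for a printer, here is a rough sketch of how such token triples could be turned back into markup. It is plain Lua (5.1 or later) and not part of siolib.c; the names attrs_to_string and token_to_string are made up for illustration. It assumes that text and attribute values can be written back verbatim (no entity handling or quote escaping) and that a non-string attribute value marks an attribute given without a value.

```lua
-- Hypothetical helpers, not part of siolib.c.

-- Serialize an attribute table; keys are sorted so the output is stable.
local function attrs_to_string(attrs)
  if not attrs then return "" end
  local keys = {}
  for k in pairs(attrs) do keys[#keys + 1] = k end
  table.sort(keys)
  local out = {}
  for _, k in ipairs(keys) do
    local v = attrs[k]
    if type(v) == "string" then
      out[#out + 1] = string.format(' %s="%s"', k, v)
    else
      out[#out + 1] = " " .. k   -- assumed: attribute without a value
    end
  end
  return table.concat(out)
end

-- Turn one (token, name, value) triple back into markup text.
local function token_to_string(token, name, value)
  if token == "TAG" then
    return "<" .. name .. attrs_to_string(value) .. ">"
  elseif token == "ENDTAG" then
    return "</" .. name .. ">"
  elseif token == "EMPTYTAG" then
    return "<" .. name .. attrs_to_string(value) .. "/>"
  elseif token == "DATA" or token == "SPACE" then
    return name                  -- assumed: text is written back verbatim
  elseif token == "PI" then
    return "<?" .. name .. " " .. value .. "?>"
  elseif token == "COMMENT" then
    return "<!--" .. name .. "-->"
  elseif token == "BRDECL" then
    return "<![" .. name .. " [" .. value .. "]]>"
  elseif token == "DECL" then
    return "<!" .. name .. " " .. value .. ">"
  end
  error("unknown token: " .. tostring(token))
end

-- Usage: a few hand-written triples in the shapes listed above.
local tokens = {
  { "TAG", "P", { CLASS = "note" } },
  { "DATA", "hello" },
  { "EMPTYTAG", "BR" },
  { "ENDTAG", "P" },
}
local parts = {}
for i, t in ipairs(tokens) do
  parts[i] = token_to_string(t[1], t[2], t[3])
end
local document = table.concat(parts)
print(document)
```

Under Sol the same function could presumably be fed straight from file_handle:read"*x" in a loop until token is nil.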
So, for example, this input

    <?xml version="1.0" encoding="UTF-8" ?>
    Foo bar <
    <foo a="val" b c="more"/>
    <bar>end</bar>

gives (each line is one read"*x"):

    PI XML 'version="1.0" encoding="UTF-8" '
    DATA '\nFoo bar <\n'
    EMPTYTAG FOO { A="val", B=1, C="more" }
    SPACE '\n'
    TAG BAR nil
    DATA 'end'
    ENDTAG BAR
    nil

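To manipulate a document, a token stream like this one can be folded into nested tables. The following is only a sketch in plain Lua (5.1 or later), not code from siolib.c. build_tree and replay are hypothetical names, the node layout (name, attrs, children) is my own choice, and the token list is the example stream typed in by hand so the sketch runs without Sol.

```lua
-- next_token is any function that returns (token, name, value) per call.
-- Under Sol it could be: function() return fh:read"*x" end
local function build_tree(next_token)
  local root = { name = "#ROOT", attrs = {}, children = {} }
  local stack = { root }
  while true do
    local token, name, value = next_token()
    if token == nil then break end            -- end of input
    local top = stack[#stack]
    if token == "TAG" or token == "EMPTYTAG" then
      local node = { name = name, attrs = value or {}, children = {} }
      top.children[#top.children + 1] = node
      if token == "TAG" then stack[#stack + 1] = node end
    elseif token == "ENDTAG" then
      -- Pop only when the end tag matches the open element;
      -- stray end tags are silently ignored in this sketch.
      if #stack > 1 and top.name == name then stack[#stack] = nil end
    elseif token == "DATA" or token == "SPACE" then
      top.children[#top.children + 1] = name  -- text is kept as a string
    end
    -- PI, COMMENT, DECL and BRDECL tokens are dropped here.
  end
  return root
end

-- Stand-in for the lexer: replays a fixed list of token triples.
local function replay(tokens)
  local i = 0
  return function()
    i = i + 1
    local t = tokens[i]
    if t then return t[1], t[2], t[3] end
    return nil
  end
end

local tree = build_tree(replay{
  { "PI", "XML", 'version="1.0" encoding="UTF-8" ' },
  { "DATA", "\nFoo bar <\n" },
  { "EMPTYTAG", "FOO", { A = "val", B = 1, C = "more" } },
  { "SPACE", "\n" },
  { "TAG", "BAR" },
  { "DATA", "end" },
  { "ENDTAG", "BAR" },
})
print(#tree.children, tree.children[4].name, tree.children[4].children[1])
```

Keeping SPACE tokens as children makes it possible to print the document back out with its original layout; drop them if only the element structure matters.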
You can find the lexer in this file:

    http://goron.de/~froese/siolib.c

The code is marked with WITH_XML_LEXER ifdefs. As I said,
it's for Sol but porting it to Lua should be easy (the
corresponding Lua file would be src/lib/liolib.c).

Btw, the file also contains a CSV reader ("*c") and a binary
reader ("*b") ...

Ciao, ET.