Tuomo Valkonen wrote:
>
> I'm looking for a simple XML parser/printer for Lua, that
> would let me manipulate an (incomplete) XHTML document 
> with little effort.

I once wrote a simple (200 lines) XML-lexer for my version
of Lua (Sol).  It adds a new pattern to the read method ("*x")
to return the next "XML token".  It's not a validating lexer
but it should handle well-formed XML and HTML.

Usage:
	token, name, value = file_handle:read"*x"

The return values are:

	token=nil
		End of file

	token="TAG" name="TAGNAME" value={ATTR="value1", ...} or nil
		A tag with optional attributes (<tagname attr="value1">)

	token="ENDTAG" name="TAGNAME"
		An end tag (</tagname>)

	token="EMPTYTAG" name="TAGNAME" value={ATTR="value1",... } or nil
		Same as TAG but with a trailing "/>"

	token="DATA" name="all the text"
		All the literal text between two tags etc.

	token="SPACE" name="some white space"
		Same as DATA but the text contains only white space.

	token="PI" name="PINAME" value="all the data"
		Processing instruction (<?piname all the data?>)

	token="COMMENT" name="the comment text" 
		A comment (<!--the comment text-->)

	token="BRDECL" name="NAME" value="all the data"
		A bracket declaration (<![name [all the data]]>)

	token="DECL" name="NAME" value="all the data"
		A declaration (<!name all the data>)
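
A read loop over this interface could look roughly like the sketch
below.  Stock Lua has no "*x" format, so the sketch uses a hypothetical
stand-in (fake_handle, not part of Sol or Lua) whose read method replays
a fixed token list; with the real lexer you would pass an open file
handle instead.  count_tokens is likewise just an illustrative name.

```lua
-- Hypothetical stand-in for a Sol file handle: replays a fixed list of
-- {token, name, value} triples, then returns nil (end of file).
local function fake_handle(tokens)
  local i = 0
  return {
    read = function(self, fmt)
      assert(fmt == "*x", "this stand-in only understands '*x'")
      i = i + 1
      local t = tokens[i]
      if t then return t[1], t[2], t[3] end
      return nil
    end
  }
end

-- Count tokens by type until read"*x" signals end of file with nil.
local function count_tokens(fh)
  local counts = {}
  while true do
    local token, name, value = fh:read"*x"
    if token == nil then break end
    counts[token] = (counts[token] or 0) + 1
  end
  return counts
end

local fh = fake_handle{
  {"TAG", "BAR"},
  {"DATA", "end"},
  {"ENDTAG", "BAR"},
}
local counts = count_tokens(fh)
print(counts.TAG, counts.DATA, counts.ENDTAG)
```

The only thing the loop relies on is that token is nil at end of file;
name and value are simply ignored here.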

So for example this input

	<?xml version="1.0" encoding="UTF-8" ?>
	Foo bar &lt;
	<foo a="val" b c="more"/>
	<bar>end</bar>

gives (each line one read"*x"):
	PI XML 'version="1.0" encoding="UTF-8" '
	DATA '\nFoo bar &lt;\n'
	EMPTYTAG FOO { A="val", B=1, C="more" }
	SPACE '\n'
	TAG BAR nil
	DATA 'end'
	ENDTAG BAR
	nil
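
For manipulating a document, one way to use such a token stream is to
fold it into nested tables.  The following is only a sketch under my
own assumptions: the node layout (name, attr, children), the "#ROOT"
node and the build_tree function are invented for illustration, and the
tokens are typed in by hand from the example above rather than read
from a file.

```lua
-- Token stream copied from the example above, as {token, name, value}.
local tokens = {
  {"PI", "XML", 'version="1.0" encoding="UTF-8" '},
  {"DATA", "\nFoo bar &lt;\n"},
  {"EMPTYTAG", "FOO", {A="val", B=1, C="more"}},
  {"SPACE", "\n"},
  {"TAG", "BAR", nil},
  {"DATA", "end"},
  {"ENDTAG", "BAR"},
}

-- Build a nested table of elements.  The node layout (name, attr,
-- children) is an assumption made for this sketch.
local function build_tree(tokens)
  local root = {name = "#ROOT", attr = {}, children = {}}
  local stack = {root}
  for _, t in ipairs(tokens) do
    local token, name, value = t[1], t[2], t[3]
    local top = stack[#stack]
    if token == "TAG" or token == "EMPTYTAG" then
      local node = {name = name, attr = value or {}, children = {}}
      table.insert(top.children, node)
      -- Only a TAG opens a new nesting level; EMPTYTAG closes itself.
      if token == "TAG" then table.insert(stack, node) end
    elseif token == "ENDTAG" then
      -- Ignore stray end tags, since the lexer does not validate.
      if #stack > 1 and top.name == name then table.remove(stack) end
    elseif token == "DATA" then
      table.insert(top.children, name)  -- the text is carried in "name"
    end
    -- SPACE, PI, COMMENT, DECL and BRDECL are skipped in this sketch.
  end
  return root
end

local tree = build_tree(tokens)
print(#tree.children, tree.children[3].name)
```

Note that the text comes through with entities such as &lt; still
unexpanded, as in the example output, so any decoding would have to be
done on top of this.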

You can find the lexer in this file:

	http://goron.de/~froese/siolib.c

The code is marked with WITH_XML_LEXER ifdefs.  As I said,
it's for Sol but porting it to Lua should be easy (the
corresponding Lua file would be src/lib/liolib.c).

Btw, the file also contains a CSV reader ("*c") and a binary
reader ("*b") ...

Ciao, ET.