> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).
> The parsing consists of extracting the timestamp of each of the
> versions of a page in the dump.

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

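-- Print every line that contains a <timestamp>, <title>, or <id> tag.
-- The file is read line by line, so it never has to fit in memory.
for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
	if l:match("<timestamp>")
	or l:match("<title>")
	or l:match("<id>")
	then
		print(l)
	end
end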

Of course, you will probably need more sophisticated parsing.
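
As one possible next step, here is a minimal sketch that pulls out the
timestamp values themselves and pairs each one with the most recent
page title. It still assumes one field per line, as above. The names
each_timestamp and callback are made up for this example, and the
commented usage assumes a bzip2-compressed dump (the hypothetical
"dump.xml.bz2"), a bzcat program on the PATH, and a platform where
io.popen is available, so treat it as a starting point rather than a
tested solution for the real dumps.

```lua
-- Call callback(title, timestamp) for every <timestamp> line, using the
-- title of the most recent <title> line seen so far.
-- `lines` is any iterator function that returns one line per call.
local function each_timestamp(lines, callback)
	local title
	for l in lines do
		local t = l:match("<title>(.-)</title>")
		if t then
			title = t
		else
			local ts = l:match("<timestamp>(.-)</timestamp>")
			if ts then
				callback(title, ts)
			end
		end
	end
end

-- Possible usage on a compressed dump (assumptions: bzcat is installed,
-- "dump.xml.bz2" is a placeholder file name):
--   local pipe = assert(io.popen("bzcat dump.xml.bz2"))
--   each_timestamp(pipe:lines(), print)
--   pipe:close()
```

Reading through a pipe keeps the decompression outside Lua and the
memory use constant, since only one line is held at a time.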