Re: Parsing big compressed XML files

lua-l archive

Subject: Re: Parsing big compressed XML files
From: Luiz Henrique de Figueiredo <lhf@...>
Date: Thu, 3 Apr 2014 09:28:27 -0300

> I was wondering if you are aware of an efficient way to parse a big
> compressed XML file with Lua (or LuaJIT).
> The files store the wikipedia dumps:
> http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name
> pages-meta-history).
> The parsing consists in extracting the timestamp of each of the
> versions for a page in the dump.

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

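-- io.lines reads the file as is, so the dump must already be decompressed.
for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
	-- Print only the lines that hold one of the fields of interest.
	if l:match("<timestamp>")
	or l:match("<title>")
	or l:match("<id>")
	then
		print(l)
	end
end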

Of course, you will probably need more sophisticated parsing.
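
As a rough sketch of what slightly more sophisticated parsing might look
like, the code below captures the text between the tags instead of printing
whole lines, and reads the compressed file through a pipe. The names field,
scan and scan_bz2 are made up for illustration. It assumes that each field
sits on a single line, that the tag names contain no pattern magic
characters, that the dump is bzip2-compressed, that a bzip2 command-line
tool is on the PATH, and that io.popen is available on your platform.

```lua
-- Hypothetical helper: return the text inside <tag>...</tag> when both
-- tags are on the same line, or nil otherwise.
local function field(line, tag)
	return line:match("<" .. tag .. ">(.-)</" .. tag .. ">")
end

-- Walk any line iterator and call emit(tag, value) for each field found.
-- Nothing is accumulated here, so memory use stays flat on huge dumps.
local function scan(lines, tags, emit)
	for l in lines do
		for _, tag in ipairs(tags) do
			local v = field(l, tag)
			if v then emit(tag, v) end
		end
	end
end

-- Assumption: bzip2 is installed and path contains no single quotes.
-- The file is decompressed on the fly; no temporary file is written.
local function scan_bz2(path, tags, emit)
	local p = assert(io.popen("bzip2 -dc '" .. path .. "'", "r"))
	scan(p:lines(), tags, emit)
	p:close()
end
```

For example, scan_bz2(path, { "title", "timestamp" }, print) would print
one tag and value pair per matching line. Since scan accepts any line
iterator, io.lines(path) works just as well for a file that has already
been decompressed.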

Follow-Ups:
- Re: Parsing big compressed XML files, Valerio Schiavoni

References:
- Parsing big compressed XML files, Valerio Schiavoni
