2015-05-18 8:22 GMT+02:00 Dirk Laurie <dirk.laurie@gmail.com>:
2015-05-18 2:24 GMT+02:00 Tim Channon <tc@gpsl.net>:
> There are many XML decoding libraries of varying degrees of capability and
> portability. Their focus is on decoding.
> Discussion of regenerating identical XML from the decoded form seems to get
> lost, particularly when the user has no idea what the XML means and simply
> wants to intercept textually known entities within a context.

I found it quite easy to regenerate identical XML from the output of Roberto's
parser on the lua-users.org Wiki, but in a very specialized context: the XML
files were those produced by `pdftohtml -xml`, for which the DTD is
self-contained and a mere 49 lines long. A far cry from SVG's DTD, which runs
to over 300 lines, mainly pulling in other files, I'll admit.
Here is what I did: I tweaked Roberto's code to provide a metatable
for every table-valued item.
    element_mt = { __tostring =
      function(s)
        if type(s)=='string' then return s end
        assert(s.tag)
        -- look up the renderer for this tag: either a string or a function
        local render = render[s.tag]
        if type(render)=='string' then return render
        elseif type(render)=='function' then return render(s)
        else error("Can't convert type '"..s.tag.."' to a string")
        end
      end }
Then the regenerate routine becomes:
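    local tconcat = table.concat  -- assumed alias for table.concat
    local function assemble(document)
      local s = {}
      local box = document.first
      -- walk the linked list of boxes, converting each one via __tostring
      while box do
        s[#s+1] = tostring(box)
        box = box.next
      end
      return tconcat(s,'\n')
    end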
The global or upvalue "render" is
    local render = {
      pdf2xml = lines,
      page = lines,
      document = assemble,
      text = contents,
      b = function(s) return '**'..contents(s)..'**' end,
      i = function(s) return '*'..contents(s)..'*' end,
      a = contents,
      outline = "<outline>",
      fontspec = "<fontspec>" }
with
    contents = bind_concat""
    lines = bind_concat"\n"
where
    function bind_concat(sep)
    --- table.concat with bound separator and `tostring` filter
      return function(t)
        local u={}
        if type(t)=='string' then return t end
        for k,v in ipairs(t) do u[k]=tostring(v) end
        return tconcat(u,sep)
      end
    end
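
The fragments above can be tried without the parser. This is a minimal sketch that builds a tiny tree by hand. The node layout (a `tag` field plus array-part children) and the helper `node` are assumptions made for illustration, and may not match what the tweaked parser actually produces.

```lua
local tconcat = table.concat

-- table.concat with a bound separator and a `tostring` filter
local function bind_concat(sep)
  return function(t)
    if type(t) == 'string' then return t end
    local u = {}
    for k, v in ipairs(t) do u[k] = tostring(v) end
    return tconcat(u, sep)
  end
end

local contents = bind_concat ""
local lines = bind_concat "\n"

-- Walk the linked list document.first -> box.next -> ...
local function assemble(document)
  local s = {}
  local box = document.first
  while box do
    s[#s+1] = tostring(box)
    box = box.next
  end
  return tconcat(s, '\n')
end

local render = {
  document = assemble,
  page = lines,
  text = contents,
  b = function(s) return '**' .. contents(s) .. '**' end,
  i = function(s) return '*' .. contents(s) .. '*' end,
  fontspec = "<fontspec>",
}

local element_mt = { __tostring = function(s)
  local r = render[s.tag]
  if type(r) == 'string' then return r
  elseif type(r) == 'function' then return r(s)
  else error("Can't convert type '" .. tostring(s.tag) .. "' to a string")
  end
end }

-- Hypothetical helper: build a node by hand instead of running the parser.
local function node(tag, ...)
  return setmetatable({ tag = tag, ... }, element_mt)
end

local page1 = node('page', node('fontspec'),
                   node('text', 'Plain and ', node('b', 'bold')))
local page2 = node('page', node('text', node('i', 'italic'), ' tail'))
page1.next = page2
local doc = node('document')
doc.first = page1

local rendered = tostring(doc)
print(rendered)
-- <fontspec>
-- Plain and **bold**
-- *italic* tail
```

Every conversion goes through `tostring`, so plain strings and tagged tables can be mixed freely as children.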
Note that "render" depends on the DTD. Writing a module that
can generate "render" from an arbitrary DTD was not
part of my purposes; writing a different "render" that converts
to, say, Markdown rather than plain text is, but it is not yet at a
level where I can share it.
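
To illustrate why only "render" has to change for a different output format, here is a rough sketch that swaps in a table emitting XML-style tags. It is not the converter mentioned above. The names `make_mt`, `wrap` and `node` are hypothetical, the node layout is assumed as before, and attributes and character escaping are ignored, so it falls short of regenerating identical XML.

```lua
local tconcat = table.concat

-- Build a __tostring metatable that dispatches on `tag` through `render`.
local function make_mt(render)
  return { __tostring = function(s)
    local r = render[s.tag]
    if type(r) == 'string' then return r
    elseif type(r) == 'function' then return r(s)
    else error("Can't convert type '" .. tostring(s.tag) .. "' to a string")
    end
  end }
end

-- Concatenate the children of a node, converting each with tostring.
local function contents(t)
  local u = {}
  for k, v in ipairs(t) do u[k] = tostring(v) end
  return tconcat(u)
end

-- Hypothetical renderer: wrap the children in <tag>...</tag>.
-- Attributes and escaping of '<' and '&' are deliberately left out.
local function wrap(s)
  return '<' .. s.tag .. '>' .. contents(s) .. '</' .. s.tag .. '>'
end

local xml_render = { text = wrap, b = wrap, i = wrap }
local xml_mt = make_mt(xml_render)

local function node(tag, ...)
  return setmetatable({ tag = tag, ... }, xml_mt)
end

local t = node('text', 'Plain and ', node('b', 'bold'))
local xml = tostring(t)
print(xml)  --> <text>Plain and <b>bold</b></text>
```

The dispatcher and the tree stay the same; only the table passed to `make_mt` decides what the output looks like.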