2015-05-18 8:22 GMT+02:00 Dirk Laurie:
> 2015-05-18 2:24 GMT+02:00 Tim Channon:
>> There are many XML decoding libraries of varying degrees of capability and
>> portability. The focus of these is decode.
>> Discussion of regenerating identical XML from the decode seems to get lost
>> and particularly when the user has no idea of the XML meaning, simply wants
>> to intercept textually known entities within a context.

I found it quite easy to regenerate identical XML from the output of Roberto's
parser on the Wiki, but in a very specialized context: the XML
files were those produced by `pdftohtml -xml`, for which the DTD is
self-contained and a mere 49 lines long. A far cry from SVG's over 300 lines
mainly pulling in other files, I'll admit.

Here is what I did: I tweaked Roberto's code to provide a metatable
for every table-valued item.

element_mt = { __tostring = function(s)
      if type(s)=='string' then return s end
      -- the right-hand side still refers to the outer "render" table
      local render = render[s.tag]
      if type(render)=='string' then return render
      elseif type(render)=='function' then return render(s)
      else error("Can't convert type '"..s.tag.."' to a string")
      end
   end }
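
As a minimal sketch of how that dispatch behaves, assuming every parsed
element is a table with a `tag` field and its children in the array part:
the two-entry `render` table and the `new_element` helper below are
hypothetical stand-ins for illustration, not the tweaked parser itself.

```lua
-- Hypothetical render table: a string entry is emitted as-is,
-- a function entry is called with the element.
local render = {
   fontspec = "<fontspec>",
   b = function(s) return '**' .. tostring(s[1]) .. '**' end,
}

local element_mt = { __tostring = function(s)
   local r = render[s.tag]
   if type(r) == 'string' then return r
   elseif type(r) == 'function' then return r(s)
   else error("Can't convert type '" .. tostring(s.tag) .. "' to a string")
   end
end }

-- Hypothetical helper standing in for the tweak to the parser:
-- every table-valued item gets the metatable.
local function new_element(tag, ...)
   return setmetatable({ tag = tag, ... }, element_mt)
end

local bold = new_element('b', 'hello')
local spec = new_element('fontspec')
print(tostring(bold))   --> **hello**
print(tostring(spec))   --> <fontspec>
```

A tag with no entry in `render` raises the error, which is a cheap way to
find out which element types still need a handler.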

Then the regenerate routine becomes:

local function assemble(document)
   local s = {}
   local box = document.first
   while box do
      s[#s+1] = tostring(box)
      box =
   end
   return tconcat(s,'\n')
end
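
A runnable sketch of the same loop, under a loud assumption: the post only
shows `document.first`, so the `next` field used here to step from one box
to the following one is hypothetical, as are the stand-in boxes.

```lua
local tconcat = table.concat

-- Assumption: top-level boxes form a singly linked list through a
-- hypothetical `next` field; only `document.first` appears in the post.
local function assemble(document)
   local s = {}
   local box = document.first
   while box do
      s[#s+1] = tostring(box)   -- relies on each box's __tostring
      box = box.next
   end
   return tconcat(s, '\n')
end

-- Stand-in boxes whose __tostring simply returns a stored string.
local mt = { __tostring = function(b) return b.text end }
local third  = setmetatable({ text = 'three' }, mt)
local second = setmetatable({ text = 'two', next = third }, mt)
local first  = setmetatable({ text = 'one', next = second }, mt)

print(assemble{ first = first })
--> one
--> two
--> three
```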

The global or upvalue "render" is

local render = {
   pdf2xml = lines,
   page = lines,
   document = assemble,
   text = contents,
   b = function(s) return '**'..contents(s)..'**' end,
   i = function(s) return '*'..contents(s)..'*' end,
   a = contents,
   outline = "<outline>",
   fontspec = "<fontspec>" }


contents = bind_concat""
lines = bind_concat"\n"


function bind_concat(sep)
--- table.concat with bound separator and `tostring` filter
   return function(t)
      local u={}
      if type(t)=='string' then return t end
      for k,v in ipairs(t) do u[k]=tostring(v) end
      return tconcat(u,sep)
   end
end
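
A small usage sketch of the bound-separator idea, assuming `tconcat` is a
local alias for `table.concat`; the `child` value is a hypothetical element
with its own `__tostring`. The `tostring` filter matters because
`table.concat` itself accepts only strings and numbers.

```lua
local tconcat = table.concat

local function bind_concat(sep)
   return function(t)
      if type(t) == 'string' then return t end
      local u = {}
      for k, v in ipairs(t) do u[k] = tostring(v) end
      return tconcat(u, sep)
   end
end

local contents = bind_concat ""
local lines = bind_concat "\n"

-- Hypothetical child element that knows how to print itself.
local child = setmetatable({}, { __tostring = function() return 'bold' end })

print(contents{ 'some ', child, ' text' })   --> some bold text
print(lines{ 'first', 'second' })            --> first, then second on the next line
print(contents 'already a string')           --> already a string
```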

Note that "render" depends on the DTD. Writing a module that
can generate "render" from an arbitrary DTD was not part of my
purposes; writing a different "render" that converts to, say,
Markdown rather than plain text is, but it is not yet at the
level where I can share it.