On 18/05/2015 10:41, Dirk Laurie wrote:
2015-05-18 8:22 GMT+02:00 Dirk Laurie <>:
2015-05-18 2:24 GMT+02:00 Tim Channon <>:
There are many XML decoding libraries of varying degrees of capability and
portability. Their focus is decoding.

Discussion of regenerating identical XML from the decoded form seems to get
lost, particularly when the user has no idea of the XML's meaning and simply
wants to intercept textually known entities within a context.

I found it quite easy to regenerate identical XML from the output of Roberto's
parser on the Wiki, but in a very specialized context: the XML
files were those produced by `pdftohtml -xml`, for which the DTD is
self-contained and a mere 49 lines long. A far cry from SVG's over 300 lines
mainly pulling in other files, I'll admit.

Here is what I did: I tweaked Roberto's code to provide a metatable
for every table-valued item.

element_mt = { __tostring = function(s)
       if type(s)=='string' then return s end
       local render = render[s.tag]
       if type(render)=='string' then return render
       elseif type(render)=='function' then return render(s)
       else error("Can't convert type '"..s.tag.."' to a string")
       end
    end }
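
A minimal sketch of how such a metatable might be attached to every table-valued item, assuming the parser yields nested tables with a `tag` field and children in the array part. The `attach` helper, the two-entry `render` table and the sample nodes are hypothetical illustrations, not Roberto's code or the tweak actually used.

```lua
-- Hypothetical render table: one entry per tag, either a string or a function.
local render = {
    b = function(node) return '**' .. table.concat(node) .. '**' end,
    outline = "<outline>",
}

local element_mt = { __tostring = function(s)
    local r = render[s.tag]
    if type(r) == 'string' then return r
    elseif type(r) == 'function' then return r(s)
    else error("Can't convert type '" .. tostring(s.tag) .. "' to a string")
    end
end }

-- Recursively give every table-valued item the shared metatable.
-- (Assumed helper; a real parser could instead call setmetatable
-- at the point where it creates each node.)
local function attach(node)
    for _, child in ipairs(node) do
        if type(child) == 'table' then attach(child) end
    end
    return setmetatable(node, element_mt)
end

local bold = attach{ tag = 'b', 'hello' }
local outline = attach{ tag = 'outline' }
print(tostring(bold))
print(tostring(outline))
```

Sharing one metatable across all nodes keeps the per-tag behaviour in `render`, so changing the output format means swapping that table rather than touching the tree.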

Then the regenerate routine becomes:

local function assemble(document)
    local s = {}
    local box = document.first
    while box do
       s[#s+1] = tostring(box)
       box =
    end
    return tconcat(s,'\n')
end

The global or upvalue "render" is

local render = {
    pdf2xml = lines,
    page = lines,
    document = assemble,
    text = contents,
    b = function(s) return '**'..contents(s)..'**' end,
    i = function(s) return '*'..contents(s)..'*' end,
    a = contents,
    outline = "<outline>",
    fontspec = "<fontspec>" }


contents = bind_concat""
lines = bind_concat"\n"


function bind_concat(sep)
--- table.concat with bound separator and `tostring` filter
    return function(t)
       local u={}
       if type(t)=='string' then return t end
       for k,v in ipairs(t) do u[k]=tostring(v) end
       return tconcat(u,sep)
    end
end
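
A small self-contained sketch of the bound-separator idea, assuming `tconcat` is a local alias for `table.concat` (the post does not show its definition). The `stamp` value is a hypothetical stand-in for a parsed element with a `__tostring` metamethod. In a single file, `bind_concat` has to be defined before `contents` and `lines` are created from it.

```lua
local tconcat = table.concat   -- assumed meaning of `tconcat`

-- table.concat with bound separator and `tostring` filter
local function bind_concat(sep)
    return function(t)
        if type(t) == 'string' then return t end
        local u = {}
        for k, v in ipairs(t) do u[k] = tostring(v) end
        return tconcat(u, sep)
    end
end

local contents = bind_concat ""
local lines = bind_concat "\n"

-- Any value with a __tostring metamethod is rendered through tostring.
local stamp = setmetatable({}, { __tostring = function() return "<stamp>" end })

local joined = contents{ "a", 1, stamp }
local stacked = lines{ "first", "second" }
local passthrough = contents "already a string"
print(joined)
print(stacked)
print(passthrough)
```

The `tostring` filter is what lets nested elements render themselves: plain `table.concat` would raise an error on a table-valued child.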

Note that "render" depends on the DTD. Writing a module that
can generate "render" from an arbitrary DTD was not
part of my purposes; writing a different "render" that converts
to, say, Markdown rather than plain text is, but not yet to the
level where I can share it.

This seems to confirm the problem is not trivial: it is one of many similar computing situations where the intent and the tools do not match well.

Given what seemed intractable, and given limited resources, I have turned to a direct ad hoc approach on what is dubious XML (*). Perhaps critically, this allows strict (stable) sequence to be maintained. I am able to identify sections and data on the fly in one pass, writing back out as the input flows past. The decoded XML could be stored. The rewrite is not quite accurate yet, as one elusive bug remains, but nevertheless several SVG display tools accept a rewritten file without complaint, and validation produces a warning, not an error. Recoloring the subset of polygons to blue produces blue. Relating the polygons to the original co-ordinates is, ah well, what fun: they are translated. Resolvable.
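
For readers wanting a concrete picture, here is a rough sketch of that kind of ad hoc rewrite in Lua. It is not the code actually used. It assumes the file fits in one string, that attribute values are double-quoted and contain no `>`, and that recoloring means replacing an existing `fill` attribute; the `recolor` function and the sample input are hypothetical.

```lua
-- Rewrite the fill attribute of every <polygon ...> tag, copying all other
-- text through untouched so that sequence and formatting are preserved.
local function recolor(svg, color)
    -- %f[%s/>] requires whitespace, '/' or '>' right after the tag name,
    -- so a longer name that merely starts with "polygon" is not matched.
    return (svg:gsub('<polygon%f[%s/>][^>]*>', function(tag)
        -- Replace an existing fill="..." value; tags without one are unchanged.
        return (tag:gsub('(%sfill=")[^"]*(")', function(head, tail)
            return head .. color .. tail
        end, 1))
    end))
end

local input = [[
<svg>
<polygon fill="red" points="0,0 1,0 1,1"/>
<text>fill="red" stays</text>
<polygon points="0,0 2,0 2,2" fill="#fff"></polygon>
</svg>
]]
local output = recolor(input, 'blue')
print(output)
```

Because only the matched tags are touched, everything else, including whitespace and the mix of closing styles, is written back exactly as it came in.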

I don't like ad hoc when formal ought to be easy.

I use Roberto's parser elsewhere so your writing is welcome.

I'll continue dabbling and you've given me some things to try.

Thank you Dirk.

* Self-closing tags, even where they are not expected, are not illegal by XML rules, and here there is erratic usage of named closing tags and self-closing tags for the same element within one file, e.g. both <text> </text> and <text />.
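
A short sketch of how both forms might be recognized in a single pass, assuming tags contain no `>` inside attribute values and ignoring comments and CDATA sections. The `tags` function is a hypothetical illustration only.

```lua
-- Classify each tag as "open", "close" or "empty" (self-closing).
local function tags(xml)
    local found = {}
    for slash, name, rest in xml:gmatch('<(/?)([%a_][%w_.:%-]*)([^>]*)>') do
        local kind
        if slash == '/' then
            kind = 'close'
        elseif rest:sub(-1) == '/' then
            kind = 'empty'
        else
            kind = 'open'
        end
        found[#found + 1] = kind .. ':' .. name
    end
    return found
end

local summary = table.concat(tags('<text>a</text><text /><text/>'), ' ')
print(summary)
```

A rewriter that treats an `empty` tag as an `open` immediately followed by a `close` can then handle either style with the same code path, while still echoing the original text unchanged.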