2014-10-20 10:12 GMT+02:00 steve donovan <steve.j.donovan@gmail.com>:
> On Mon, Oct 20, 2014 at 9:39 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>> document to a Lua table, do things to it, and convert
>> back to XML. For this last step, it seems not very hard
>> to roll one's own. But surely it must have been already
>> done somewhere else?
>
> Penlight's xml module has a pretty-printer that speaks the same object
> model as luaexpat. Cheerfully stolen from Prosody :)

Thanks, xml.tostring works. Is the choice of string delimiter
configurable?
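
For comparison, here is roughly what rolling one's own would look like. This is only a minimal sketch, not Penlight's code: it assumes the LOM table layout used by luaexpat (tag in `tag`, attribute names in the array part of `attr` with values under the string keys, children in the array part of the node), and it assumes that "string delimiter" means the quote character around attribute values. The names `lom_tostring`, `escape_text`, `escape_attr` and the `quote` parameter are made up for illustration.

```lua
-- Minimal LOM-to-XML serializer (a sketch, not Penlight's implementation).

-- Escape the characters that are special in XML text content.
local function escape_text(s)
  s = s:gsub("&", "&amp;"):gsub("<", "&lt;"):gsub(">", "&gt;")
  return s
end

-- Escape an attribute value, including whichever quote delimits it.
local function escape_attr(s, quote)
  s = escape_text(s)
  if quote == '"' then
    s = s:gsub('"', "&quot;")
  else
    s = s:gsub("'", "&apos;")
  end
  return s
end

-- `quote` (hypothetical parameter) is the attribute delimiter, '"' by default.
local function lom_tostring(node, quote)
  quote = quote or '"'
  if type(node) == "string" then
    return escape_text(node)
  end
  local out = { "<", node.tag }
  local attr = node.attr or {}
  -- LOM keeps attribute names in the array part (source order)
  -- and their values under the string keys.
  for _, name in ipairs(attr) do
    out[#out + 1] = " " .. name .. "=" .. quote
      .. escape_attr(attr[name], quote) .. quote
  end
  if #node == 0 then
    out[#out + 1] = "/>"            -- no children: self-closing tag
  else
    out[#out + 1] = ">"
    for _, child in ipairs(node) do
      out[#out + 1] = lom_tostring(child, quote)
    end
    out[#out + 1] = "</" .. node.tag .. ">"
  end
  return table.concat(out)
end

-- A hand-built LOM fragment modelled on one <text> element of the input.
local doc = {
  tag = "text",
  attr = { "top", "font", top = "187", font = "0" },
  { tag = "b", attr = {}, "PDF to Markdown" },
}

print(lom_tostring(doc))       -- double-quoted attributes
print(lom_tostring(doc, "'"))  -- single-quoted attributes
```

No indentation or pretty-printing is attempted here; the point is only that the delimiter becomes an ordinary argument once the serializer is one's own.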

xml.basic_parse doesn't work [1], but I can always use lom.parse.
It does not parse <! and <? though; is xml.basic_parse supposed to?
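
If the `<?` and `<!` lines do turn out to be the problem, one possible workaround is to strip the prolog before parsing. The sketch below is untested against Penlight itself, and whether it avoids the error is an assumption; `strip_prolog` is a made-up helper name. It handles only a simple DOCTYPE without an internal subset (no `[ ... ]` part).

```lua
-- Drop the XML declaration, other processing instructions and a simple
-- DOCTYPE declaration so that only element markup reaches the parser.
local function strip_prolog(s)
  s = s:gsub("<%?.-%?>", "")         -- <?xml ... ?> and other PIs
  s = s:gsub("<!DOCTYPE[^>]*>", "")  -- DOCTYPE without internal subset
  s = s:gsub("^%s+", "")             -- leading whitespace left behind
  return s
end

-- The first lines of the input, cut down to the root element.
local sample = [[
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.24.5">
</pdf2xml>
]]

print(strip_prolog(sample))
```

The stripped string could then be handed to xml.basic_parse or lom.parse as before. Note that this also removes any processing instructions or comments-like `<?...?>` sequences that appear later in the document, which is acceptable for this input but not in general.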

[1] On the attached XML input, I get:
/usr/local/share/lua/5.3/pl/xml.lua:619: attempt to perform arithmetic on a nil value (local 'j')
stack traceback:
    /usr/local/share/lua/5.3/pl/xml.lua:619: in function 'basic_parse'
    stdin:1: in main chunk
    [C]: in ?
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.24.5">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
	<fontspec id="0" size="19" family="Times" color="#000000"/>
	<fontspec id="1" size="12" family="Times" color="#000000"/>
	<fontspec id="2" size="12" family="Times" color="#000000"/>
	<fontspec id="3" size="15" family="Times" color="#000000"/>
	<fontspec id="4" size="12" family="Times" color="#000000"/>
<text top="187" left="201" width="201" height="19" font="0"><b>PDF to Markdown</b></text>
<text top="234" left="201" width="451" height="13" font="1">This tool converts a <i>suitable </i>PDF file to a <i>reasonable </i>Markdown file.</text>
<text top="282" left="201" width="122" height="16" font="3"><b>Requirements</b></text>
<text top="320" left="223" width="62" height="15" font="1">• pandoc</text>
<text top="339" left="223" width="384" height="15" font="1">• poppler-utils (actually only the program pdftohtml).</text>
<text top="389" left="201" width="69" height="16" font="3"><b>Method</b></text>
<text top="428" left="201" width="516" height="13" font="1">The common factor is <b>XML</b>. Pandoc can read a subset of Docbook, which is an</text>
<text top="446" left="201" width="516" height="13" font="1">XML format, and PDFtoHTML can write XML. So it’s a question of translating</text>
<text top="464" left="201" width="204" height="13" font="1">one dialect of XML to another.</text>
<text top="1044" left="455" width="7" height="13" font="1">1</text>
</page>
<outline>
<item page="1">PDF to Markdown</item>
<outline>
<item page="1">Requirements</item>
<item page="1">Method</item>
</outline>
</outline>
</pdf2xml>