On 21-Mar-05, at 7:54 AM, PA wrote:

Alternatively, somebody, somewhere, somehow must have written this a dozen times already. Is there not a little code sample somewhere to show how to decode an XML string in Lua? Sigh...

Probably, although the issues are subtle. I don't address any of them here; this is simply a working reimplementation of the same transformation.

do
  local ents = {
    lt = '<',
    gt = '>',
    amp = '&',
    quot = '"',
    apos = "'"
  }

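  -- highest Unicode code point; not used below, since only codes
  -- under 256 are decoded (each to a single byte)
  local maxutf8 = tonumber('10FFFF', 16)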

  local function entity2char(hash, str)
    if hash == '#' then
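      -- rewrite an XML hex reference (x...) as a C-style literal (0x...)
      -- so that tonumber can parse it; decimal references pass through
      local utfcode = tonumber((string.gsub(str, '^x', '0x')))
      -- only codes that fit in one byte are decoded; anything else
      -- falls through and is returned unchanged
      if utfcode and utfcode < 256 then
        return string.char(utfcode)
      end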
    elseif ents[str] then
      return ents[str]
    end
    return '&'..hash..str..';'
  end

  function decode(str)
    return str and string.gsub(str, '&(#?)(%w+);', entity2char)
  end
end

--- some tests
=decode '&amp;apos; is how you write &apos;'
=decode '&#38;amp; is an ampersand.'
=decode '&quot;I said &apos;Stop right there!&apos; &amp; I &lt;strong&gt;meant it!&lt;/strong&gt;&quot; the webmaster shouted, htmlifying instinctively'
-- Codes and noncodes
=decode 'Some invalid numeric escapes include &#7b2;, &#x24g;'
=decode "Take out the &garbage;! Don't leave it for ma&#xf1;ana! The sooner the &#x3b2;!"
-- What was I saying about iso-8859-1?
=decode 'ma&#xf1;ana or ma&#xc3;&#xb1;ana?'

-->

> =decode '&amp;apos; is how you write &apos;'
&apos; is how you write '
> =decode '&#38;amp; is an ampersand.'
&amp; is an ampersand.
> =decode '&quot;I said &apos;Stop right there!&apos; &amp; I &lt;strong&gt;meant it!&lt;/strong&gt;&quot; the webmaster shouted, htmlifying instinctively'
"I said 'Stop right there!' & I <strong>meant it!</strong>" the webmaster shouted, htmlifying instinctively
> -- Codes and noncodes
> =decode 'Some invalid numeric escapes include &#7b2;, &#x24g;'
Some invalid numeric escapes include &#7b2;, &#x24g;
> =decode "Take out the &garbage;! Don't leave it for ma&#xf1;ana! The sooner the &#x3b2;!"
Take out the &garbage;! Don't leave it for ma?ana! The sooner the &#x3b2;!
> -- What was I saying about iso-8859-1?
> =decode 'ma&#xf1;ana or ma&#xc3;&#xb1;ana?'
ma?ana or mañana?
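
The last test shows the catch: a numeric reference under 256 comes out as a single ISO-8859-1 byte, and anything larger (such as &#x3b2;) is left alone. If the output is meant to be UTF-8, one possible variant is sketched below. It is only a sketch under some assumptions: Lua 5.1 or later (hex literals, string.match), and the names utf8_encode and decode_utf8 are made up for illustration. It also parses the digits more strictly, since the tonumber call above would appear to accept forms like &#1e2; as the number 100.

```lua
-- Sketch: decode XML entities to UTF-8 rather than single bytes.
-- Assumes Lua 5.1+; utf8_encode and decode_utf8 are hypothetical names.

local named = { lt = '<', gt = '>', amp = '&', quot = '"', apos = "'" }
local MAX_CODEPOINT = 0x10FFFF

-- Encode one code point (0 .. 0x10FFFF) as a UTF-8 byte sequence.
local function utf8_encode(cp)
  local floor = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 0x1000),
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    return string.char(0xF0 + floor(cp / 0x40000),
                       0x80 + floor(cp / 0x1000) % 0x40,
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

local function entity2utf8(hash, body)
  if hash == '#' then
    -- accept only "x" followed by hex digits, or plain decimal digits
    local digits, base = string.match(body, '^x(%x+)$'), 16
    if not digits then
      digits, base = string.match(body, '^(%d+)$'), 10
    end
    -- the length cap avoids integer wrap-around on absurdly long input;
    -- it also rejects valid references padded with many leading zeros
    if digits and #digits <= 8 then
      local cp = tonumber(digits, base)
      local surrogate = cp >= 0xD800 and cp <= 0xDFFF
      if cp <= MAX_CODEPOINT and not surrogate then
        return utf8_encode(cp)
      end
    end
  elseif named[body] then
    return named[body]
  end
  -- anything unrecognised is returned unchanged
  return '&' .. hash .. body .. ';'
end

function decode_utf8(str)
  -- the parentheses drop gsub's second result (the match count)
  return str and (string.gsub(str, '&(#?)(%w+);', entity2utf8))
end
```

With this version, decode_utf8 'ma&#xf1;ana' yields the two-byte sequence C3 B1 for the ñ, and &#x3b2; becomes CE B2. It still ignores the other subtleties (it does not check XML's restrictions on control characters, for instance).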