Here's a simplified XML entity to Latin-1 converter. To make it real, one would want to fill in more named entities and handle the cases where the numeric entity is >= 256, but it should be obvious where to fill those in:

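  local tochar = string.char
  local tonumber = tonumber
  -- Convert a numeric string in the given base to a single character;
  -- returns nil (no replacement) if the string is not a valid number.
  local function convert(i, base)
    i = tonumber(i, base)
    if i then return tochar(i) end
  end
  local t = {
    -- named entities: the second capture indexes this table
    ["&"] = {
      amp = "&",
      lt = "<",
      gt = ">",
    },
    -- numeric references: called with both captures
    ["&#"] = function(_, numref)
      if numref:match("[xX]") then
        return convert(numref:sub(2), 16)
      else
        return convert(numref, 10)
      end
    end,
  }
  function ent2latin1(str)
    return str:gsub("(&#?)(%w+);", t)
  end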
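
The converter above depends on the extended gsub rules described in this message (a table nested inside the replacement table). For anyone without the patch, roughly the same conversion can probably be written against stock Lua 5.1 or later with a single replacement function. This is only a sketch: the names ent2latin1_stock and named are made up for illustration, and numeric references of 256 or more are simply left untouched.

```lua
-- Sketch for stock Lua 5.1+: one replacement function, no nested tables.
-- `named` and `ent2latin1_stock` are hypothetical names, not part of the patch.
local named = { amp = "&", lt = "<", gt = ">" }  -- extend as needed

function ent2latin1_stock(str)
  -- The outer parentheses drop gsub's second result (the match count).
  return (str:gsub("&(#?)(%w+);", function(hash, name)
    if hash == "" then
      return named[name]        -- nil keeps an unknown entity as-is
    end
    local n
    if name:match("^[xX]") then
      n = tonumber(name:sub(2), 16)
    else
      n = tonumber(name, 10)
    end
    if n and n < 256 then       -- string.char raises an error above 255
      return string.char(n)
    end
    -- falling through returns nothing, so the original text is kept
  end))
end

print(ent2latin1_stock("a &lt;less&gtg; simple string &#42; &#x42;"))
```

Returning nil or false from the function leaves the match in place, which is what lets unknown or malformed entities pass through unchanged.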

In order to test this function, I had to implement the new string.gsub behaviour, of course. The rules I used for the replacement value are:

1) nil or false: leave the original string intact
2) true:         delete the match (i.e. replace with "")
3) string:       replace %x as in current implementation
4) table:        look up the first capture in the table, and
                 continue with the result. (If another table is
                 encountered, use the next capture as the index.)
5) function:     call the function with all captures. If it
                 returns a boolean or nil, treat it as above;
                 if it returns a string, use the string as the
                 replacement without interpreting %x. Otherwise,
                 throw an error.
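
The five rules can be approximated in plain Lua, which may help when experimenting without the patch. The following is a sketch under stated assumptions, not the patch itself: gsubx, resolve and expand are hypothetical names, a pattern with no captures is treated as having the whole match as its first capture, and any value the rules do not mention (a number, say) raises an error.

```lua
local unpack = table.unpack or unpack   -- Lua 5.2+ / Lua 5.1

-- Rule 3: expand %0..%9 and %% in a replacement string.
local function expand(repl, whole, caps)
  return (repl:gsub("%%(.)", function(c)
    if c == "0" then return whole end
    local i = tonumber(c)
    if i then return tostring(assert(caps[i], "invalid capture index")) end
    return c                             -- "%%" becomes "%"
  end))
end

-- Returns the replacement text, or nil to keep the original match.
local function resolve(repl, whole, caps)
  local depth = 1
  while type(repl) == "table" do         -- rule 4: nested table lookup
    repl = repl[caps[depth]]
    depth = depth + 1
  end
  local from_function = false
  if type(repl) == "function" then       -- rule 5: call with all captures
    repl = repl(unpack(caps))
    from_function = true
  end
  if repl == nil or repl == false then return nil end   -- rule 1
  if repl == true then return "" end                     -- rule 2
  if type(repl) == "string" then
    if from_function then return repl end                -- rule 5: no %x
    return expand(repl, whole, caps)                     -- rule 3
  end
  error("invalid replacement value (a " .. type(repl) .. ")")
end

function gsubx(s, pat, repl)
  local anchored = pat:sub(1, 1) == "^"
  local out, count, pos = {}, 0, 1
  while pos <= #s + 1 do
    local r = { s:find(pat, pos) }       -- start, end, captures...
    local first, last = r[1], r[2]
    if not first then break end
    count = count + 1
    local whole, caps = s:sub(first, last), {}
    for i = 3, #r do caps[#caps + 1] = r[i] end
    if #caps == 0 then caps[1] = whole end
    out[#out + 1] = s:sub(pos, first - 1)
    out[#out + 1] = resolve(repl, whole, caps) or whole
    if last >= first then
      pos = last + 1
    else                                 -- empty match: step one character
      out[#out + 1] = s:sub(first, first)
      pos = first + 1
    end
    if anchored then break end
  end
  out[#out + 1] = s:sub(pos)
  return table.concat(out), count
end

-- Usage with a cut-down version of the converter table:
local t = {
  ["&"] = { amp = "&", lt = "<", gt = ">" },
  ["&#"] = function(_, numref)
    local hex = numref:match("^[xX](.*)")
    local n = tonumber(hex or numref, hex and 16 or 10)
    if n then return string.char(n) end
  end,
}
print(gsubx("&lt;less&gtg; &#42; &#x42;", "(&#?)(%w+);", t))
```

As in the stock function, the count reports matches rather than replacements, so an unknown entity such as &gtg; is still counted.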

There is a slight inconsistency between the handling of returns from functions and tables (which does, actually, complicate the code a bit), but it seems more compatible with current behaviour. In the context of some future Lua library version, I'd favour eliminating the capture conversion in string replacements, and providing a library function which did that instead (similar to the Lua version I posted earlier).

When implementing this behaviour, I took the opportunity to implement a slight optimization: if no replacement occurs, the string is not copied. My sense is that this would speed up things like:
   x:gsub("[&<>]", enttable)
in the common case where there are no special characters to entify. In fact, it might prove to be faster to just use:
   x:gsub(".", enttable)
which would avoid the problem of having to know which characters need to be entified. This effectively replaces the repetitive parsing of the [&<>] pattern with a table lookup.
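
Both forms can be tried on stock Lua 5.1 or later, where string.gsub already accepts a table as the replacement and keeps the original text when the lookup yields nil. A minimal sketch follows; the contents of enttable and the two wrapper names are assumptions for illustration, and nothing here measures which form is actually faster.

```lua
-- Hypothetical entity table: only the three characters from the example.
local enttable = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

-- Form 1: the pattern names the characters that need entifying.
function entify_class(x)
  return (x:gsub("[&<>]", enttable))
end

-- Form 2: every character is looked up; a nil result keeps the character,
-- so only the table has to know which characters are special.
function entify_any(x)
  return (x:gsub(".", enttable))
end

print(entify_class("a < b && c > d"))
print(entify_any("a < b && c > d"))
```

With the second form, adding another character to entify means adding a table entry only; the pattern never changes.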

I haven't actually done any benchmarking. I used the above function as the test, since it seemed to include all the cases, but I haven't done thorough testing either:

> return ent2latin1("This is a simple string")
This is a simple string 0
> return ent2latin1("This is a &lt;less&gt; simple string")
This is a <less> simple string  2
> return ent2latin1("This is a &lt;less&gt; simple string &#42 &#x42")
This is a <less> simple string &#42 &#x42       2
> return ent2latin1("This is a &lt;less&gt; simple string &#42; &#x42;")
This is a <less> simple string * B      4
> return ent2latin1("This is a &lt;less&gtg; simple string &#42; &#x42;")
This is a <less&gtg; simple string * B  4

Anyway, the patch is at if anyone wants to try it out. (No guarantees, and take it as public domain)
