- Subject: A small motivating example. Was: small incompatibility
 
- From: Rici Lake <lua@...>
 
- Date: Mon, 24 Oct 2005 16:52:45 -0500
 
Here's a simplified XML entity to Latin-1 converter. To make it real,
one would want to fill in more named entities and handle the cases
where the numeric entity is >= 256, but it should be obvious where to
fill those in:
do
  local tochar = string.char
  local tonumber = tonumber
  local function convert(i, base)
    i = tonumber(i, base)
    if i then return tochar(i) end
  end
  local t = {
    ["&"] = {
      amp = "&",
      lt = "<",
      gt = ">",
    },
    ["&#"] = function(_, numref)
      if numref:match("^[xX]") then
        return convert(numref:sub(2), 16)
      else
        return convert(numref, 10)
      end
    end
  }
  function ent2latin1(str)
    return str:gsub("(&#?)(%w+);", t)
  end
end
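For anyone who wants to experiment without patching lstrlib, here is a rough equivalent that should run on stock Lua 5.1 or later, where a function replacement that returns nil or false leaves the match untouched. It is only a sketch: the names `named`, `numeric` and `ent2latin1_stock` are made up for illustration, the pattern is rearranged to `&(#?)(%w+);` so that a single function can dispatch on the `#`, and numeric entities >= 256 are simply left alone rather than converted.

```lua
-- Sketch for stock Lua 5.1+; all names here are illustrative.
local named = { amp = "&", lt = "<", gt = ">" }

-- Convert the body of a numeric reference ("42" or "x42") to a
-- single Latin-1 character, or return nothing if it is not usable.
local function numeric(ref)
  local n
  if ref:match("^[xX]%x+$") then
    n = tonumber(ref:sub(2), 16)
  elseif ref:match("^%d+$") then
    n = tonumber(ref, 10)
  end
  -- Latin-1 only covers 0..255; anything else is left untouched.
  if n and n < 256 then return string.char(n) end
end

local function ent2latin1_stock(str)
  return str:gsub("&(#?)(%w+);", function(hash, body)
    if hash == "#" then
      return numeric(body)   -- nil keeps the original text
    end
    return named[body]       -- nil for unknown names keeps the original text
  end)
end
```

Like the original, it returns both the converted string and the number of matches, and that count includes matches that were left unchanged.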
In order to test this function, I had to implement the new string.gsub
behaviour, of course. The rules I used are:
1) nil or false: leave the original string intact
2) true:         delete the match (i.e. replace with "")
3) string:       expand %x references as in current implementation
4) table:        lookup the first capture in the table, and
                 continue. (If another table is encountered,
                 use the next capture as the index.)
5) function:     call the function with all captures. If it
                 returns a boolean or nil, treat it as above;
                 if it returns a string, use the string as the
                 replacement without interpreting %x. Otherwise,
                 throw an error.
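The five rules above can be modelled in plain Lua. This is a sketch of the rules as listed, not the patch itself: `gsub2`, `resolve` and `expand` are invented names, treating any other value type as an error is an assumption (the rules only say so for function results), and empty-match edge cases may not agree with every Lua version.

```lua
local unpack = table.unpack or unpack  -- Lua 5.2+ or Lua 5.1

-- Rule 3 helper: expand %0..%9 in a replacement string ("%%" gives "%").
local function expand(repl, caps, whole)
  return (repl:gsub("%%(.)", function(c)
    if c == "0" then return whole end
    local i = tonumber(c)
    if not i then return c end
    return caps[i] or error("invalid capture index %" .. c)
  end))
end

-- Turn the replacement argument into the text for one match.
local function resolve(repl, caps, whole)
  local idx = 1
  while type(repl) == "table" do          -- rule 4: successive captures
    repl = repl[caps[idx]]
    idx = idx + 1
  end
  if type(repl) == "function" then        -- rule 5: result is not expanded
    repl = repl(unpack(caps))
  elseif type(repl) == "string" then      -- rule 3: expand %x
    repl = expand(repl, caps, whole)
  end
  if repl == true then return "" end      -- rule 2: delete the match
  if not repl then return whole end       -- rule 1: keep the match
  if type(repl) ~= "string" then
    error("invalid replacement value (a " .. type(repl) .. ")")
  end
  return repl
end

local function gsub2(s, pat, repl)
  local out, n, pos = {}, 0, 1
  local anchored = pat:sub(1, 1) == "^"
  while pos <= #s + 1 do
    local found = { s:find(pat, pos) }
    local first, last = found[1], found[2]
    if not first then break end
    n = n + 1
    local whole = s:sub(first, last)
    local caps = {}
    for i = 3, #found do caps[#caps + 1] = found[i] end
    if #caps == 0 then caps[1] = whole end  -- no captures: use whole match
    out[#out + 1] = s:sub(pos, first - 1)
    out[#out + 1] = resolve(repl, caps, whole)
    if last >= first then
      pos = last + 1
    else                                    -- empty match: step one char
      out[#out + 1] = s:sub(first, first)
      pos = first + 1
    end
    if anchored then break end
  end
  out[#out + 1] = s:sub(pos)
  return table.concat(out), n
end

-- Usage: a nested table plus a function, as in the converter above.
local demo = {
  ["&"] = { amp = "&", lt = "<", gt = ">" },
  ["&#"] = function(_, numref)
    local n = tonumber(numref, 10)
    if n and n < 256 then return string.char(n) end
  end,
}
```

A string found through a table goes through `expand`, while a string returned by a function does not, which is the inconsistency between the two cases.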
There is a slight inconsistency between the handling of values returned
by functions and values found in tables (which does, actually,
complicate the code a bit), but it seems more compatible with current
behaviour. In the context of some future Lua library version, I'd
favour eliminating the capture conversion in string replacements and
providing a library function that does it instead (similar to the Lua
version I posted earlier).
When implementing this behaviour, I took the opportunity to implement a 
slight optimization: if no replacement occurs, the string is not 
copied. My sense is that this would speed up things like:
   x:gsub("[&<>]", enttable)
in the common case where there are no special characters to entify. In 
fact, it might prove to be faster to just use:
   x:gsub(".", enttable)
which would avoid the problem of having to know which characters need 
to be entified. This effectively replaces the repetitive parsing of the 
[&<>] pattern with a table lookup.
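To make the two variants concrete: stock Lua 5.1 and later accept a table as the replacement argument, with a missing key leaving the match unchanged, so both forms can be tried directly. The table contents and the wrapper names `entify_class` and `entify_any` below are only illustrative, and nothing here says which form is faster.

```lua
-- Illustrative entity table; extend as needed.
local enttable = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

-- Variant 1: the pattern names the characters that need entifying.
local function entify_class(x)
  return (x:gsub("[&<>]", enttable))
end

-- Variant 2: every character is looked up; characters missing from the
-- table yield nil, so they are left as they are.
local function entify_any(x)
  return (x:gsub(".", enttable))
end
```

With the second form the match count returned by gsub is the length of the string rather than the number of characters actually replaced, since every character matches.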
I haven't actually done any benchmarking. I used the above function as 
the test, since it seemed to include all the cases, but I haven't done 
thorough testing either:
> return ent2latin1("This is a simple string")
This is a simple string 0
> return ent2latin1("This is a <less> simple string")
This is a <less> simple string  2
> return ent2latin1("This is a <less> simple string * B")
This is a <less> simple string * B       2
> return ent2latin1("This is a <less> simple string * B")
This is a <less> simple string * B      4
> return ent2latin1("This is a <less>g; simple string * B")
This is a <less>g; simple string * B  4
Anyway, the patch is at http://primero.ricilake.net/lstrlib.patch if 
anyone wants to try it out. (No guarantees, and take it as public 
domain)
R.