A small motivating example. Was: small incompatibility

lua-l archive

Subject: A small motivating example. Was: small incompatibility
From: Rici Lake <lua@...>
Date: Mon, 24 Oct 2005 16:52:45 -0500

Here's a simplified XML entity to latin1 converter. To make it real, one would want to fill in more named entities and handle the cases where the numeric entity is >= 256, but it should be obvious where to fill those in:


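do
  local tochar = string.char
  local tonumber = tonumber
  -- Convert a numeric reference to a single character. Returns nil
  -- (leave the match intact) if the digits do not parse in that base.
  -- Values >= 256 are not handled here.
  local function convert(i, base)
    i = tonumber(i, base)
    if i then return tochar(i) end
  end
  local t = {
    -- Named entities: indexed by the first capture, then the second.
    ["&"] = {
      amp = "&",
      lt = "<",
      gt = ">",
    },
    -- Numeric references: called with both captures.
    ["&#"] = function(_, numref)
      -- A leading x or X marks a hexadecimal reference.
      if numref:match("^[xX]") then
        return convert(numref:sub(2), 16)
      else
        return convert(numref, 10)
      end
    end
  }
  function ent2latin1(str)
    return str:gsub("(&#?)(%w+);", t)
  end
end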

In order to test this function, I had to implement the string.gsub behaviour, of course. The rules I used are:


1) nil or false: leave the original string intact
2) true:         delete the match (i.e. replace with "")
3) string:       replace %x as in current implementation
4) table:        lookup the first capture in the table, and
                 continue. (If another table is encountered,
                 use the next capture as the index.)
5) function:     call the function with all captures. If it
                 returns a boolean or nil, treat it as above;
                 if it returns a string, use the string as the
                 replacement without interpreting %x. Otherwise,
                 throw an error.
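
For anyone who wants to play with these rules without applying a C patch, here is a rough pure-Lua model of them. It is only a sketch: the names xgsub, resolve and expand are made up for illustration, it assumes that strings reached through a table get %x expansion while strings returned from a function do not, and it assumes that any other value type (numbers included) is an error, which the rules above only state for function returns.

```lua
-- Hypothetical pure-Lua model of the rules above; xgsub, resolve and expand
-- are illustrative names, not part of the patch or the standard library.
local unpack = table.unpack or unpack    -- Lua 5.1 only has the global unpack

-- Rule 3: expand %0 (whole match), %1..%9 (captures) and %% in a template.
local function expand(template, whole, caps)
  return (template:gsub("%%(.)", function(c)
    if c == "0" then return whole end
    local n = tonumber(c)
    if not n then return c end           -- "%%" becomes "%"
    if caps[n] == nil then error("invalid capture index %" .. c) end
    return tostring(caps[n])
  end))
end

-- Turn the replacement value into text for one match, or nil to keep it.
local function resolve(repl, whole, caps)
  local v, i = repl, 1
  while type(v) == "table" do            -- rule 4: one capture per level
    v, i = v[caps[i]], i + 1
  end
  local fromfunction = type(v) == "function"
  if fromfunction then v = v(unpack(caps)) end   -- rule 5: all captures
  if v == nil or v == false then return nil end  -- rule 1: keep the match
  if v == true then return "" end                -- rule 2: delete the match
  -- Assumption: anything that is not a string by now is an error.
  if type(v) ~= "string" then error("invalid replacement value") end
  -- Strings from functions are used verbatim; the others get %x expansion.
  return fromfunction and v or expand(v, whole, caps)
end

function xgsub(s, pattern, repl)
  local anchored = pattern:sub(1, 1) == "^"
  local out, pos, count = {}, 1, 0
  while pos <= #s + 1 do
    local m = { s:find(pattern, pos) }   -- first, last, captures...
    local first, last = m[1], m[2]
    if not first then break end
    local whole = s:sub(first, last)
    local caps = { unpack(m, 3) }
    if #caps == 0 then caps[1] = whole end   -- no captures: use whole match
    out[#out + 1] = s:sub(pos, first - 1)    -- text before the match
    out[#out + 1] = resolve(repl, whole, caps) or whole
    count = count + 1
    if last >= first then
      pos = last + 1
    else                                 -- empty match: step over one char
      out[#out + 1] = s:sub(first, first)
      pos = first + 1
    end
    if anchored then break end           -- "^" patterns match at most once
  end
  out[#out + 1] = s:sub(pos)             -- text after the last match
  return table.concat(out), count
end

local t = { ["&"] = { amp = "&", lt = "<", gt = ">" } }
print(xgsub("a &lt; b &foo; c", "(&#?)(%w+);", t))
```

The unknown entity &foo; falls through to rule 1, so it is left intact but still counted as a match: the call prints "a < b &foo; c" followed by 2.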

There is a slight inconsistency between the handling of returns from functions and tables (which does, actually, complicate the code a bit), but it seems more compatible with current behaviour. In the context of some future Lua library version, I'd favour eliminating the capture conversion in string replacements, and providing a library function which did that instead (similar to the Lua version I posted earlier.)
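
To make the last suggestion concrete, a separate capture-expansion function might look roughly like the following. The name expandcaptures and its signature are invented here for illustration, and this is not the Lua version referred to above. It handles %1 through %9 and %%; %0 is deliberately unsupported because the whole match is not passed in.

```lua
-- Hypothetical helper: expand %1..%9 and %% in a template, given the
-- captures as extra arguments. Not an existing library function.
function expandcaptures(template, ...)
  local caps = { ... }
  return (template:gsub("%%(.)", function(c)
    local n = tonumber(c)
    if not n then return c end           -- "%%" becomes "%"
    if caps[n] == nil then error("invalid capture index %" .. c) end
    return tostring(caps[n])
  end))
end

-- A function replacement opts in to expansion explicitly. This runs on
-- stock Lua, where strings returned from a function are used verbatim.
local swapped = ("hello world"):gsub("(%w+) (%w+)", function(a, b)
  return expandcaptures("%2 %1", a, b)
end)
print(swapped)
```

With something like this available, gsub itself would never need to interpret % in a replacement string; callers who want it ask for it.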

When implementing this behaviour, I took the opportunity to implement a slight optimization: if no replacement occurs, the string is not copied. My sense is that this would speed up things like:

   x:gsub("[&<>]", enttable)

in the common case where there are no special characters to entify. In fact, it might prove to be faster to just use:

   x:gsub(".", enttable)

which would avoid the problem of having to know which characters need to be entified. This effectively replaces the repetitive parsing of the [&<>] pattern with a table lookup.
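
Both forms already run on an unpatched Lua, since stock gsub leaves a match alone when the table lookup yields nil. The sketch below compares them; enttable's contents, the function names and the timing helper are assumptions made for illustration. Note that it times the stock string.gsub, which does not have the no-copy optimization, so whatever numbers it prints say nothing about the patch.

```lua
-- Assumed entity table; the post does not show its contents.
local enttable = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

-- Only the listed characters are looked up.
function entify_class(x) return (x:gsub("[&<>]", enttable)) end

-- Every character is looked up; misses leave the character as it is.
function entify_any(x) return (x:gsub(".", enttable)) end

-- Crude timing helper: CPU seconds for n calls of f(s).
local function time(f, s, n)
  local t0 = os.clock()
  for _ = 1, n do f(s) end
  return os.clock() - t0
end

-- The common case: a string with nothing to entify.
local plain = ("no special characters here "):rep(100)
print("class:", time(entify_class, plain, 200))
print("any:  ", time(entify_any, plain, 200))
```

The two functions produce the same output for any input; only the amount of work per character differs.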

I haven't actually done any benchmarking. I used the above function as the test, since it seemed to include all the cases, but I haven't done thorough testing either:


> return ent2latin1("This is a simple string")
This is a simple string 0
> return ent2latin1("This is a &lt;less&gt; simple string")
This is a <less> simple string  2
> return ent2latin1("This is a &lt;less&gt; simple string &#42 &#x42")
This is a <less> simple string &#42 &#x42       2
> return ent2latin1("This is a &lt;less&gt; simple string &#42; &#x42;")
This is a <less> simple string * B      4

> return ent2latin1("This is a &lt;less&gtg; simple string &#42; &#x42;")
This is a <less&gtg; simple string * B  4

Anyway, the patch is at http://primero.ricilake.net/lstrlib.patch if anyone wants to try it out. (No guarantees, and take it as public domain)

R.

Follow-Ups:
- Re: A small motivating example. Was: small incompatibility, Rici Lake

References:
- Re: small incompatibility, Roberto Ierusalimschy
