- Subject: A small motivating example. Was: small incompatibility
- From: Rici Lake <lua@...>
- Date: Mon, 24 Oct 2005 16:52:45 -0500
Here's a simplified XML entity to Latin-1 converter. To make it real,
one would want to fill in more named entities and handle the case
where the numeric entity is >= 256, but it should be obvious where to
fill those in:
do
  local tochar = string.char
  local tonumber = tonumber

  local function convert(i, base)
    i = tonumber(i, base)
    if i then return tochar(i) end
  end

  local t = {
    ["&"] = {
      amp = "&",
      lt = "<",
      gt = ">",
    },
    ["&#"] = function(_, numref)
      if numref:match("[xX]") then
        return convert(numref:sub(2), 16)
      else
        return convert(numref, 10)
      end
    end
  }

  function ent2latin1(str)
    return str:gsub("(&#?)(%w+);", t)
  end
end
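For anyone without a patched library, a rough equivalent that runs on unpatched Lua 5.1 or later might look like the sketch below. It folds the nested table into a single function replacement. The names ent2latin1_plain and named are made up for illustration, and leaving out-of-range or malformed numeric entities untouched is an assumption, not something the converter above specifies.

```lua
-- Hypothetical stock-Lua variant: a single function replacement, so it
-- does not depend on nested-table lookup in gsub.
local named = { amp = "&", lt = "<", gt = ">" }

local function ent2latin1_plain(str)
  -- Extra parentheses drop gsub's second result (the match count).
  return (str:gsub("&(#?)(%w+);", function(hash, body)
    if hash == "" then
      return named[body]        -- nil keeps the original match
    end
    local n
    if body:match("^[xX]") then
      n = tonumber(body:sub(2), 16)
    else
      n = tonumber(body, 10)
    end
    if n and n < 256 then
      return string.char(n)
    end
    -- Assumption: anything else falls through and returns nothing,
    -- which gsub treats as "leave the match as it is".
  end))
end
```

The ent2latin1 version above, by contrast, needs the nested-table lookup, which stock gsub does not provide.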
To test ent2latin1, I had to implement the string.gsub behaviour it
relies on, of course. The rules I used for the replacement value are:
1) nil or false: leave the original match intact
2) true: delete the match (i.e. replace it with "")
3) string: expand %x escapes as in the current implementation
4) table: look up the first capture in the table, and
   continue. (If another table is encountered,
   use the next capture as the index.)
5) function: call the function with all captures. If it
   returns a boolean or nil, treat it as above;
   if it returns a string, use the string as the
   replacement without interpreting %x. Otherwise,
   throw an error.
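The five rules can be sketched in plain Lua, built on string.find, roughly as follows. This is only an illustration of the rules as listed, not the C code in the patch: gsub2, resolve and expand are made-up names, and the sketch makes no attempt to treat a leading "^" anchor the way the real gsub does.

```lua
local unpack = table.unpack or unpack   -- Lua 5.2+ or Lua 5.1

-- Rule 3: %0 is the whole match, %1..%9 are captures, %% is a literal %.
local function expand(repl, whole, caps)
  return (repl:gsub("%%([%d%%])", function(c)
    if c == "%" then return "%" end
    if c == "0" then return whole end
    return caps[tonumber(c)] or error("invalid capture index %" .. c)
  end))
end

-- Turn a replacement value into a string, or nil for "keep the match".
local function resolve(repl, whole, caps)
  local i = 1
  while type(repl) == "table" do          -- rule 4: index by successive captures
    repl = repl[caps[i]]
    i = i + 1
  end
  local from_function = false
  if type(repl) == "function" then        -- rule 5: call with all captures
    repl = repl(unpack(caps))
    from_function = true
  end
  if repl == nil or repl == false then return nil end   -- rule 1
  if repl == true then return "" end                    -- rule 2
  if type(repl) == "string" then
    -- Strings returned by functions are used verbatim; others get %x expansion.
    if from_function then return repl end
    return expand(repl, whole, caps)
  end
  error("invalid replacement value (a " .. type(repl) .. ")")
end

local function gsub2(s, pattern, repl)
  local out, n, pos = {}, 0, 1
  while pos <= #s + 1 do
    local found = { s:find(pattern, pos) }
    local first, last = found[1], found[2]
    if not first then break end
    local whole = s:sub(first, last)
    local caps = {}
    for k = 3, #found do caps[#caps + 1] = found[k] end
    if #caps == 0 then caps[1] = whole end   -- no captures: use the whole match
    n = n + 1
    out[#out + 1] = s:sub(pos, first - 1)
    out[#out + 1] = resolve(repl, whole, caps) or whole
    if last < first then
      -- Empty match: copy one character and step past it.
      out[#out + 1] = s:sub(first, first)
      pos = first + 1
    else
      pos = last + 1
    end
  end
  out[#out + 1] = s:sub(pos)
  return table.concat(out), n
end
```

As in the stock gsub, the second result counts matches, whether or not each one was actually replaced.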
There is a slight inconsistency between the handling of values returned
by functions and values found in tables (which does, actually, complicate
the code a bit), but it seems more compatible with current behaviour. In
the context of some future Lua library version, I'd favour eliminating the
capture conversion in string replacements and providing a library function
that does it instead (similar to the Lua version I posted earlier).
When implementing this behaviour, I took the opportunity to implement a
slight optimization: if no replacement occurs, the string is not
copied. My sense is that this would speed up things like:
x:gsub("[&<>]", enttable)
in the common case where there are no special characters to entify. In
fact, it might prove to be faster to just use:
x:gsub(".", enttable)
which would avoid the problem of having to know which characters need
to be entified. This effectively replaces the repetitive parsing of the
[&<>] pattern with a table lookup.
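To make the two variants concrete, here is a small sketch that should run on stock Lua 5.1 or later, where a table lookup that yields nil already leaves the match alone. The contents of enttable and the wrapper names entify_class and entify_any are assumptions for illustration. It says nothing about which variant is faster.

```lua
-- Assumed contents for enttable; the message does not spell them out.
local enttable = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

-- Variant 1: the pattern names the special characters explicitly.
local function entify_class(x)
  return x:gsub("[&<>]", enttable)
end

-- Variant 2: match every character and let the table decide.
-- Characters with no entry look up to nil and are left unchanged.
local function entify_any(x)
  return x:gsub(".", enttable)
end
```

Both return the same string. Only the match count differs, since "." matches every character while "[&<>]" matches just the special ones.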
I haven't actually done any benchmarking. I used the above function as
the test, since it seemed to include all the cases, but I haven't done
thorough testing either:
> return ent2latin1("This is a simple string")
This is a simple string 0
> return ent2latin1("This is a &lt;less&gt; simple string")
This is a <less> simple string 2
> return ent2latin1("This is a <less> simple string * B")
This is a <less> simple string * B 2
> return ent2latin1("This is a <less> simple string * B")
This is a <less> simple string * B 4
> return ent2latin1("This is a <less>g; simple string * B")
This is a <less>g; simple string * B 4
Anyway, the patch is at http://primero.ricilake.net/lstrlib.patch if
anyone wants to try it out. (No guarantees, and take it as public
domain)
R.