lua-l archive


Here is a little trick to save memory and time when manipulating strings.

This trick is probably obvious to seasoned Lua programmers, but I believe it can be useful for beginners to state some facts about Lua strings, as they may not be obvious from the manual. And in any case, they are worth restating.
Note that I wrote the given code in Lua 5.0 syntax, but the information is valid for Lua 4.0 as well.

Some time ago, I wrote a little routine to convert French accented characters to HTML entities, for use in a demo Lua CGI program.

I give it below. No rocket science, simple (simplistic?) and obvious, nothing worth mentioning here, except to show how to improve it...

>>>
-- Windows to HTML
local entities =
{
--	['&'] = "&amp;",
	['<'] = "&lt;",
	['>'] = "&gt;",
	-- French entities (the most common ones)
	['à'] = "&agrave;",
	['â'] = "&acirc;",
	['é'] = "&eacute;",
	['è'] = "&egrave;",
	['ê'] = "&ecirc;",
	['ë'] = "&euml;",
	['î'] = "&icirc;",
	['ï'] = "&iuml;",
	['ô'] = "&ocirc;",
	['ö'] = "&ouml;",
	['ù'] = "&ugrave;",
	['û'] = "&ucirc;",
-- ... shortened
}

function EncodeEntities1(toEncode)
	if toEncode == nil or type(toEncode) ~= "string" then
		return ''
	end

	local EncodeHighAscii = function (char)
		local code = string.byte(char)
		if code > 127 then
			return string.format("&#%d;", code)
		else
			return char
		end
	end

	local encodedString = toEncode
	-- First encode '&' char, to avoid re-encoding already encoded chars
	encodedString = string.gsub(encodedString, '&', "&amp;")
	-- Encode known entities
	for char, entity in pairs(entities) do
		encodedString = string.gsub(encodedString, char, entity)
	end
	-- Encode unknown high Ascii characters to numerical entities
	encodedString = string.gsub(encodedString, '(.)', EncodeHighAscii)
	return encodedString
end
<<<

I used a similar routine to convert Windows accents to Mac ones.

>>>
-- PC (Windows) to Mac
local chars =
{
	['\224'] = 'à',	-- &agrave;
	['\226'] = 'â',	-- &acirc;
	['\231'] = 'ç',	-- &ccedil;
	['\233'] = 'é',	-- &eacute;
	['\232'] = 'è',	-- &egrave;
	['\234'] = 'ê',	-- &ecirc;
	['\235'] = 'ë',	-- &euml;
	['\238'] = 'î',	-- &icirc;
	['\239'] = 'ï',	-- &iuml;
	['\244'] = 'ô',	-- &ocirc;
	['\246'] = 'ö',	-- &ouml;
	['\249'] = 'ù',	-- &ugrave;
	['\251'] = 'û',	-- &ucirc;
-- ... shortened
}

function EncodeChars(toEncode)
	if toEncode == nil or type(toEncode) ~= "string" then
		return ''
	end

	local encodedString = toEncode
	-- Encode known chars
	for charPC, charMac in pairs(chars) do
		encodedString = string.gsub(encodedString, charPC, charMac)
	end
	encodedString = string.gsub(encodedString, '\n', '\r')	-- Read in text mode, EOLs are only \n
	return encodedString
end
<<<

I shortened the tables; they will be messed up anyway, because I am sending this article from my wife's Mac ;-) That's why I made the second routine, by the way...

Now, if you know a few things about the way Lua manages strings, and think a bit about these routines, you will see that they are inefficient.

In Lua, strings are immutable. That means that if you change even one byte of a string (using string.gsub, or concatenating another string, the only ways to alter a string in standard Lua besides string.upper and string.lower), Lua will make a copy of the string and apply the change to the copy. If the new string is assigned to the same variable, the old string is left without references and the garbage collector will take care of it.
This is because Lua computes a hash value for every string and checks whether it already exists in its internal string table. If it does, Lua just makes a new reference to that string, avoiding duplicates. This also quickens string comparisons. But it means that Lua must compute this hash value for each and every string, even "intermediate" ones, and cannot change an existing string in place, because it may have multiple references.

The downside is that if you make multiple distinct changes to a string, a
lot of copies will be generated, 
eating memory and taking time to collect.

The solution, when possible, is to make as many changes as possible in a single call: instead of using multiple gsubs, try to make only one, and instead of looping on .. concatenation, use table.concat if possible. In these calls, even very complex ones, Lua manipulates the string in a C buffer, without rehashing; only the final, resulting string becomes a Lua string.
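As a small illustration of the table.concat advice, here is a hypothetical sketch (the names build_slow and build_fast are mine, not from the routines above):

```lua
-- Building a string by repeated concatenation: every iteration
-- creates (and hashes) a new intermediate string, which the GC
-- must collect later.
local function build_slow(n)
	local s = ""
	for i = 1, n do
		s = s .. "x"
	end
	return s
end

-- Collecting the pieces in a table and joining once at the end:
-- only the final result becomes an interned Lua string.
local function build_fast(n)
	local parts = {}
	for i = 1, n do
		table.insert(parts, "x")
	end
	return table.concat(parts)
end
```

Both produce the same string; only the allocation pattern differs.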

So, a way to improve this routine is to collapse all the calls to gsub into one. For this purpose, I build a regular expression that matches all the accented characters at once:

>>>
function EncodeEntities2(toEncode)
	if toEncode == nil or type(toEncode) ~= "string" then
		return ''
	end

	local EncodeToEntities = function (char)
		return entities[char] or char
	end

	local EncodeHighAscii = function (char)
		local code = string.byte(char)
		if code > 127 then
			return string.format("&#%d;", code)
		else
			return char
		end
	end

	local encodedString = toEncode
	local encodingString = "(["
	entities['&'] = "&amp;" -- Not needed if directly put in the table
	-- Add all characters to encode to the encodingString
	for char, entity in pairs(entities) do
		encodingString = encodingString .. char
	end
	encodingString = encodingString .. "])"
	-- Encode known characters to entities
	encodedString = string.gsub(encodedString, encodingString, EncodeToEntities)
	-- Encode unknown high Ascii characters to numerical entities
	encodedString = string.gsub(encodedString, '(.)', EncodeHighAscii)
	return encodedString
end
<<<

Note that this method cannot be used for the reverse conversion (HTML to Windows; it can still be used for Mac to Windows), because the entities span several characters, and current Lua regular expressions don't allow alternative strings, only alternative characters. One way to bypass this is to use a third-party RE library; another is to search for "&([^;]+);" and check whether the captured string is in the table.
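That second workaround might be sketched like this (hypothetical code; the reverseEntities table and the DecodeEntities name are mine, and the table is shortened):

```lua
-- Hypothetical reverse table: entity names (without the & and ;)
-- mapped back to characters. Shortened.
local reverseEntities = {
	["lt"] = "<",
	["gt"] = ">",
	["amp"] = "&",
}

function DecodeEntities(toDecode)
	return (string.gsub(toDecode, "&([^;]+);", function(name)
		-- Numerical entities like &#233; are decoded with string.char
		local _, _, code = string.find(name, "^#(%d+)$")
		if code then
			return string.char(tonumber(code))
		end
		-- Returning nil for unknown names keeps the original text
		return reverseEntities[name]
	end))
end
```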

There is still a little inefficiency in the second algorithm, as I do multiple concatenations. But table.concat doesn't work here unless I build a secondary table, and since the table is quite small, I don't think it is worth it.
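For completeness, the secondary-table variant would look something like this (hypothetical sketch; as said, probably not worth it for a table this small):

```lua
-- Shortened table, for illustration.
local entities = { ['<'] = "&lt;", ['>'] = "&gt;", ['&'] = "&amp;" }

-- Collect the keys once, then join them in a single operation,
-- instead of concatenating onto encodingString inside the loop.
-- Note: keys that are pattern magic characters (like '-' or ']')
-- would need to be escaped before going into the character class.
local keys = {}
for char in pairs(entities) do
	table.insert(keys, char)
end
local encodingString = "([" .. table.concat(keys) .. "])"
```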

There is another, more annoying inefficiency: there is no way to search for high-Ascii characters in a regular expression. So I have to make another pass, checking each character of the string. That is quite slow and generates another copy of the string.

If that's so, why not check every character and see if it has to be
converted? This gives the ultimate 
routine:

>>>
function EncodeEntities3(toEncode)
	if toEncode == nil or type(toEncode) ~= "string" then
		return ''
	end

	local EncodeToEntities = function (char)
		local entity = entities[char]
		if entity == nil then
			local code = string.byte(char)
			if code > 127 then
				entity = string.format("&#%d;", code)
			end
		end
		return entity or char
	end

	entities['&'] = "&amp;"
	local encodedString = string.gsub(toEncode, '(.)', EncodeToEntities)
	return encodedString
end
<<<

This version is quite powerful, as it shows that complex processing can be done inside the replacement function passed to gsub.
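Another illustration, unrelated to entity encoding, of how much logic can live in the replacement function (hypothetical example; the vars table and the Expand name are mine):

```lua
-- Expand "$name" variables from a table, leaving unknown ones intact.
local vars = { user = "philho", lang = "Lua" }

function Expand(template)
	return (string.gsub(template, "%$(%w+)", function(name)
		return vars[name]	-- nil keeps the original "$name" text
	end))
end
```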

To test these routines, I put them all in one file (that's the reason why I
had to add the '&' entry instead of 
putting it directly in the table). And I wrote the following code:

>>>
print("Initial GC: ", gcinfo())
dofile"3EncodeEntities.lua"

local stdin, stdout
if not io.input("BigFile.txt") then
	return -- Actually Lua will choke on input
end
stdin = io.read("*a")
io.input()	-- Restore default reading

-- Clean up
collectgarbage()
-- Set limit high enough to avoid having GC during algorithm.
-- This allows checking the maximum memory use.
collectgarbage(30000)
print("GC after reading file: ", gcinfo())
win32.StartFineClock()
if EncodeEntities1 then
	stdout = EncodeEntities1(stdin)
--	io.write(stdout)
end
print("First algo running time: ", (win32.GetFineClockTime()))
print("GC after first algo: ", gcinfo())

-- Clean up
stdout = nil
collectgarbage()
-- Reset the high limit
collectgarbage(30000)

print("Garbage collected: ", gcinfo())
win32.StartFineClock()
if EncodeEntities2 then
	stdout = EncodeEntities2(stdin)
-- 	io.write(stdout)
end
print("Second algo running time: ", (win32.GetFineClockTime()))
print("GC after second algo: ", gcinfo())

-- Clean up
stdout = nil
collectgarbage()
-- Reset the high limit
collectgarbage(30000)

print("Garbage collected: ", gcinfo())
win32.StartFineClock()
if EncodeEntities3 then
	stdout = EncodeEntities3(stdin)
-- 	io.write(stdout)
end
print("Third algo running time: ", (win32.GetFineClockTime()))
print("GC after third algo: ", gcinfo())

-- Final clean up
stdout = nil
stdin = nil
collectgarbage()
print("Final GC: ", gcinfo())
<<<

BigFile.txt is a repetition of French text, with accents and some <, > and
&, plus some complete sets of 
high Ascii characters, up to 100,000 chars.

The win32 table is from my own Win32 library; you can replace the calls to StartFineClock/GetFineClockTime with calls to os.clock(). My functions are a bit more precise, using the Pentium performance counter.
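Without the win32 library, a portable (if much coarser) replacement can be sketched with os.clock (hypothetical helper, not from my test script):

```lua
-- Minimal stopwatch built on os.clock. os.clock measures CPU time
-- and is far less precise than the Pentium performance counter.
local startTime = 0

function StartClock()
	startTime = os.clock()
end

function GetClockTime()
	return os.clock() - startTime
end
```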

To measure memory use, I set the garbage-collection limit quite high, so the GC won't kick in during the encoding and gcinfo will return accurate values. I collect garbage between the various algorithms, to restart on fresh ground.

The results are (on my slow PII 300 with Win98):
C:\Dev\LuaPhiLhoSoft> LuaWin32 Test3EncodeEntities.lua
Initial GC:     28      56
GC after reading file:  156     30000
First algo running time:        2.8918805209608
GC after first algo:    21230   30000
Garbage collected:      226     30000
Second algo running time:       0.9387368209323
GC after second algo:   1440    30000
Garbage collected:      210     30000
Third algo running time:        0.62521078127357
GC after third algo:    828     30000
Final GC:       232     465

We clearly see an improvement in both speed and memory use.
Note that the advantage of the third algorithm isn't absolute. If we remove the high-Ascii encoding (e.g. in the PC-to-Mac conversion, putting the EOL conversion in the table), the second algorithm wins, because it doesn't have to run the Lua function on each and every character of the file:

C:\Dev\LuaPhiLhoSoft> LuaWin32 Test3EncodeEntities.lua
Initial GC:     28      56
GC after reading file:  156     30000
First algo running time:        2.2509822491158
GC after first algo:    20666   30000
Garbage collected:      224     30000
Second algo running time:       0.31407331668315
GC after second algo:   814     30000
Garbage collected:      209     30000
Third algo running time:        0.35992054845036
GC after third algo:    893     30000
Final GC:       228     457

Note that if we keep the GC limit at its default value, the memory consumption isn't as dramatic as shown, as memory is collected during the loop of gsub calls. And we don't see a significant increase in time, because GC is quite fast here.

This article is a bit lengthy, but I hope it will shed some light on the internals of Lua and how to make the best use of this wonderful language.

I will probably put this article on the wiki if I find time, unless somebody bored and with some free time puts it there before me, possibly after translating it into good English...
Remarks, corrections, clarifications are welcome.

-- 
--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--
Philippe Lhoste (Paris -- France)
Professional programmer and amateur artist
http://jove.prohosting.com/~philho/
--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--=#=--
