> I recently moved from Lua 4 to Lua 5 and I've noticed a huge loss in
> performance in table construction.


> With 3 such tables of 1700 entries each in one file, the file takes
> close to 30 seconds to load on a PIII 800.

> Note that I tested this using the console interpreter that comes with
> the Lua5 distribution.  I simply loaded the tables by typing
> dofile("test.lua").  Test.lua does nothing more than create the tables.
> No processing is occurring.

Interesting problem. I had some trouble reproducing it.

I could not observe the phenomenon with either 1,700 entries or 17,000
entries, but I did manage to get it to happen with tables of 150,000
entries. It might be useful to see your actual data files.

There is one significant difference between the console interpreters for
Lua 5 and Lua 4: the Lua 5 console interpreter is much cleverer with
interactive lines. This should not affect your test in the way you describe
it, but it would significantly degrade performance if the console
interpreter were started as follows:

  lua -i < test.lua

As far as I can see, this effectively causes test.lua to be recompiled
once per line, or 1700 times in your example. Worse, each recompilation
covers everything up to the current line, so the total work grows
quadratically with the length of the file. That would not be a very good
idea.
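
As a back-of-the-envelope check (taking 1700 as the line count, per your
example; the real constant factor depends on the compiler):

  -- total lines compiled if every prefix of an n-line file is recompiled
  local n = 1700
  print(n * (n + 1) / 2)   --> 1445850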

However, typing

  dofile "test.lua"

will not trigger this issue.

With files of 150,000 points, I observed an apparently odd phenomenon: the
first time the file is loaded, it loads quite rapidly (approximately 5
seconds on my machine, which is slightly faster than yours). The second
time, it takes 10 seconds; the third time, 15 seconds. By the seventh run
it is up to 60 seconds. This struck me as interesting.
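
For anyone who wants to reproduce this, a minimal harness along these
lines will do (the file name points.lua matches the example below; os.time
only has one-second resolution, which is plenty at this scale):

  -- time successive loads of the same data file
  for run = 1, 7 do
    local t0 = os.time()
    dofile("points.lua")
    print("run " .. run .. ": " .. os.difftime(os.time(), t0) .. " seconds")
  end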

What I think is going on here is an interaction between Lua, the malloc
library, and the OS virtual memory system. Working on this assumption, I
configured the malloc library I use on that machine (not a particularly
good one, I suspect) to inform the OS virtual memory system about freed
pages. This completely solved the problem: with that configuration, the
file loaded in 5 seconds the first time and 10 seconds every subsequent
time.

The explanation for the latter figure (10 seconds instead of 5) is simple.
The file consists of the following:

points1 = { {1, 2, 3}, {4, 5, 6}, ... }
points2 = { {1, 2, 3}, {4, 5, 6}, ... }
points3 = { {1, 2, 3}, {4, 5, 6}, ... }
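
For concreteness, a file of that shape can be generated with something
like the following (the values here are placeholders, not the original
data):

  -- hypothetical generator for a comparable test file
  local f = assert(io.open("points.lua", "w"))
  for t = 1, 3 do
    f:write("points", t, " = {\n")
    for i = 1, 150000 do
      f:write("{", i, ", ", i + 1, ", ", i + 2, "},\n")
    end
    f:write("}\n")
  end
  f:close()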

The first time it is run, it creates three globals whose values are
enormous tables. According to Lua, total memory consumption at the end of
this operation is about 45 megabytes, or roughly 100 bytes per point; this
is what I would expect given the compilation options I was using. (Lua 4
uses much more memory, by the way.) However, Lua does not know about
malloc overhead, so the actual memory consumption is somewhat larger:
enough to force the machine to use swap space (it has 128 MB of RAM).
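
You can read Lua's own figure from inside the interpreter; in Lua 5.0 the
standard gcinfo function returns memory in use and the collection
threshold, both in kilobytes:

  dofile("points.lua")
  local used, threshold = gcinfo()   -- standard in Lua 5.0
  print(used .. " KB in use, threshold " .. threshold .. " KB")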

The second time it runs, it is going to replace the values of the globals.
However, the garbage collector will not free the old tables until after
the new ones are created: the old values are only dropped when the
assignment happens, and the assignment happens only after the new tables
have been fully constructed. In fact, Lua's "double the threshold every
time" algorithm makes it highly likely that the collection will not take
place until after the second run: the collector runs before the file is
compiled the second time, so the threshold is set to twice the usage at
that point, and since the usage at that point is three tables, there is
now room for six tables before a second garbage collection. I hope that
makes sense.
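
In round numbers, using the 45 MB figure from above (this relies on Lua
5.0's default threshold behaviour; the exact numbers will vary):

  collectgarbage()       -- frees the second-previous run's tables; usage is
                         -- now ~45 MB (the previous run's three tables), so
                         -- the threshold is reset to roughly 90 MB
  dofile("points.lua")   -- builds three new tables; usage climbs toward
                         -- ~90 MB, so the *next* run triggers a collection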

After a couple of runs, Lua settles into a routine:
  garbage collect
  compile points.lua
  create three new tables

This is actually pessimal, because the collection marks tables which are
just about to disappear; but only I know that, not Lua. I *should* have
made the source file look like this:

points1, points2, points3 = nil  -- all three globals become nil
collectgarbage()                 -- reclaim the old tables now
points1 = {...}

Should, that is, had I been writing real code rather than testing what was
going on. But it is a good general hint: if you are generating huge
objects, delete them after you are finished using them and then call
collectgarbage(). This will speed things up.
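
In script form, the hint might look like this (the function name is mine,
purely illustrative):

  function reload_points()
    points1, points2, points3 = nil   -- unanchor the huge old tables
    collectgarbage()                  -- reclaim them before allocating anew
    dofile("points.lua")              -- build the new tables in fresh memory
  end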

Anyway, back to the problem with gradually increasing run times:

In that configuration, there is a lot of ignorance. Lua does not know my
intentions, because I haven't told it. The malloc library doesn't know
Lua's intentions because there is no interface to the malloc library to let
it know what is going on. And the OS virtual memory system doesn't know the
malloc library's intentions because by default the malloc library doesn't
tell it (this is because the extra system calls slow things down except in
the case where excessive swapping is going on).

The key is what happens at the point where Lua collects garbage.

First, it marks as reachable the current values of points1, points2, and
points3. These are the tables created by the previous dofile("points.lua"),
the ones I have not told Lua I am shortly about to delete. So it touches
every table in those collections, all 450,000 of them, by writing a mark
into each. The marking is done in table order, which is creation order.

Then it sweeps memory, which consists of the second-previous run's tables
(unmarked) and the previous run's tables (now marked). It touches the
latter tables again in order to unmark them, and then free()s the former
ones. (Sweeping is done in reverse creation order.) The malloc library
does not tell the VM system that the freed memory is garbage.

So from the VM system's perspective, the freed memory is not only in use
but also modified: the last access to it was the sweep step of the
previous garbage collection, which wrote to the tables by unmarking them.
Lua has to access each of these pages in order to free the objects on it,
and since the computer is at this point out of RAM, touching each page
forces some other page to be written to swap and the page about to be
freed to be read back in.

The VM system would prefer to evict a page which has not been modified,
but initially there are none. Once every object on a page has been freed,
the page stops being modified (provided the malloc library doesn't write
to it), but it has also been recently used; whether recency matters more
than cleanliness depends on the VM system. The important point is that the
VM system does not know that the page will never be needed again. If it
knew that, it would certainly neither write the page out nor attempt to
read it back in later; that is exactly the difference made by letting the
malloc library tell the OS what is going on.

Now, Lua is going to create new tables. It will do this by asking the
malloc library for memory, and the malloc library will hand over the
memory it has just received back. As mentioned above, this causes that
memory to be swapped back in, forcing some other modified page (which in
fact will never be needed again) to be swapped out.

I hope that is all vaguely clear.

So the essence of the situation is that three separate things are going on,
all of which contribute to forcing the machine into constant page-swapping:

First, programmer error. The programmer (i.e. me) didn't bother to release
the big objects before garbage collection was likely to happen. That is
something only the programmer can know, since Lua can't read the
programmer's mind.

Second, Lua's non-optimal (or even pessimal) interaction with VM systems.
Internal marks are a really bad idea in a virtual memory environment,
because the marks need to be read from garbage objects, forcing those
objects to be paged back in even though they are garbage. An external mark
array interacts much better with VM, because garbage objects are never
touched by the garbage collector at all (see the sketch after this list).

Third, the malloc library's poor interaction with the VM system: it does
not tell the VM system that freed pages are no longer useful, even though
it (the malloc library) knows that. (On many Unix systems the mechanism
for saying so is madvise with MADV_DONTNEED.)
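
To illustrate the second point, here is a toy sketch in plain Lua,
standing in for the collector's C internals; this shows the idea only, not
Lua's actual implementation:

  -- internal marks: the mark lives inside each object, so the sweep must
  -- read (and, for survivors, write) every object, touching its page even
  -- when the object is garbage
  local function sweep_internal(all)
    local live = {}
    for _, obj in ipairs(all) do
      if obj.marked then
        obj.marked = false        -- unmarking dirties the survivor's page
        table.insert(live, obj)
      end                         -- dead objects still had obj.marked read
    end
    return live
  end

  -- external marks: marks live in a separate table, so garbage objects
  -- are never dereferenced at all
  local function sweep_external(all, marks)
    local live = {}
    for _, obj in ipairs(all) do
      if marks[obj] then
        marks[obj] = nil          -- touches only the mark table
        table.insert(live, obj)
      end
    end
    return live
  end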

Now, why is this different between Lua 4 and Lua 5? I don't really know.
Lua 4 uses more memory to store these tables, so one would expect it to
cause more swapping. However, it is possible that its architecture causes
it to collect garbage at a different point in the cycle, in which case it
might reclaim storage sooner and actually interact better with the VM
system. I can't tell without more work, because the extra memory
requirement exceeds the swap space on my machine, and I don't feel like
rebuilding the filesystem right now... sorry.

R.