Based on my comment in the thread on nil as a table key, I did a proof-of-concept implementation of the suggested iteration protocol.

The benchmarking I did shows speedups of 6-10%, but one test seems to indicate a speedup of 33%.

The diff is not very big; it can be viewed here: <https://trac.bulix.org/projects/rici/changeset/84> or downloaded as a diff <https://trac.bulix.org/projects/rici/changeset/84?format=diff>. The latter URL includes the benchmarking program I used (test/benchmark.lua).

It does not include my patch for yieldable iterators, but the equivalent patch would be much simpler for the modified iteration protocol since the termination test could be implemented with an unmodified OP_TEST. I didn't spend a lot of time thinking about optimizations or things that could be removed. All I did was change OP_TFORLOOP to use the new protocol, and modify pairs(), string.gmatch() and io.lines(). (ipairs() needs to be modified as well, but I didn't do that. Oops.)

The implementation differs slightly from the description in my earlier message: it stops on any false return from the iterator rather than only on a nil return. Once the state is separated from the iteration variables this change seemed feasible, but it may not be the best choice.

Here are the benchmark results, compiled with -O2 on gcc 3.3.3 running on FreeBSD:

1) Simple for loop; iterates over 10 numeric and 10 non-numeric keys in the same table. Presumably the speedup is simply from not having to copy the control variable in OP_TFORLOOP.
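
The loop being timed looks roughly like this (a sketch of the shape only, not the exact benchmark code; the key names are made up):

local t = {}
for i = 1, 10 do t[i] = i end           -- 10 numeric keys
for i = 1, 10 do t["k" .. i] = i end    -- 10 non-numeric keys

local sum = 0
for k, v in pairs(t) do                 -- the generic for driven by OP_TFORLOOP
  sum = sum + v
end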

beta-5.1: 7.02 seconds (1e6 repetitions)
modified: 6.59 seconds

saving: 6%

2) string.gmatch. Uses the standard idiom str:gmatch("%S+") to iterate over a string with 12 short words. Here the speedup is rather more marked, since we can avoid creating a closure on each loop.
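
The timed loop is just that idiom, roughly (a sketch; the string here is a stand-in, not the one from the benchmark):

local str = "one two three four five six seven eight nine ten eleven twelve"
local n = 0
for word in str:gmatch("%S+") do  -- beta-5.1 builds a fresh closure per str:gmatch() call;
  n = n + 1                       -- the modified version avoids that (see Note 1 below)
end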

beta-5.1: 9.02 seconds (1e6 repetitions)
modified: 8.18 seconds

saving: 9.3%

Note 1: Of course, you can't actually implement gmatch() without creating a closure, even with the suggested iteration protocol. However, you don't have to create a closure every time; you only need one per pattern. The sample implementation provides the API string.gmatcher(pattern), which returns such a closure; the result can be passed to string.gmatch() in place of the pattern argument. (It also makes it possible to give string.gmatch() a starting position in the string.) I didn't spend a lot of time thinking about the API, but the results can be inspected (see below).
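
Roughly, the intended usage is something like this (a sketch of the API as described above; the example data is made up):

local split = string.gmatcher("%S+")      -- one closure per pattern, built once
local lines = { "alpha beta gamma", "delta epsilon" }
for i = 1, #lines do
  for word in lines[i]:gmatch(split) do   -- pass the matcher in place of the pattern
    print(word)
  end
end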

Note 2: The above timings are done with os.clock(), but I also timed the complete benchmark of a million iterations of each of the above tests. The results:

beta-5.1: 16.12 seconds (user)
modified: 14.84 seconds

saving: 7.9%

3) A simple(ish) implementation of insertion-ordered multiple-value tables, the sort of data structure you might use for storing HTTP headers or LDAP entries. This is a fairly complex piece of code, so it ought to give some indication of the impact on a real program. The two implementations differ only in the iterator: the 5.1-beta iterator has to create a closure for each iteration, while the modified-lua iterator does not; a single integer is enough to record the iteration state.
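
For concreteness, the 5.1-beta iterator is shaped roughly like this (a simplified sketch with invented field names; the real code is in test/benchmark.lua):

-- assume entries are kept in insertion order in the array part
-- as {key=..., value=...} records
local function entries(t)
  local i = 0
  return function()               -- a fresh closure for every traversal
    i = i + 1
    local e = t[i]
    if e then return e.key, e.value end
  end
end

-- for k, v in entries(headers) do ... end
-- With the modified protocol the integer index can be carried in the
-- iteration state instead, so one shared iterator function serves every
-- traversal and no closure has to be allocated.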

The test:

a) converts an LDIF record (taken from the RFC) into an insert-ordered table; the record has 12 entries and nine distinct keys;
b) deletes all entries (three of them) sharing a given key;
c) adds two new entries with that same key;
d) makes a shallow copy (which iterates over all key/value pairs);
e) iterates over all values for each of three keys (a total of six key/value pairs).

The results are a bit odd, so I'll just report them. There are two measures: the first is the time reported by os.clock(); the second is the total time to run the benchmark as reported by /usr/bin/time (which therefore includes parsing, test setup, and, most importantly, closing the lua_State at the end of the run, which garbage-collects everything). The sample was executed 100,000 times:

           (os.clock)   (/usr/bin/time)
beta-5.1:    27.78         37.47
modified:    26.32         25.12  (!)

saving:       5.2%         33%

There is a highly noticeable gc pause at the end of the beta-5.1 run, so I presume that this is the result of garbage collecting all the closures which were created. On the other hand, each iteration of the test allocates about 10 tables, which you would think would add up quite a bit, too; furthermore, the memory usage reported by the rusage structure is not significantly different between the two tests. My guess is that closures are small but gc-intensive.