lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On 13/12/2012 08:21, Richter, Jörg wrote:
Strings are stored in a hash table but the string hash value is reduced
modulo the size of the hash table and so it helps to compare string hash
values for colliding strings before resorting to memcmp. See
	http://www.lua.org/source/5.2/lstring.c.html#luaS_newlstr

After reading this I applied the same optimization to my hash table
implementation.
Are you sure this is worthwhile? My unrepresentative performance tests are
somewhat slower (2-3%) than without this extra compare.

But I have no performance tests for Lua to double check.

Perhaps anyone can run their tests with "h == ts->tsv.hash &&" commented out.

    Jörg

So, here are some measures. One issue is: what to choose as supposedly representative input data? In particular, the proportion of equal strings (ones, which are actually stored in the pool, either because we mean to look them up or by chance) determines the number of petential memcmp's and thus the possible influence of checking the hash before. So, I decided to feed the pool with random strings of fix length=7 (so that len check does not pre-filter, but it's unreal) among an alphabet of varying number of letters. Which gives:

without hash check:

#strings:1000000  #letters:6  %equal:27
time  : 0.450

#strings:1000000  #letters:7  %equal:58
time  : 0.470

#strings:1000000  #letters:8  %equal:80
time  : 0.480

with hash check:

#strings:1000000  #letters:6  %equal:27
time  : 0.450

#strings:1000000  #letters:7  %equal:58
time  : 0.470

#strings:1000000  #letters:8  %equal:80
time  : 0.490

%equal is the proportion of equal strings. Numbers are fairly constant among different runs. Time is CPU time (just diff in clock()). So, about no influence at all in this test case. At best (or worst) I get ~ 2% in disfavour of checking the hash with high proportions of equal strings. Conclusion: in absence of more info (such as the reason why it's there in Lua), let it down.

However, I initially placed the hash in the string struct (as is done in Lua, a point I stepped on by chance and was the reason for my question) for another reason: i keep the hash anyway to avoid re-hashing when the pool grows, so why not put it in the string instead of on the cell struct (node)? If ever useful, it's there, while someone having access to the string but not to the cell would just not have it...

Denis