lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


"Jerome Vuarand" <jerome.vuarand@ubisoft.com> writes:

> David Kastrup wrote::
>> Rici Lake <lua@ricilake.net> writes:
>>> and a few probes into the intern table all of which will quickly
>>> return "unequal". However, copying a million bytes (say) is time
>>> consuming.
>> 
>> I doubt that million byte strings are the norm.  And I doubt that
>> you'll _ever_ be in the situation that million-byte strings will be
>> used for comparison or indexing, and thus the whole hashing operation
>> is a waste of time.   
>
> Maybe Rici won't _ever_ do that, but in fact other people do, at least
> did I once. Here is the pseudo-code I used:
>
> local file = io.open("foo.txt", "r"):
> local input = file:read("*a")
> file:close()
>
> output = foo(input)
>
> -- Only rewrite file if it changed
> if input~=output then
>   file = io.open("foo.txt", "w")
>   file:write(output)
>   file:close()
> end

If Lua had not hashed the strings and instead directly compared them,
your code would have run quite faster.

Where hashing is a win is when
a) a comparison comparing equal is repeatedly made but not
remembered, and _both_ strings might be used for comparison again.
Note: even if you do

local file = io.open("foo.txt", "r"):
local input = file:read("*a")
file:close()

for i=1,200 do
   output = foo(input, i)
   if input~=output then
     file = io.open("foo.txt", "w")
     file:write(output)
     file:close()
   end
end

the cost of hashing does not pay off except in degenerate cases (like
when both strings have the same length and a long common prefix).  And
even then: if one suspected this usage pattern to be common, one could
just change the order of bytes in string comparison to something more
interleaved, basically starting with those bytes which would have
ended up in the hash, anyway.  But if you expect the file size to
remain the same and large portions of the file as well, the standard
hashing would have missed the changed portion quite likely, anyway.

Note that if a single string comparison turns out _true_ between a
hashed and an unhashed string, the hash code of the unhashed string is
immediately determined.

b) when the real benefit is not that some strings compare as equal,
but if a number of strings from a large mostly static set (like the
index set of a table) can be ruled as being unequal to the test string
in one go.  Table lookup, mostly.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum