
It was thus said that the Great Dibyendu Majumdar once stated:
> It seems from the majority of responses so far that JIT is not
> essential for Lua, or putting it another way, a majority of use cases
> can be satisfied without a JIT.
> 
> A natural follow-on question: can the interpreter be made faster
> without resorting to hand-written assembly code or other esoteric
> optimisations?
> 
> Apart from using type annotations and introducing the numeric array
> types in Ravi - which, when used, do improve interpreter speed - I
> have also been experimenting with various other changes.
> 
> 1. If the Lua stack were fixed in size, could that lead to more
> optimised code? We would need to make sure the compiler knows about
> this fact.
> 2. Can the table get/set be optimised for integer and string access -
> preferably inlined? I have been playing with the idea of using a
> precomputed 'hashmask', similar to LuaJIT's, to reduce the
> instructions needed to locate a hashed value.
> 3. What if we give hints to the compiler using the gcc/clang branch
> weights feature to help the optimizer generate better code for the
> more common scenarios? This is a trade-off as making common cases
> faster can penalize other cases.
> 4. Faster C function calls when the C function does not depend on Lua
> and takes 1-2 primitive arguments and returns a primitive result.
> 5. If we use a memory allocator such as jemalloc will that help?
> 6. I have not considered computed gotos because I am not sure of
> their benefits, especially for Ravi, where the number of bytecode
> instructions is larger than in Lua. They may also lead to worse
> optimisation in other areas.
> 
> I have nothing to report yet as the results from my experiments are
> inconsistent across platforms, and I really haven't had the time to do
> this seriously.
> 
> I would welcome any thoughts on this topic.
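
  Some quick sketches of what a few of those points might look like,
before the real advice.  For #2, the LuaJIT-style trick, as I
understand it, is to keep the hash part a power of two in size and
store the mask, so finding a chain head is a single AND instead of a
modulus.  A rough sketch (the structures and the 'hmask' field are
made up for illustration; stock Lua keeps log2 of the size in
'lsizenode' instead):

    #include <stdint.h>

    /* minimal stand-ins for the real table structures */
    typedef struct Node {
      int64_t key;
      int64_t val;
      struct Node *next;      /* hash chain */
    } Node;

    typedef struct Table {
      Node *node;             /* hash part; slot count is a power of two */
      uint32_t hmask;         /* (number of hash slots) - 1 */
    } Table;

    /* chain head for an integer key: one AND, no modulus */
    static inline Node *mainposition_int (const Table *t, int64_t key) {
      return &t->node[(uint32_t)key & t->hmask];
    }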
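
  For #3, the usual pattern with gcc/clang is a pair of macros over
__builtin_expect, so the optimizer lays the rare case out off the hot
path.  Another sketch (stack_needs_grow() and grow_stack() are
hypothetical names, just for the example):

    #if defined(__GNUC__) || defined(__clang__)
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)
    #else
    #define LIKELY(x)   (x)
    #define UNLIKELY(x) (x)
    #endif

    typedef struct lua_State lua_State;
    extern int  stack_needs_grow (lua_State *L);   /* hypothetical */
    extern void grow_stack       (lua_State *L);   /* hypothetical */

    void check_stack (lua_State *L) {
      if (UNLIKELY(stack_needs_grow(L)))   /* hinted as the rare case */
        grow_stack(L);
    }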
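
  For #5, there's no need to patch Lua to try it; the allocator is
pluggable via lua_newstate().  The allocation function below is
essentially the one from the reference manual; link against jemalloc
(which on most systems takes over malloc/realloc/free once linked in)
and every Lua allocation goes through it:

    #include <stdlib.h>
    #include <lua.h>

    static void *l_alloc (void *ud, void *ptr, size_t osize, size_t nsize) {
      (void)ud; (void)osize;       /* unused here */
      if (nsize == 0) {            /* a free request */
        free(ptr);
        return NULL;
      }
      return realloc(ptr, nsize);
    }

    lua_State *new_state (void) {
      return lua_newstate(l_alloc, NULL);
    }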
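
  And for #6, for what it's worth, the computed-goto pattern itself is
small (it's a gcc/clang extension).  The three "opcodes" below are made
up just to show the dispatch, which replaces the switch with one
indirect goto per instruction:

    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    static int run (const unsigned char *code) {
      /* one label per opcode, indexed by opcode value */
      static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
      int acc = 0;
    #define NEXT() goto *dispatch[*code++]
      NEXT();
    op_inc:  acc++; NEXT();
    op_dec:  acc--; NEXT();
    op_halt: return acc;
    #undef NEXT
    }

    int main (void) {
      static const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
      printf("%d\n", run(prog));   /* prints 1 */
      return 0;
    }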

  Profile, profile, profile.  
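
  (For anyone who hasn't done this: recompile and relink with -pg, run
the program, then point gprof at the binary and the gmon.out file the
run leaves behind.  That's what produced the flat profile below.)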

  As I've mentioned before, at work I use Lua to process SIP messages.
Part of this is sending requests to a backend process using a custom
protocol that includes a CRC check.  I recompiled the project with
profiling enabled and reran the regression test (something on the order
of 10,000 tests---it takes a few minutes to run).  The regression test
generates quite a bit of network traffic, so there's a lot of CRC
computation going on.

  Well, I have the results (compiled with "-DNDEBUG -O3 -pg"):

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 14.54      1.41     1.41  6618161     0.00     0.00  luaS_newlstr
 13.71      2.74     1.33   780921     0.00     0.00  luaV_execute
 11.96      3.90     1.16   281588     0.00     0.00  match
  9.95      4.87     0.97 18458447     0.00     0.00  luaH_get
  5.88      5.44     0.57  6295728     0.00     0.00  luaD_precall
  3.81      5.81     0.37  2242275     0.00     0.00  sweeplist
  2.78      6.08     0.27  4067464     0.00     0.00  newkey
  1.96      6.27     0.19 13042290     0.00     0.00  luaV_gettable
  1.86      6.45     0.18   699804     0.00     0.00  propagatemark
  1.55      6.60     0.15 10018396     0.00     0.00  luaM_realloc_
  1.55      6.75     0.15  1247028     0.00     0.00  resize
  1.49      6.89     0.15   855027     0.00     0.00  luaH_getn
  1.44      7.03     0.14   303380     0.00     0.00  pushcapture
  1.13      7.14     0.11   722487     0.00     0.00  str_format
  1.03      7.24     0.10  3040708     0.00     0.00  luaV_settable
  1.03      7.34     0.10  2401886     0.00     0.00  lua_rawgeti

  There's lots more, but it rapidly turns into noise.  The CRC function
shows up as the 345th most-called function (out of 435)---in other
words, a function I would have expected to be higher up the list is
*way* down there.

  But here, nothing really jumps out at me as saying "optimize here!"
Sure, there's luaS_newlstr and luaV_execute, but what could be done to
speed those up?  Is it even worth it?  So far the performance of the
code, which is in production handling phone calls, hasn't been an
issue.  (It's compiled with "-DNDEBUG -O2", by the way; I also ran a
default build, with no optimization flags and NDEBUG not defined, and
got similar profiling results.)
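
  If luaS_newlstr ever did become worth attacking, one cheap lever at
the C API level is to stop re-interning the same hot string over and
over: intern it once, keep a reference in the registry, and push that
instead.  A sketch (push_cached_string() is my name for it, not a Lua
API):

    #include <lua.h>
    #include <lauxlib.h>

    static int hot_ref = LUA_NOREF;

    static void push_cached_string (lua_State *L, const char *s) {
      if (hot_ref == LUA_NOREF) {
        lua_pushstring(L, s);                       /* interned once */
        hot_ref = luaL_ref(L, LUA_REGISTRYINDEX);   /* pops; keeps a ref */
      }
      lua_rawgeti(L, LUA_REGISTRYINDEX, hot_ref);   /* no string hashing */
    }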

  -spc