Suppose I compile the Lua interpreter to work with a single application on a specific target hardware system. The interpreter can then be tuned for that application and that system, for example by adjusting the C compiler flags or Lua's C defines. I've been trying this and got good results, so I would like to share what I did and ask if anyone has more ideas.
1. Setting a baseline.
My baseline is the Lua 5.4 interpreter built with just "-O2" in CFLAGS,
compiled with GCC 10, targeting x86_64 on a Linux system, with no custom Lua defines
and the standard garbage collector configuration. I then calculate the average runtime over 10 runs of my application.
Results: 1.071 seconds.
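The averaging in step 1 can be scripted along these lines. This is only a sketch: the helper name `avg_runtime` and the example paths are made up for illustration, and it assumes GNU `date` (for `%N`, nanoseconds), which is the norm on Linux.

```shell
# avg_runtime N CMD [ARGS...]
# Runs CMD N times and prints the mean wall-clock time in seconds.
# Hypothetical helper; assumes GNU date, whose %N prints nanoseconds.
# The body runs in a subshell so its variables do not leak.
avg_runtime() (
  runs=$1
  shift
  total_ns=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    total_ns=$((total_ns + end - start))
    i=$((i + 1))
  done
  # The sum is in nanoseconds; divide by the run count and 1e9 to get seconds.
  awk -v t="$total_ns" -v n="$runs" 'BEGIN { printf "%.3f\n", t / n / 1e9 }'
)

# Example with hypothetical paths:
# avg_runtime 10 ./lua myapp.lua
```

The measurement includes process startup, which is usually what matters for a short-lived command-line program.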
2. Compile just "onelua.c".
Compiling Lua as a single big C source file allows the compiler to perform more optimizations.
Results: 1.028 seconds. (-4% from baseline)
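For reference, a single-file build can look like the following. It is a sketch rather than the exact command behind the numbers above: it assumes the current directory is a Lua 5.4 source tree that contains `onelua.c`, and the function name `build_onelua` is made up here. `-std=gnu99`, `-DLUA_USE_LINUX` and `-ldl` follow the Linux target of Lua's own makefile.

```shell
# Hypothetical helper: build the stand-alone interpreter from one translation unit.
# Run it from a Lua 5.4 source directory that contains onelua.c.
build_onelua() {
  gcc -std=gnu99 -O2 -DLUA_USE_LINUX -o lua onelua.c -lm -ldl
}

# Usage:
# build_onelua && ./lua -v
```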
3. Add `-fno-plt -fno-stack-protector -flto` to CFLAGS.
These are optimization flags that I thought made sense to try with Lua.
Results: 1.004 seconds. (-6.3% from baseline)
4. Use a custom memory allocator.
There are many memory allocators out there, such as tcmalloc, jemalloc, and mimalloc;
you can usually just link them in to replace the system's malloc. Of the ones I tried,
Lua worked best with mimalloc, so I added the flag '-lmimalloc'.
Results: 0.864 seconds. (-19.3% from baseline)
PS: Interestingly, mimalloc improved the runtime by a good margin.
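A sketch of the two usual ways to bring in mimalloc on Linux: linking it into the binary, as in step 4, or preloading it into an unmodified binary. The function names, the library path, and `myapp.lua` are placeholders; the library path in particular depends on the distribution.

```shell
# Option A: link mimalloc so its malloc/free replace the libc ones.
# Hypothetical helper; run it from a Lua 5.4 source directory with onelua.c.
build_lua_mimalloc() {
  gcc -std=gnu99 -O2 -fno-plt -fno-stack-protector -flto -DLUA_USE_LINUX \
      -o lua onelua.c -lm -ldl -lmimalloc
}

# Option B: leave the binary alone and preload the shared library.
# MIMALLOC_LIB is an assumed location; adjust it to where libmimalloc.so is installed.
MIMALLOC_LIB=${MIMALLOC_LIB:-/usr/lib/libmimalloc.so}
run_with_mimalloc() {
  LD_PRELOAD="$MIMALLOC_LIB" "$@"
}

# Usage (hypothetical script name):
# run_with_mimalloc ./lua myapp.lua
```

Setting `MIMALLOC_VERBOSE=1` in the environment makes mimalloc print diagnostics, which is a quick way to confirm that it is really in use.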
5. Make the GC less aggressive.
My application creates a large number of tables (it is a compiler and creates hundreds of AST nodes), so the GC keeps kicking in too early. I use the following to make
the GC less aggressive:
Results: 0.777 seconds. (-27.5% from baseline)
PS: I know this consumes much more memory, but I don't care in my use case.
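As a sketch of what a less aggressive setup can look like in Lua 5.4, the snippet below writes a small preamble that raises the incremental collector's pause. The file name `gc_tuning.lua`, the script name, and the value 400 are illustrative assumptions, not the settings behind the result above.

```shell
# Write a Lua preamble that relaxes the incremental collector.
# Illustrative values only: pause 400 means a new cycle starts once memory
# in use reaches about 4x what was in use after the previous collection
# (the Lua 5.4 default is 200, i.e. 2x). The zeros leave the step
# multiplier and step size unchanged.
cat > gc_tuning.lua <<'EOF'
collectgarbage("incremental", 400, 0, 0)
EOF

# Load the preamble before the application (hypothetical script name):
# ./lua -l gc_tuning myapp.lua
```

A short-lived process that can afford the memory could also call `collectgarbage("stop")` up front, or try `collectgarbage("generational")`; which one helps depends on the allocation pattern.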
Other things I've tried that made almost no difference:
* Adding -march=native to CFLAGS.
* Trying different values for LUAI_MAXSHORTLEN, STRCACHE_N, STRCACHE_M.
Note: The optimizations were stacked, that is, 5 is really 1+2+3+4+5.
Does anyone have other ideas on how to tune the Lua interpreter to squeeze more performance?