Re: Squeezing more performance from the Lua interpreter

OK, I've now enabled those flags only for luaV_execute (I looked at how the Ravi sources do it):

```
#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: apply the two flags to this one function instead of globally */
__attribute__((optimize("no-crossjumping,no-gcse")))
#endif
void luaV_execute (lua_State *L, CallInfo *ci) {
```

Numbers are still good!
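
For anyone following along, here is a minimal sketch of what computed-goto dispatch looks like, with the same attribute applied. It is a toy accumulator machine, not Lua's VM: the opcodes and the name toy_execute are made up for illustration. As far as I understand it, cross-jumping can merge the identical dispatch tails of the handlers back into one shared indirect jump, and the GCC manual itself notes that disabling GCSE may help code that uses computed gotos.

```c
/* Hypothetical opcodes for a toy accumulator machine (not Lua's). */
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* GCC only: same function-level attribute as on luaV_execute. */
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-crossjumping,no-gcse")))
#endif
int toy_execute(const unsigned char *code) {
  /* One label address per opcode, indexed by opcode value
     (labels-as-values is a GCC extension, also accepted by clang). */
  static const void *const dispatch[] = {
    &&do_inc, &&do_dec, &&do_double, &&do_halt
  };
  int acc = 0;
  const unsigned char *pc = code;

  /* Every handler ends with its own indirect jump to the next handler. */
#define NEXT() goto *dispatch[*pc++]

  NEXT();
  do_inc:    acc += 1; NEXT();
  do_dec:    acc -= 1; NEXT();
  do_double: acc *= 2; NEXT();
  do_halt:   return acc;
#undef NEXT
}
```

The point of the layout is that each handler carries its own jump, so the branch predictor sees one indirect branch per opcode rather than a single shared one; the flags are there to keep the compiler from undoing that.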

About the allocators: I tried dlmalloc 2.8.6 and compared it with the others.

I did not use the mspace interface, though; I used the usual malloc/realloc/free interface.
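
To make "the usual malloc/realloc/free interface" concrete, here is a rough sketch of the kind of adapter involved. It has the same shape as Lua's lua_Alloc callback, but the name bundled_alloc is hypothetical, and it calls the C library's realloc/free as stand-ins; a bundled allocator's equivalents would go in their place.

```c
#include <stdlib.h>

/* Same parameters as Lua's lua_Alloc callback:
   ud = user data, ptr = existing block (or NULL),
   osize = old size, nsize = requested size. */
void *bundled_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
  (void)ud;     /* unused in this sketch */
  (void)osize;  /* unused in this sketch */
  if (nsize == 0) {
    free(ptr);  /* stand-in: the bundled allocator's free goes here */
    return NULL;
  }
  /* stand-in: the bundled allocator's realloc goes here;
     with ptr == NULL this behaves like malloc(nsize) */
  return realloc(ptr, nsize);
}
```

With Lua 5.4 a function like this would be passed as lua_newstate(bundled_alloc, NULL).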

The following numbers were measured under the same conditions as in the original post (optimizations 1+2+3+4+5), with "-fno-crossjumping -fno-gcse" additionally enabled in lvm.c:

dlmalloc 0.820s (-23.4% from baseline)
rpmalloc 0.760s (-29.0% from baseline)
mimalloc 0.747s (-30.3% from baseline)

Notice that mimalloc went from 0.777s (-27.7%) to 0.747s (-30.3%) due to the "-fno-crossjumping -fno-gcse" flags in lvm.c.
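
Taken on their own, the two mimalloc timings work out to roughly a 3.9% gain from the flags, which lines up with the ~4% I reported earlier. A small sketch of that arithmetic (the helper name percent_change is just for illustration):

```c
/* Relative change in percent; a negative result means faster. */
double percent_change(double before, double after) {
  return (after - before) / before * 100.0;
}
```

Calling percent_change(0.777, 0.747) gives the change between the two mimalloc runs.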

I will stick with rpmalloc because it performs better in my use case, is smaller and faster, and can still be bundled as a single C file, while mimalloc is a burden as a dependency.

Perhaps I could later optimize rpmalloc by removing its atomic operations and thread-locals, and the numbers could get even better.

Regards,

Eduardo Bart

On Fri, Aug 21, 2020 at 13:23, Dibyendu Majumdar <mobile@majumdar.org.uk> wrote:

On Fri, 21 Aug 2020 at 17:11, Eduardo Bart <edub4rt@gmail.com> wrote:
>
> Thanks for the tip!
>
> Seems like I got another ~4% improvement with those flags! (although I've enabled them globally)
> What are they doing in the interpreter? Better jumping when executing the Lua instructions?

I wouldn't enable them globally, as that disables some optimizations -
especially global common subexpression elimination - which is likely
bad for the code.
The only code that needs these options is the VM, which uses computed
goto - I think in 5.4 this is enabled for gcc.
If you look at the generated assembly output - without these flags the
code may not be using computed goto at all.

>
> A custom allocator also benefits me on Linux; the results I've shown were all on Linux.
> I've managed to bundle my Lua interpreter with this allocator https://github.com/mjansson/rpmalloc
> It seems to work fine, is much simpler to use (a single C file), performs the same as mimalloc,
> and is much faster than my system's standard malloc.
> I think it could be even faster if the thread safety was removed from that allocator.
> All the good C allocators I find out there are thread-safe, and I don't see much need for that in the Lua case.
>
> I've also experimented with porting LuaJIT's allocator to Lua 5.4 (it was quite easy to do),
> and it performed worse than mimalloc or rpmalloc for my use case.

You can use dlmalloc - which was used by LuaJIT (with some mods).
I use it in Ravi.
It has an arena mode (ONLY_MSPACES) - and you can disable all locking
too. So that makes it suited for Lua as a single-threaded allocator.
I used this instead of the LuaJIT version as it is well documented and
is just a drop-in.

Regards