OK, I've enabled those flags only for luaV_execute (I looked at how the Ravi sources do it):
```
#if defined(__GNUC__) && !defined(__clang__)
__attribute((optimize("no-crossjumping,no-gcse")))
#endif
void luaV_execute (lua_State *L, CallInfo *ci) {
```
Numbers are still good!
About the allocators: I've tried dlmalloc 2.8.6 and compared it with the others.
I did not use the MSPACE interface though; I used the usual malloc/realloc/free interface.
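
To make "the usual malloc/realloc/free interface" concrete, here is a minimal sketch of an allocator callback with the same shape as Lua's lua_Alloc. The names plain_alloc and plain_alloc_selfcheck are hypothetical, and the standard realloc/free stand in for whichever allocator is being tested (for example dlrealloc/dlfree, rprealloc/rpfree or mi_realloc/mi_free); this is an assumption about how such a hookup typically looks, not the exact code used for the numbers below.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as Lua's lua_Alloc callback:
   void *(*)(void *ud, void *ptr, size_t osize, size_t nsize). */
static void *plain_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
  (void)ud; (void)osize;        /* not needed by a malloc-style allocator */
  if (nsize == 0) {             /* Lua requests a free by passing nsize == 0 */
    free(ptr);                  /* assumption: swap in the allocator's free */
    return NULL;
  }
  return realloc(ptr, nsize);   /* assumption: swap in the allocator's realloc */
}

/* Hypothetical self-check: allocate, grow, verify contents, then free. */
static int plain_alloc_selfcheck(void) {
  char *p = plain_alloc(NULL, NULL, 0, 8);
  if (p == NULL) return 1;
  memcpy(p, "lua", 4);                      /* copies the terminating NUL too */
  p = plain_alloc(NULL, p, 8, 64);          /* grow the block */
  if (p == NULL) return 2;
  assert(strcmp(p, "lua") == 0);            /* contents survive the realloc */
  return plain_alloc(NULL, p, 64, 0) == NULL ? 0 : 3;
}
```

With the real Lua headers, a function like this would be handed to the interpreter at state creation, e.g. `lua_newstate(plain_alloc, NULL)` in Lua 5.4.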
The following numbers are under the same conditions as the original post
(1+2+3+4+5 optimizations), with "-fno-crossjumping -fno-gcse" also enabled in lvm.c:
dlmalloc 0.820s (-23.4% from baseline)
rpmalloc 0.760s (-29.0% from baseline)
mimalloc 0.747s (-30.3% from baseline)
Notice that mimalloc went from 0.777s (-27.7%) to 0.747s (-30.3%) thanks to the "-fno-crossjumping -fno-gcse" flags in lvm.c.
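
For reference, the step between two timings can be expressed as a relative change. The helper below is a small sketch (percent_change is a hypothetical name, not something from the benchmark code); applied to the mimalloc timings above, 0.777s to 0.747s, it gives roughly -3.9%, which is the gain from the flags alone, measured against the previous mimalloc run rather than against the baseline.

```c
#include <assert.h>

/* Relative change in percent when going from `before` to `after`.
   A negative result means the run got faster. */
static double percent_change(double before, double after) {
  assert(before > 0.0);   /* a timing of zero or less makes no sense here */
  return (after - before) / before * 100.0;
}
```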
I will stick with rpmalloc because it performs better in my use case, is smaller and faster, and can still be bundled in a single C file,
while mimalloc is a burden as a dependency.
Perhaps I could optimize rpmalloc later by removing its atomic operations and thread locals, and the numbers could get even better.
Regards,
Eduardo Bart