Re: so performance

lua-l archive


Subject: Re: so performance
From: Mike Pall <mikelu-0507@...>
Date: Mon, 25 Jul 2005 16:37:51 +0200

Hi,

David Burgess wrote:
> Mike mentions that .so performance is less than the statically
> linked version. Knowing more about Win32 than Unix can
> you tell me why the performance hit with .so? With Win32 once
> the address resolution is performed there is no significant
> difference. I am curious.

Win32 DLLs are not position independent. As you mention,
they need to be relocated during loading (unless the fixed
base address a DLL was built with happens to be unused in the
loading process). Relocations to the same address can be shared,
but if different processes have conflicting memory layouts,
the loader needs to generate multiple copies of the DLL in
memory. This is the reason why some IDEs try very hard to
give all DLLs non-overlapping addresses. All the standard
Windows DLLs live at different addresses, too.

ELF .so-files (commonly used in Linux, BSD, Solaris, ...)
usually contain position independent code (PIC). This means
they do not need to be relocated during loading (except for
some glue code and data->data pointers). All code pages
can be shared amongst all processes using the same library
and can be paged to/from the filesystem (and not the swap).

But ... and here comes the catch: x86 code was never really
meant to be position-independent. While all jumps and
calls are instruction-pointer-relative, there is no simple
(or fast) way to address data relative to the current code
location. This means that all references to static globals
must go through an indirection. So one of the registers (EBX)
is set up to contain a GOT (global offset table) reference
and all static data references must go through that.
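
To make the indirection concrete, here is a minimal sketch of a function
that reads a file-scope global in a loop. The names counter_table and
sum_table are made up for this illustration and do not come from the Lua
sources. Assuming a GCC toolchain with 32-bit support, compiling it with
"gcc -m32 -O2 -S" once with and once without "-fPIC" and comparing the
assembly should show the difference: the PIC build typically has to fetch
the current instruction pointer, keep a GOT base in a register and load
the table's address through the GOT, while the non-PIC build can use the
table's absolute address directly. The exact output depends on the
compiler version.

```c
#include <stddef.h>

/* Hypothetical file-scope global. In PIC code on 32-bit x86, accesses
   to it are typically routed through the GOT. */
int counter_table[4] = {1, 2, 3, 4};

/* Sums the first n entries of counter_table (n is clamped to 4).
   The loop is the kind of spot where a register tied up as GOT base
   can hurt on x86. */
int sum_table(size_t n)
{
    int sum = 0;
    size_t i;

    if (n > 4)
        n = 4;
    for (i = 0; i < n; i++)
        sum += counter_table[i];
    return sum;
}
```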

The real bad news is that you lose one general-purpose
register. And this on the already register-starved x86!

Modern compilers try all tricks in the bag to avoid that
(global code analysis), but often enough they can't.
This generates much worse code and often in the spots
where you'd need it most (such as in the inner loop
of a bytecode interpreter :-/ ).

[The recent addition of
   #define LUAI_FUNC __attribute__((visibility("hidden")))
to Lua 5.1 (work6) helps the compiler a bit with code analysis,
in case you were wondering.]
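
As a rough sketch of how such a macro is used: internal helpers get
hidden visibility, and only the public entry points stay exported from
the shared library. A hidden symbol cannot be overridden from outside
the library, so the compiler may call it directly instead of going
through the usual PIC call path. DEMO_INTERNAL, demo_helper and demo_api
are hypothetical names for this illustration, not Lua's; the guard
assumes a GCC-compatible compiler targeting ELF.

```c
/* Hypothetical macro in the spirit of LUAI_FUNC: hide internal symbols
   when the compiler and object format support it. */
#if defined(__GNUC__) && defined(__ELF__)
#define DEMO_INTERNAL __attribute__((visibility("hidden")))
#else
#define DEMO_INTERNAL
#endif

/* Internal helper: not exported from the shared library. */
DEMO_INTERNAL int demo_helper(int x);

/* Public entry point: default visibility, exported as usual. */
int demo_api(int x);

DEMO_INTERNAL int demo_helper(int x)
{
    return x * 2;
}

int demo_api(int x)
{
    /* The call target is known to live in this library. */
    return demo_helper(x) + 1;
}
```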

So ... this is the reason why you probably should avoid
compiling high-performance code as PIC ELF libraries on
x86 CPUs.

You can either link it statically into the (non-PIC) executable
or generate non-PIC ELF libraries. They will still load,
but some systems warn you about it. There have been heated
debates (elsewhere) about policies forcing this. I won't
go into that here (again).

This is also the reason why both Perl and Python compile
all of their core code into the executable _and_ provide a
shared library (with the same code, but PIC compiled).

We have already discussed this at length in the 'LuaBinaries'
thread on the list some time ago. The consensus was that
the Lua core should be compiled into an executable and
not into a shared library. Except for Win32, where it makes
most sense to compile it as a DLL (which has no negative
side-effects).

Side-note: x64 (aka x86_64/AMD64/EM64T) fixes all this and
has good support for PIC. Amusingly, the addressing mode
is called 'RIP-relative' (RIP being the 64-bit instruction
pointer register).

Bye,
     Mike
