On Tue, Mar 13, 2012 at 2:53 PM, Hisham <h@hisham.hm> wrote:
> We get conflicting instructions here:
>
> Rob says to use -fPIC everywhere...
>
> On Tue, Mar 13, 2012 at 10:35 AM, Rob Kendrick <rjek@rjek.com> wrote:
>> I'm mildly astonished that this bug is still there.  -fPIC is required
>> when building a shared object, regardless of CPU architecture.
>
> ...while Jay says to use it only where needed.
>
>> On Tue, Mar 13, 2012 at 3:02 PM, Jay Carlson <nop@nop.com> wrote:
>>> As a side note, please do not write -fPIC until you get a "fails to
>>> build" bug report on some architecture (probably sparc). -fPIC can be
>>> drastically less efficient than -fpic

I should be more clear. I am talking about the difference between
-fpic and -fPIC. Both generate position-independent code. -fpic may
have an architecture-specific limit on the number of symbols in a
shared object; -fPIC does not, at the cost of using more instructions
on those architectures. This is only significant on some platforms;
the ones you'll run into are ppc, sparc, and m68k. The performance
section of Sun's Solaris 11 Linker and Libraries Guide suggests it's
not deadly to just do everything in the larger -fPIC model.
http://docs.oracle.com/cd/E19963-01/html/819-0690/chapter4-10454.html#LLMindexterm-410
shows how loading the address of a word differs between the two.
With normal -fpic:

    ld    [%l7 + j], %o0    ! load &j into %o0

Note that this works because the symbol "j" is really the constant
offset of the address of j from the center of the Global Offset Table,
pointed to by %l7. There is a SPARC addressing mode allowing loads of
registers with a 13-bit immediate offset from an address in a
register. However, if we have lots of entries, the linker cannot place
them all within the range of those short immediate offsets. In that
case, we need -fPIC: use a longer sequence to load a 32-bit offset,
then address it from %l7:

    sethi %hi(j), %g1
    or    %g1, %lo(j), %g1    ! get 32-bit constant GOT offset
    ld    [%l7 + %g1], %o0    ! load &j into %o0

As usual in RISC architectures, loading an arbitrary 32-bit value into
%g1 takes two instructions.

SPARC has the lowest limit, at 2048 symbols in 32-bit mode. That's why
I say "until you get a bug report from a sparc user, stick to -fpic
instead of -fPIC".
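
A quick sketch of where a number like 2048 comes from. This is my own
arithmetic, not something quoted from the Sun guide: I am assuming the
13-bit immediate is signed (so it spans 2^13 bytes around the register)
and that each GOT entry is 4 bytes in 32-bit mode. The function name is
made up for illustration.

```c
/* Number of GOT slots reachable with a single load that uses a signed
 * immediate displacement of imm_bits bits, when each slot is
 * slot_bytes wide. Assumption: the displacement is signed, so it
 * covers 2^imm_bits bytes in total, centered on the base register. */
unsigned long got_slots_reachable(unsigned imm_bits, unsigned slot_bytes)
{
    unsigned long span_bytes = 1UL << imm_bits;  /* total bytes addressable */
    return span_bytes / slot_bytes;              /* whole slots in that span */
}
```

With imm_bits = 13 and slot_bytes = 4 this gives 8192 / 4 = 2048
entries, which matches the limit above.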

> I used to follow Jay's advice and LuaRocks did only add -fPIC to
> CFLAGS when building specific architectures (x86_64, in that case).
> But then people started complaining and I just changed it to all
> architectures. I guess having it everywhere at least ensures it works
> (even if at sub-optimal performance) without having to check whether
> every possible architecture needs it (and that's not a matter of
> laziness -- I don't know every possible arch LR can run on, and I'd
> rather have it working out of the box on them than waiting for error
> reports and adding -fPIC on a case-by-case basis).

Every ELF SVR4 ABI architecture strongly encourages
position-independent code for dynamic objects, and as we know some
require it. i386 lets you get away with it, but at a price. Let's look
at the object code generated to implement "global_int++":

  ????: 83 05 00 00 00 00 01    addl    $0x1,0x0

global_int is not at location 0 of course. The assembler records a
relocation record describing how to fix up this code once the address
of global_int is known:

        ????+2: R_386_32    global_int

I've listed the address as ???? since the absolute address of this
instruction is not known until it is dynamically loaded. The same is
true of function calls as well:

  ????: e8 fc ff ff ff    call    ????
        ????+1: R_386_PC32    global_func

Normally executable code is marked read-only. This has a number of
benefits, both for security and because the code can be shared between
all processes using it. But when non-position-independent code like the
above is loaded, the dynamic linker must make each page containing a
relocation writable, and then apply the relocations. This means every
process will have its own copy of those pages. PIC code places all
addresses into the data segment instead:

    ????: 8b 81 00 00 00 00    mov    0x0(%ecx),%eax
        ????+2: R_386_GOT32    global_int
    ????+6: 83 00 01           addl   $0x1,(%eax)

where the value filled in for R_386_GOT32 is known when foo.so is
created: a constant offset into the Global Offset Table, whose address
is held in %ecx. The dynamic loader does not need to modify this
executable code sequence at runtime because the compiler/assembler
have introduced one level of indirection.
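
If you want to reproduce listings like the ones above, something along
these lines should do. The file name foo.c and the function name bump
are my inventions for illustration; global_int and global_func are the
names used in the listings. The exact bytes and registers you see will
depend on your compiler version and optimization level, so treat the
commands in the comment as a starting point rather than a recipe.

```c
/* foo.c (hypothetical): the source behind "global_int++" and a call
 * to global_func.
 *
 * Assumed commands to compare the two forms on an i386 target:
 *   gcc -m32 -O2 -c foo.c          non-PIC: expect R_386_32 / R_386_PC32
 *   gcc -m32 -O2 -fpic -c foo.c    PIC: expect GOT-based relocations
 *   objdump -dr foo.o              disassemble with relocation records
 * Some toolchains default to PIE; there -fno-pie may be needed to get
 * the non-PIC form. */
int global_int = 0;

void global_func(void)
{
    /* empty on purpose; only the call site matters here */
}

void bump(void)
{
    global_int++;    /* the read-modify-write shown in the listings */
    global_func();   /* the call shown in the listings */
}
```

Both globals have default visibility, so with -fpic the compiler
still goes through the GOT for global_int even though it is defined in
the same file.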

How did the address of the GOT get into %ecx in the first place? Well,
the function prologue has to put it there, and because i386 has no way
of using the program counter in arithmetic, it's a little painful.
x86_64 does not have this problem, although position-independent code
still has other indirection costs.

So where does that leave LuaRocks? Many of the storage-sharing
arguments for the read-only text segments PIC allows are not really
relevant if your total executable code is, say, 8k. Ulrich Drepper's
generally excellent "How To Write Shared Libraries" just says "don't
write little DSOs" but Lua often has them as glue code. There are
other benefits from building DSOs as PIC, but one of the biggest is
consistency with all the other dynamically loaded objects on the
system. If you don't do it, people like me will say "of *course* they
should be PIC" and file bug reports even if the precise benefits in
the case of glue DSOs are unclear.

My advice is to put -fpic into CFLAGS for all loadable objects on all
ELF platforms. If somebody hits a case where -fPIC is necessary
instead, deal with it on a package-by-package basis.

Jay