Re: Towards a faster interpreter

lua-l archive


Subject: Re: Towards a faster interpreter
From: Sean Conner <sean@...>
Date: Thu, 8 Dec 2016 15:36:24 -0500

It was thus said that the Great Dibyendu Majumdar once stated:
> On 8 December 2016 at 18:46, Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Roberto Ierusalimschy once stated:
> >> > I have been thinking of ways of making the Ravi interpreter faster
> >> > without having to resort to assembly code etc. So far one of the areas
> >> > I have not really looked at is how to improve the bytecode decoding.
> >> > The main issue is that the operands B and C require an extra bit
> >> > compared to operand A, so we cannot make them all 8 bits...
> >>
> >> Given a 32- or 64-bit machine, why would decoding 8-bit operands be
> >> faster/better than 9-bit ones?
> >
> >   I would think less shifting, but not being 100% sure, I decided to test my
> > assumptions.  I wrote:
> >
> > void split8(unsigned int *dest,unsigned int op)
> > {
> >   dest[0] = (op >> 24) & 0xFF;
> >   dest[1] = (op >> 16) & 0xFF;
> >   dest[2] = (op >>  8) & 0xFF;
> >   dest[3] = (op      ) & 0xFF;
> > }
> 
> Perhaps one of the masks could be eliminated?

  Well, here's the assembly (gcc -O3 -fomit-frame-pointer) for the above:

   0:	89 f0                	mov    eax,esi
   2:	48 89 f2             	mov    rdx,rsi
   5:	c1 e8 18             	shr    eax,0x18
   8:	89 07                	mov    DWORD PTR [rdi],eax
   a:	89 f0                	mov    eax,esi
   c:	81 e6 ff 00 00 00    	and    esi,0xff
  12:	c1 e8 10             	shr    eax,0x10
  15:	89 77 0c             	mov    DWORD PTR [rdi+0xc],esi
  18:	25 ff 00 00 00       	and    eax,0xff
  1d:	89 47 04             	mov    DWORD PTR [rdi+0x4],eax
  20:	0f b6 c6             	movzx  eax,dh
  23:	89 47 08             	mov    DWORD PTR [rdi+0x8],eax
  26:	c3                   	ret    

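The compiler dropped the mask on op >> 24 (the shift already leaves only
eight bits) and folded the shift and mask for op >> 8 into a single movzx
from dh; in the source the masks are all there for clarity.  Using bit-fields
generated this code: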

   0:	40 0f b6 c6          	movzx  eax,sil
   4:	48 89 f2             	mov    rdx,rsi
   7:	89 07                	mov    DWORD PTR [rdi],eax
   9:	0f b6 c6             	movzx  eax,dh
   c:	89 47 04             	mov    DWORD PTR [rdi+0x4],eax
   f:	89 f0                	mov    eax,esi
  11:	c1 ee 18             	shr    esi,0x18
  14:	c1 e8 10             	shr    eax,0x10
  17:	89 77 0c             	mov    DWORD PTR [rdi+0xc],esi
  1a:	0f b6 c0             	movzx  eax,al
  1d:	89 47 08             	mov    DWORD PTR [rdi+0x8],eax
  20:	c3                   	ret    

I suspect this is optimal (given the x86-64 calling convention), and I'm
having a hard time seeing any wasted instructions here.  The shift-and-mask
version is only one instruction longer, so I would say use whichever one you
think is clearer.
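
The bit-field source is not shown above.  A minimal sketch of what such a
version might look like follows; the names op8, a through d, and split8_bf
are hypothetical, and the order in which bit-fields are packed into the word
is implementation-defined, so treat this as an illustration rather than the
exact code that produced the listing.

```c
#include <string.h>

/* Hypothetical layout: four 8-bit fields packed into one 32-bit word.
   Which field lands in the low byte is implementation-defined. */
struct op8 {
  unsigned int a : 8;
  unsigned int b : 8;
  unsigned int c : 8;
  unsigned int d : 8;
};

/* The copy below is only valid if the struct is exactly one word wide. */
_Static_assert(sizeof(struct op8) == sizeof(unsigned int),
               "op8 must be the same size as unsigned int");

void split8_bf(unsigned int *dest, unsigned int op)
{
  struct op8 f;

  memcpy(&f, &op, sizeof f);  /* reinterpret the word as four fields */
  dest[0] = f.a;
  dest[1] = f.b;
  dest[2] = f.c;
  dest[3] = f.d;
}
```

With GCC on x86-64 the first declared field should occupy the low byte, so
dest[0] receives op & 0xFF, which matches the movzx eax,sil store in the
listing and is the reverse of the order split8() uses.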

  -spc
