sincos optimization - lua and luajit

lua-l archive

Subject: sincos optimization - lua and luajit
From: David Manura <dm.lua@...>
Date: Sat, 23 Jul 2011 22:25:29 -0400

There are a number of occasions (e.g. rotation transformations) where
both the sine and cosine of a number need to be computed together, and
it can be more efficient to do this as a single operation [1,2].  To
take a trivial benchmark,

#include <stdio.h>
#include <math.h>
/* static: a plain C99 "inline" emits no external definition and can fail to link */
static inline double f(double angle) {
	double s = sin(angle);
	double c = cos(angle);
	return s+c;
}
int main(void) {
	double i;
	double sum = 0;
	for (i=1; i<1e7; i++) { sum += f(i); }
	printf("%e\n", sum);
	return 0;
}

gcc 4.4.4 compiles that to an fsincos instruction (even under
"-ffast-math -msse2").  Intel C++ 2011 compiles it to a
___libm_sse2_sincos call.  MSVC++ 2010 compiles it to separate
__CIcos/__CIsin calls, to fsin/fcos under -fp:fast, or to
___libm_sse2_sin/___libm_sse2_cos under -arch:SSE2.

Here's what we get with LuaJIT2:

  local function f(angle)
    local s = math.sin(angle)
    local c = math.cos(angle)
    return s + c
  end
  local sum = 0
  for i=1,1e7 do sum = sum + f(i) end
  print(sum)

  0027 ------ LOOP ------------
  0028    num CONV   0025  num.int
  0029    num FPMATH 0028  sin
  0030    num FPMATH 0028  cos
  0031    num ADD    0030  0029
  0032  + num ADD    0031  0024
  0033  + int ADD    0025  +1
  0034 >  int LE     0033  +10000000
  0035    int PHI    0025  0033
  0036    num PHI    0024  0032
  ---- TRACE 1 mcode 352

  ->LOOP:
  b77d7fb0  xorps xmm6, xmm6
  b77d7fb3  cvtsi2sd xmm6, edi
  b77d7fb7  movsd [esp+0x8], xmm6
  b77d7fbd  fld qword [esp+0x8]
  b77d7fc1  fsin
  b77d7fc3  fstp qword [esp]
  b77d7fc6  movsd xmm5, [esp]
  b77d7fcb  fld qword [esp+0x8]
  b77d7fcf  fcos
  b77d7fd1  fstp qword [esp]
  b77d7fd4  movsd xmm6, [esp]
  b77d7fd9  addsd xmm6, xmm5
  b77d7fdd  addsd xmm7, xmm6
  b77d7fe1  add edi, +0x01
  b77d7fe4  cmp edi, 0x00989680
  b77d7fea  jle 0xb77d7fb0	->LOOP
  b77d7fec  jmp 0xb77d0014	->3
  ---- TRACE 1 stop -> loop

I suppose the optimizer could recognize adjacent sin/cos calls on the
same operand in the IR and merge them into a single fsincos.  If
compiling sincos to SSE2 instead, you might need a library like
http://gruntthepeon.free.fr/ssemath/ .
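To make the merging idea concrete, here is a rough peephole pass over
a made-up IR.  Everything named toy_* is hypothetical and only loosely
modelled on the dump above; it is not LuaJIT's actual IR layout or
optimizer API.  It only looks at directly adjacent instructions, which
is the simplest case:

```c
#include <stddef.h>

/* Toy opcodes.  TOY_SINCOS_S / TOY_SINCOS_C stand for "the sine (or
   cosine) result of one fused sincos operation". */
enum toy_op { TOY_CONV, TOY_SIN, TOY_COS, TOY_SINCOS_S, TOY_SINCOS_C, TOY_ADD };

/* One toy instruction: an opcode plus the index of its operand. */
struct toy_ins { enum toy_op op; int arg; };

/* Rewrite adjacent sin/cos pairs that share an operand into the two
   halves of a fused sincos.  Returns the number of pairs merged. */
static size_t toy_merge_sincos(struct toy_ins *ir, size_t n) {
    size_t merged = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        struct toy_ins *a = &ir[i], *b = &ir[i + 1];
        int pair = (a->op == TOY_SIN && b->op == TOY_COS) ||
                   (a->op == TOY_COS && b->op == TOY_SIN);
        if (pair && a->arg == b->arg) {
            a->op = (a->op == TOY_SIN) ? TOY_SINCOS_S : TOY_SINCOS_C;
            b->op = (b->op == TOY_SIN) ? TOY_SINCOS_S : TOY_SINCOS_C;
            merged++;
            i++;  /* skip the second half of the pair */
        }
    }
    return merged;
}
```

In the trace above, 0029 (sin) and 0030 (cos) both take 0028 as their
operand, so a pass along these lines would treat them as one pair.  A
real optimizer would also have to handle non-adjacent pairs and teach
the backend to emit the fused instruction.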

None of this seems to make a whole lot of difference, though.
___libm_sse2_sincos is actually a little slower than fsincos here,
and the speedup over the separate fsin/fcos instructions is only
maybe 30%, but it depends on your library implementation and its
accuracy level.  It may make a bit more difference in standard Lua,
and the lqd binding has a sincos function [3].  Even Lua has the
somewhat related math.atan2, though not for the same reasons.  Here's
an example of sincos added to lmathlib.c:

  static int math_sincos (lua_State *L) {
    lua_Number x = luaL_checknumber(L, 1);
    lua_pushnumber(L, l_tg(sin)(x));
    lua_pushnumber(L, l_tg(cos)(x));
    return 2;
  }

[1] http://linux.die.net/man/3/sincos
[2] http://stackoverflow.com/questions/2683588/what-is-the-fastest-way-to-compute-sin-and-cos-together
[3] http://lua-users.org/lists/lua-l/2009-04/msg00143.html

Follow-Ups:
- Re: sincos optimization - lua and luajit, Dimiter "malkia" Stanev
- Re: sincos optimization - lua and luajit, Dirk Laurie
- Re: sincos optimization - lua and luajit, Mike Pall
