• Subject: Re: sincos optimization - lua and luajit
• From: "Dimiter \"malkia\" Stanev" <malkia@...>
• Date: Sat, 23 Jul 2011 20:08:10 -0700

I'd rather have them exposed as one function sincos() or cossin() returning two numbers.
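
A combined call can be cheaper because the two results share work. The C below is a minimal sketch of that idea. It defines a hypothetical `my_sincos`, which is not glibc's `sincos`, the x87 `fsincos` instruction, or LuaJIT code. The argument is reduced once, to r in about [-pi/4, pi/4] plus a quadrant k, and both results are derived from that single reduction. The truncated Taylor polynomials and the naive one-step reduction are assumptions chosen for brevity. Accuracy degrades for large |x|, where real libraries use far more careful reduction.

```c
/* Hypothetical combined sine/cosine: one argument reduction feeds both
   results. Sketch only; the naive reduction loses accuracy for large |x|
   and the (long) conversion overflows for huge |x|. */

static const double PIO2 = 1.57079632679489661923;        /* pi/2 */
static const double TWO_OVER_PI = 0.63661977236758134308; /* 2/pi */

/* Taylor polynomial for sin through r^13, for |r| <= pi/4 (Horner form). */
static double poly_sin(double r)
{
    double r2 = r * r;
    return r * (1.0 + r2 * (-1.0 / 6 + r2 * (1.0 / 120 + r2 * (-1.0 / 5040
           + r2 * (1.0 / 362880 + r2 * (-1.0 / 39916800
           + r2 * (1.0 / 6227020800.0)))))));
}

/* Taylor polynomial for cos through r^14, for |r| <= pi/4 (Horner form). */
static double poly_cos(double r)
{
    double r2 = r * r;
    return 1.0 + r2 * (-1.0 / 2 + r2 * (1.0 / 24 + r2 * (-1.0 / 720
           + r2 * (1.0 / 40320 + r2 * (-1.0 / 3628800
           + r2 * (1.0 / 479001600 + r2 * (-1.0 / 87178291200.0)))))));
}

void my_sincos(double x, double *s, double *c)
{
    /* One reduction: x = r + k*(pi/2), k rounded to nearest (no libm call). */
    double q = x * TWO_OVER_PI;
    long k = (long)(q >= 0 ? q + 0.5 : q - 0.5);
    double r = x - (double)k * PIO2;
    double sr = poly_sin(r), cr = poly_cos(r);

    /* Rotate by the quadrant k mod 4 (normalized for negative k). */
    switch (((k % 4) + 4) % 4) {
    case 0:  *s = sr;  *c = cr;  break;
    case 1:  *s = cr;  *c = -sr; break;
    case 2:  *s = -sr; *c = -cr; break;
    default: *s = -cr; *c = sr;  break;
    }
}
```

Call it as `my_sincos(x, &s, &c)`. The quadrant switch is the only place where the two results diverge. Two separate calls would each repeat the reduction.
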
How would LuaJIT handle the case where math.cos has been replaced with something else? I dunno, I'm not Mike, but I have a feeling that more "atomic" representations (such as cossin or sincos) might be more doable, and they could be done right now. I guess if I wanted to add this to LuaJIT, I'd just find a function that returns two values and copy how it's done.
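
The question about a redefined math.cos can be made concrete. Below is a toy C sketch of the adjacent sin/cos merge suggested in the quoted message (compare the two FPMATH lines on the same operand in the trace dump there), together with the guard such a merge would need. The IR types (`Ins`, `Op`, `OP_PROJ_COS`) and the `builtin` flag are hypothetical and are not LuaJIT's data structures. The flag stands in for whatever check confirms that each call really targets the built-in function. A user-replaced math.cos is therefore simply not fused.

```c
#include <stddef.h>

/* Hypothetical toy IR, loosely modeled on the FPMATH lines of a trace
   dump; these are not LuaJIT's real data structures. */
typedef enum { OP_NOP, OP_SIN, OP_COS, OP_SINCOS, OP_PROJ_COS } Op;

typedef struct {
    Op op;
    int arg;      /* index of the operand instruction */
    int builtin;  /* 1 if the call is known to target the built-in function */
} Ins;

/* Merge "SIN x" immediately followed by "COS x" into "SINCOS x".
   The SINCOS keeps sin as its primary result; the old COS slot becomes
   a projection of the cos half. Calls not known to be built-ins (e.g. a
   reassigned math.cos) are left alone. Returns the number of fused pairs. */
static int fuse_sincos(Ins *ir, size_t n)
{
    int fused = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        Ins *a = &ir[i], *b = &ir[i + 1];
        if (a->op == OP_SIN && b->op == OP_COS && a->arg == b->arg &&
            a->builtin && b->builtin) {
            a->op = OP_SINCOS;
            b->op = OP_PROJ_COS;
            b->arg = (int)i;  /* now refers to the SINCOS, not to x */
            fused++;
            i++;              /* skip the instruction just rewritten */
        }
    }
    return fused;
}
```

Keeping sin as the SINCOS instruction's primary result means existing users of that slot are unaffected. Only the former COS slot changes meaning. This is one design choice among several.
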
On 7/23/11 7:25 PM, David Manura wrote:

There are a number of occasions (e.g. rotation transformations) where
both the sine and cosine of a number need to be computed together, and
it can be more efficient to do this as a single operation [1,2].  To
take a trivial benchmark:

```c
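#include <stdio.h>
#include <math.h>

/* static: under C99 inline semantics, a plain "inline" definition emits
   no external symbol and can fail to link when not inlined (e.g. -O0) */
static inline double f(double angle) {
    double s = sin(angle);
    double c = cos(angle);
    return s + c;
}

int main(void) {
    double i;
    double sum = 0;
    for (i = 1; i < 1e7; i++) { sum += f(i); }
    printf("%e\n", sum);
    return 0;
}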

```

GCC 4.4.4 (even under "-ffast-math -msse2") compiles that to an fsincos
instruction.  Intel C++ 2011 compiles it to a ___libm_sse2_sincos call.
MSVC++ 2010 compiles it to separate __CIcos/__CIsin calls, to fsin/fcos
under -fp:fast, or to ___libm_sse2_sin/___libm_sse2_cos under -arch:SSE2.

Here's what we get with LuaJIT2:

```lua
local function f(angle)
  local s = math.sin(angle)
  local c = math.cos(angle)
  return s + c
end
local sum = 0
for i = 1, 1e7 do sum = sum + f(i) end
print(sum)
```

```
0027 ------ LOOP ------------
0028    num CONV   0025  num.int
0029    num FPMATH 0028  sin
0030    num FPMATH 0028  cos
0032  + num ADD    0031  0024
0033  + int ADD    0025  +1
0034>   int LE     0033  +10000000
0035    int PHI    0025  0033
0036    num PHI    0024  0032
---- TRACE 1 mcode 352

->LOOP:
b77d7fb0  xorps xmm6, xmm6
b77d7fb3  cvtsi2sd xmm6, edi
b77d7fb7  movsd [esp+0x8], xmm6
b77d7fbd  fld qword [esp+0x8]
b77d7fc1  fsin
b77d7fc3  fstp qword [esp]
b77d7fc6  movsd xmm5, [esp]
b77d7fcb  fld qword [esp+0x8]
b77d7fcf  fcos
b77d7fd1  fstp qword [esp]
b77d7fd4  movsd xmm6, [esp]
b77d7fe4  cmp edi, 0x00989680
b77d7fea  jle 0xb77d7fb0	->LOOP
b77d7fec  jmp 0xb77d0014	->3
---- TRACE 1 stop ->  loop

```

I suppose the optimizer could recognize the adjacent sin/cos calls in
the IR and merge them into a single fsincos.  If compiling sincos to
SSE2, you might need an SSE2 math library (e.g.
http://gruntthepeon.free.fr/ssemath/).

None of this seems to make a big difference, though.
___libm_sse2_sincos is actually a little slower than fsincos here, and
the speedup over the separate fsin/fcos instructions is only maybe 30%,
though that depends on your library implementation and its accuracy
level.  It may make a bit more difference in standard Lua, where one C
function call replaces two, and the lqd binding already provides one
[3].  Even Lua has the somewhat related math.atan2, though not for the
same reasons.  Here's an example of such a function:

```c
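/* math.sincos(x) -> sin(x), cos(x); register as {"sincos", math_sincos}
   in mathlib.  l_tg is the float-type math macro lmathlib.c used then. */
static int math_sincos (lua_State *L) {
  lua_Number x = luaL_checknumber(L, 1);
  lua_pushnumber(L, l_tg(sin)(x));  /* first result */
  lua_pushnumber(L, l_tg(cos)(x));  /* second result */
  return 2;  /* number of values returned to Lua */
}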

```

[1] http://linux.die.net/man/3/sincos
[2] http://stackoverflow.com/questions/2683588/what-is-the-fastest-way-to-compute-sin-and-cos-together
[3] http://lua-users.org/lists/lua-l/2009-04/msg00143.html