Even without denormals, if LuaJIT was using SSE for floating point and the C version wasn't, that would probably explain a fair portion of the difference.
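
A quick way to check whether denormals are really in play is to classify the values once the loops have run. The following is only a sketch under a few assumptions: `is_subnormal` and `count_subnormals` are hypothetical helper names, not part of Ico's program, and the check compares against `DBL_MIN` from <float.h> (the smallest positive normal double) so that it does not depend on libm.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if x is nonzero and its magnitude is
 * below DBL_MIN, i.e. x is a denormal (subnormal) double.
 * NaN and infinities fail the range comparisons and yield 0. */
int is_subnormal(double x)
{
        return x != 0.0 && x > -DBL_MIN && x < DBL_MIN;
}

/* Hypothetical helper: counts the denormal entries in v[0..n-1]. */
size_t count_subnormals(const double *v, size_t n)
{
        size_t i, count = 0;

        for (i = 0; i < n; i++)
                count += (size_t)is_subnormal(v[i]);
        return count;
}
```

In the C test, one could call `is_subnormal` on the `a` and `b` fields after the outer loop and print how many hits there are. As far as I know, gcc's -ffast-math turns on flush-to-zero at startup on x86 with SSE, so the check only tells you something in a build without that flag.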

Mark

On Dec 20, 2011, at 4:05 AM, Ico Doornekamp wrote:

> A small message to let the list know the source of my problem: it seems
> that the code caused a lot of calculations resulting in 'denormal
> numbers', which tend to be handled much slower on some hardware [1]. My
> solution (workaround?) was to enable SSE and add the -ffast-math flag to
> gcc to tell the compiler I don't really care about very precise answers.
> 
> I'm not sure how denormals affect luajit, but it seems that in this case
> this is no problem for the luajit implementation.
> 
> 
> 1. http://en.wikipedia.org/wiki/Denormal_number#Performance_issues
> 
> 
> * On Tue Dec 20 11:24:23 +0100 2011, Eike Decker wrote:
> 
>> Just a guess into the blue, but maybe it's a double/int conversion in the
>> first loop? You are mixing int and double values there... What if you
>> make sure that all arithmetic operations are done in double precision?
>> On Dec 20, 2011 11:05 AM, "Ico Doornekamp" <lua@zevv.nl> wrote:
>> 
>>> * On Tue Dec 20 09:38:28 +0100 2011, steve donovan wrote:
>>> 
>>>> On Tue, Dec 20, 2011 at 10:33 AM, Ross Bencina
>>>> <rossb-lists@audiomulch.com> wrote:
>>>>> What are your results if you comment these out?
>>>> 
>>>> Precisely my thought. Commented out the print/printf, pushed S up to
>>>> 10000, and the luajit and C times are practically the same, at about
>>>> 0.6 sec.
>>> 
>>> Ok, so there seems to be some kind of architecture / implementation
>>> thing going on. I changed the programs not to print anything, new code
>>> attached below. The time results:
>>> 
>>> plain lua     : 3.952s
>>> luajit        : 0.055s
>>> gcc -O0       : 1.394s
>>> gcc -O3       : 1.395s
>>> 
>>> in which gcc is still 25 times slower than luajit!
>>> 
>>> Note that the above numbers are still measured on my Core 2 Duo.
>>> 
>>> I just did the same test on an Intel Xeon @ 3.00GHz:
>>> 
>>> plain lua     : 1.367s
>>> luajit        : 0.060s
>>> gcc -O0       : 0.367s
>>> gcc -O3       : 0.014s
>>> 
>>> Things start to get interesting here: I think there might be an issue with
>>> optimization on my Core 2, since there is virtually no difference between
>>> the unoptimized and the optimized versions. On the Xeon the results are as
>>> expected though, with C coming out ahead of luajit, but not by much.
>>> 
>>> I guess my problem has no place on the lua list after all. Apologies for
>>> the noise, I will move to the appropriate mailing list, as soon as I find
>>> out where I need to go :)
>>> 
>>> Thanks,
>>> 
>>> Ico
>>> 
>>> 
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> local N = 4000
>>> local S = 1000
>>> 
>>> local t = {}
>>> 
>>> for i = 0, N do
>>>  t[i] = {
>>>     a = 0,
>>>     b = 1,
>>>     f = i * 0.25
>>>  }
>>> end
>>> 
>>> for j = 0, S-1 do
>>>  for i = 0, N-1 do
>>>     t[i].a = t[i].a + t[i].b * t[i].f
>>>     t[i].b = t[i].b - t[i].a * t[i].f
>>>  end
>>> end
>>> 
>>> return t[1].a
>>> 
>>> ----------------------------------------------------------------------
>>> 
>>> #include <stdio.h>
>>> 
>>> #define N 4000
>>> #define S 1000
>>> 
>>> struct t {
>>>       double a, b, f;
>>> };
>>> 
>>> int main(int argc, char **argv)
>>> {
>>>       int i, j;
>>>       struct t t[N];
>>> 
>>>       for(i=0; i<N; i++) {
>>>               t[i].a = 0;
>>>               t[i].b = 1;
>>>               t[i].f = i * 0.25;
>>>       }
>>> 
>>>       for(j=0; j<S; j++) {
>>>               for(i=0; i<N; i++) {
>>>                       t[i].a += t[i].b * t[i].f;
>>>                       t[i].b -= t[i].a * t[i].f;
>>>               }
>>>       }
>>> 
>>>       return t[1].a;
>>> }
>>> 
>>> --
>>> :wq
>>> ^X^Cy^K^X^C^C^C^C
>>> 
>>> 
> -- 
> :wq
> ^X^Cy^K^X^C^C^C^C
> 
>