LuaJIT2 performance for number crunching

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: LuaJIT2 performance for number crunching
From: Francesco Abbate <francesco.bbt@...>
Date: Sun, 13 Feb 2011 01:39:42 +0100

Hi all,

I have been quite excited in the last weeks about LuaJIT2 and its new
FFI library. As you know the performance of LuaJIT2 in term of
execution speed can, in some cases, approach the speed of C or Fortran
compiled code. The FFI library is also very interesting because it
does allow a very easy way to call directly C routine but also to
improve the efficiency by using native C structures and arrays to
greatly improves, in some cases, the memory usage and speed if
compared with a standard Lua implementation based on Lua tables.

As many have already pointed out there is a big problem when using
external C libraries that repeatedly calls Lua callback functions,
like Leo Rozumov was mentioning for numerical integration routines
that requires a pointer to a C function to evaluate the function. The
problem is that in these circumstances LuaJIT is not able to optimize
the Lua callback function because it is called from the C code.

I did think a lot of think about this problem because this is one of
the major problem of GSL Shell with routine like ODE integration,
non-linear fit and numerical integration. All these function requires
to call many time a Lua function from C Code and the performance are
very bad in term of execution speed.

Some suggested to rewrite the C routines in pure Lua to take advantage
of the exceptional performances of LuaJIT2 and avoid the C callback
problem altogether.

In order to test how effective was actually the LuaJIT2 for heavily
numeric code I've chosen a classical examples, the Runge-Kutta 4th
order ODE integration method, to test the LuaJIT2 performances versus
C optimized code. For the implementation I've taken the GSL
implementation of the method, the rk4.c is taken from the GSL library
and is included so that people can easily look at it.

I've implemented straightforwardly the same routines 'step' and
'apply' is pure Lua to compare the performance. The idea is that we
write exactly the same implementation of the algorithm in Lua and in C
and we test the results.

A this point a remark is needed about the implementation of the
algorithm. It does works for system of any dimension and it works by
using an array of doubles to store the computed values and
derivatives. As in this example we have only a 2-element
vector it would have been easy to unpack the components to have a more
efficient Lua code. The idea is that we don't want
to do that because we want to have an algorithm that can work for ODE
systems of any dimension, from 1 to N where N is possibly a big
integer.

In Lua I've got a few options to store the array: Lua native tables,
VLA (variable length array) with FFI and GSL matrices as implemented
in GSL shell. These latters have the handicap that every single
read/write access to the element of the matrix is made by a C function
that checks the index to detect out-of-range errors.

The Lua code is included and I believe it should be quite clear by itself.


Here the results of the benchmark:

LuaJIT2 with FFI, 0m47.805s
LuaJIT2 with GSL, 1m20.958s
LuaJIT2 with Lua tables, 0m9.319s
Lua 5.1.4, 0m26.644s
C code with GSL, (compiled with -O2): 0m0.607s

It is possible that I've made some errors in the benchmark, I will
wait for comments but it seems to me that te results make sense.

What Is remark is:
LuaJIT2 gives better results when using Lua tables because probably
can do the better optimizations. In this case it is x15 times slower
than plain C code.

Performance of Lua 5.1.4 is x2.85 times slower that LuaJIT2, not so
bad indeed. Is my benchmark flawed ?

For the other side I was surprised by the FFI performance that are
quite bad. I guess it is due to less effective optimizations made by
the JIT compiler but slower then plain Lua ??

Any comment is welcome on this results. I know that I may have made
many errors in the benchmark so don't be too much harsh with me :-)

-- 
Francesco

use = 'Lua'

if use == 'FFI' then
   ffi = require 'ffi'
   darray = ffi.typeof("double[?]")
elseif use == 'GSL' then
   darray = function(n) return new(n, 1) end
else
   darray = function(n) return {} end
end

rk4 = {}

function rk4.new(n)
  local s = {
    k=         darray(n+1), 
    k1=        darray(n+1),
    y0=        darray(n+1),
    ytmp=      darray(n+1),
    y_onestep= darray(n+1),
    dim = n
  }
  return s
end

function rk4.step(y, state, h, t, sys)
  -- Makes a Runge-Kutta 4th order advance with step size h.
  local dim = state.dim
  local f = sys.f

  -- initial values of variables y.
  local y0 = state.y0
  
  -- work space 
  local ytmp = state.ytmp

  -- Runge-Kutta coefficients. Contains values of coefficient k1
  -- in the beginning 
  local k = state.k

  -- k1 step 
  for i=1, dim do
    y[i] = y[i] + h / 6 * k[i]
    ytmp[i] = y0[i] + 0.5 * h * k[i]
  end
 
  -- k2 step

  f(t + 0.5 * h, ytmp, k)

  for i=1, dim do
    y[i] = y[i] + h / 3 * k[i]
    ytmp[i] = y0[i] + 0.5 * h * k[i]
  end

  -- k3 step 
  f(t + 0.5 * h, ytmp, k)

  for i=1, dim do
    y[i] = y[i] + h / 3 * k[i]
    ytmp[i] = y0[i] + h * k[i]
  end

  -- k4 step 
  f(t + h, ytmp, k)

  for i=1, dim do
    y[i] = y[i] + h / 6 * k[i]
  end
end

function rk4.apply(state, t, h, y, yerr, dydt_in, dydt_out, sys)
  local f, dim = sys.f, state.dim
  local k, k1, y0, y_onestep = state.k, state.k1, state.y0, state.y_onestep

  for j=1,dim do y0[j] = y[j] end

  if dydt_in then 
     for j=1,dim do k[j] = dydt_in[j] end
  else 
     f(t, y0, k)
  end

  -- Error estimation is done by step doubling procedure 
  -- Save first point derivatives
  for j=1,dim do k1[j] = k[j] end

  -- First traverse h with one step (save to y_onestep) 
  for j=1,dim do y_onestep[j] = y[j] end

  rk4.step (y_onestep, state, h, t, sys)

  -- Then with two steps with half step length (save to y) 
  for j=1,dim do k[j] = k1[j] end

  rk4.step(y, state, h/2, t, sys)

  -- Update before second step 
  f(t + h/2, y, k)
  
  -- Save original y0 to k1 for possible failures 
  for j=1,dim do k1[j] = y0[j] end

  -- Update y0 for second step 
  for j=1,dim do y0[j] = y[j] end

  rk4.step(y, state, h/2, t + h/2, sys)

  -- Derivatives at output
  if dydt_out then f(t + h, y, dydt_out) end
  
  -- Error estimation
  --
  --   yerr = C * 0.5 * | y(onestep) - y(twosteps) | / (2^order - 1)
  --
  --   constant C is approximately 8.0 to ensure 90% of samples lie within
  --   the error (assuming a gaussian distribution with prior p(sigma)=1/sigma.)

  for i=1, dim do
    yerr[i] = 4 * (y[i] - y_onestep[i]) / 15
  end
end

function f_ode1(t, y, dydt)
   local p, q = y[1], y[2]
   dydt[1] = - q - p^2
   dydt[2] = 2*p - q^3
end

t0, t1, h0 = 0, 200, 0.001

function do_rk(p0, q0, sample)
  local dim = 2
  local state = rk4.new(dim)
  local y, dydt, yerr = darray(dim+1), darray(dim+1), darray(dim+1)
  local sys = {f = f_ode1}

  y[1], y[2] = p0, q0

  local t = t0
  local tsamp = t0
  rk4.apply(state, t, h0, y, yerr, nil, dydt, sys)
  t = t + h0
  while t < t1 do
     rk4.apply(state, t, h0, y, yerr, dydt, dydt, sys)
     t = t + h0
     if sample and t - tsamp > sample then
        print(t, y[1], y[2])
	tsamp = t
     end
  end
  print(t, y[1], y[2])
end

for k=1, 10 do
  local th = pi/4 -- *(k-1)/5
  local p0, q0 = cos(th), sin(th)
  do_rk(p0, q0)
end

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_odeiv.h>

#define ODE_SYS_DIM 2

int
f_ode1 (double t, const double y[], double f[], void *params)
{
  double p = y[0], q = y[1];
  f[0] = - q - p*p;
  f[1] = 2*p - q*q*q;
  return GSL_SUCCESS;
}

void
do_rk(gsl_odeiv_system *sys, double p0, double q0, double sample)
{
  gsl_odeiv_step *s = gsl_odeiv_step_alloc (gsl_odeiv_step_rk4, ODE_SYS_DIM);
  double y[ODE_SYS_DIM], dydt[ODE_SYS_DIM], yerr[ODE_SYS_DIM];
  double t = 0, t1 = 200, tsamp = 0, h0 = 0.001;
  
  y[0] = p0;
  y[1] = q0;

  gsl_odeiv_step_apply (s, t, h0, y, yerr, NULL, dydt, sys);
  t = t + h0;
  
  while (t < t1)
    {
      gsl_odeiv_step_apply (s, t, h0, y, yerr, dydt, dydt, sys);
      t = t + h0;
      if (sample > 0.0 && t - tsamp > sample)
	{
	  printf("%g %g %g\n", t, y[0], y[1]);
	  tsamp = t;
	}
    }

  printf(">> %g %g %g\n", t, y[0], y[1]);

  gsl_odeiv_step_free (s);
}

int
main()
{
  gsl_odeiv_system sys[1] = {{f_ode1, NULL, ODE_SYS_DIM, NULL}};
  int k;
  for (k = 0; k < 10; k++)
    {
      double th = 3.14159265359 / 4;
      double p0 = cos(th), q0 = sin(th);
      do_rk (sys, p0, q0, -1.0);
    }
  return 0;
}

/* ode-initval/rk4.c
 * 
 * Copyright (C) 1996, 1997, 1998, 1999, 2000 Gerard Jungman
 * 
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 3 of the License, or (at
 * your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 */

/* Runge-Kutta 4th order, Classical */

/* Author:  G. Jungman
 */

/* Reference: Abramowitz & Stegun, section 25.5. equation 25.5.10 

   Error estimation by step doubling, see eg. Ascher, U.M., Petzold,
   L.R., Computer methods for ordinary differential and
   differential-algebraic equations, SIAM, Philadelphia, 1998.
*/

#include <config.h>
#include <stdlib.h>
#include <string.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_odeiv.h>

#include "odeiv_util.h"

typedef struct
{
  double *k;
  double *k1;
  double *y0;
  double *ytmp;
  double *y_onestep;
}
rk4_state_t;

static void *
rk4_alloc (size_t dim)
{
  rk4_state_t *state = (rk4_state_t *) malloc (sizeof (rk4_state_t));

  if (state == 0)
    {
      GSL_ERROR_NULL ("failed to allocate space for rk4_state", GSL_ENOMEM);
    }

  state->k = (double *) malloc (dim * sizeof (double));

  if (state->k == 0)
    {
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k", GSL_ENOMEM);
    }

  state->k1 = (double *) malloc (dim * sizeof (double));

  if (state->k1 == 0)
    {
      free (state->k);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for k1", GSL_ENOMEM);
    }

  state->y0 = (double *) malloc (dim * sizeof (double));

  if (state->y0 == 0)
    {
      free (state->k);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for y0", GSL_ENOMEM);
    }

  state->ytmp = (double *) malloc (dim * sizeof (double));

  if (state->ytmp == 0)
    {
      free (state->y0);
      free (state->k);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for ytmp", GSL_ENOMEM);
    }

  state->y_onestep = (double *) malloc (dim * sizeof (double));

  if (state->y_onestep == 0)
    {
      free (state->ytmp);
      free (state->y0);
      free (state->k);
      free (state->k1);
      free (state);
      GSL_ERROR_NULL ("failed to allocate space for ytmp", GSL_ENOMEM);
    }

  return state;
}

static int
rk4_step (double *y, const rk4_state_t *state,
	  const double h, const double t, const size_t dim,
	  const gsl_odeiv_system *sys)
{
  /* Makes a Runge-Kutta 4th order advance with step size h. */
  
  /* initial values of variables y. */
  const double *y0 = state->y0;
  
  /* work space */
  double *ytmp = state->ytmp;

  /* Runge-Kutta coefficients. Contains values of coefficient k1
     in the beginning 
  */
  double *k = state->k;

  size_t i;

  /* k1 step */

  for (i = 0; i < dim; i++)
    {
      y[i] += h / 6.0 * k[i];
      ytmp[i] = y0[i] + 0.5 * h * k[i];
    }

  /* k2 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + 0.5 * h, ytmp, k);

    if (s != GSL_SUCCESS)
      {
	return s;
      }
  }

  for (i = 0; i < dim; i++)
    {
      y[i] += h / 3.0 * k[i];
      ytmp[i] = y0[i] + 0.5 * h * k[i];
    }

  /* k3 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + 0.5 * h, ytmp, k);

    if (s != GSL_SUCCESS)
      {
	return s;
      }
  }

  for (i = 0; i < dim; i++)
    {
      y[i] += h / 3.0 * k[i];
      ytmp[i] = y0[i] + h * k[i];
    }

  /* k4 step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + h, ytmp, k);

    if (s != GSL_SUCCESS)
      {
	return s;
      }
  }

  for (i = 0; i < dim; i++)
    {
      y[i] += h / 6.0 * k[i];
    }

  return GSL_SUCCESS;
}


static int
rk4_apply (void *vstate,
           size_t dim,
           double t,
           double h,
           double y[],
           double yerr[],
           const double dydt_in[],
           double dydt_out[], 
           const gsl_odeiv_system * sys)
{
  rk4_state_t *state = (rk4_state_t *) vstate;

  size_t i;

  double *const k = state->k;
  double *const k1 = state->k1;
  double *const y0 = state->y0;
  double *const y_onestep = state->y_onestep;

  DBL_MEMCPY (y0, y, dim);

  if (dydt_in != NULL)
    {
      DBL_MEMCPY (k, dydt_in, dim);
    }
  else
    {
      int s = GSL_ODEIV_FN_EVAL (sys, t, y0, k);

      if (s != GSL_SUCCESS)
	{
	  return s;
	}
    }

  /* Error estimation is done by step doubling procedure */

  /* Save first point derivatives*/
  
  DBL_MEMCPY (k1, k, dim);

  /* First traverse h with one step (save to y_onestep) */

  DBL_MEMCPY (y_onestep, y, dim);

  {
    int s = rk4_step (y_onestep, state, h, t, dim, sys);

    if (s != GSL_SUCCESS) 
      {
        return s;
      }
  }

  /* Then with two steps with half step length (save to y) */ 

  DBL_MEMCPY (k, k1, dim);

  {
    int s = rk4_step (y, state, h/2.0, t, dim, sys);

    if (s != GSL_SUCCESS)
      {
	/* Restore original values */
	DBL_MEMCPY (y, y0, dim);
	return s;
    }
  }

  /* Update before second step */
  {
    int s = GSL_ODEIV_FN_EVAL (sys, t + h/2.0, y, k);

    if (s != GSL_SUCCESS)
      {
	/* Restore original values */
	DBL_MEMCPY (y, y0, dim);
	return s;
      }
  }
  
  /* Save original y0 to k1 for possible failures */
  DBL_MEMCPY (k1, y0, dim);

  /* Update y0 for second step */
  DBL_MEMCPY (y0, y, dim);

  {
    int s = rk4_step (y, state, h/2.0, t + h/2.0, dim, sys);

    if (s != GSL_SUCCESS)
      {
	/* Restore original values */
	DBL_MEMCPY (y, k1, dim);
	return s;
      }
  }

  /* Derivatives at output */

  if (dydt_out != NULL) {
    int s = GSL_ODEIV_FN_EVAL (sys, t + h, y, dydt_out);

    if (s != GSL_SUCCESS)
      {
	/* Restore original values */
	DBL_MEMCPY (y, k1, dim);
	return s;
      }
  }
  
  /* Error estimation

     yerr = C * 0.5 * | y(onestep) - y(twosteps) | / (2^order - 1)

     constant C is approximately 8.0 to ensure 90% of samples lie within
     the error (assuming a gaussian distribution with prior p(sigma)=1/sigma.)

  */

  for (i = 0; i < dim; i++)
    {
      yerr[i] = 4.0 * (y[i] - y_onestep[i]) / 15.0;
    }

  return GSL_SUCCESS;
}

static int
rk4_reset (void *vstate, size_t dim)
{
  rk4_state_t *state = (rk4_state_t *) vstate;

  DBL_ZERO_MEMSET (state->k, dim);
  DBL_ZERO_MEMSET (state->k1, dim);
  DBL_ZERO_MEMSET (state->y0, dim);
  DBL_ZERO_MEMSET (state->ytmp, dim);
  DBL_ZERO_MEMSET (state->y_onestep, dim);

  return GSL_SUCCESS;
}

static unsigned int
rk4_order (void *vstate)
{
  rk4_state_t *state = (rk4_state_t *) vstate;
  state = 0; /* prevent warnings about unused parameters */
  return 4;
}

static void
rk4_free (void *vstate)
{
  rk4_state_t *state = (rk4_state_t *) vstate;
  free (state->k);
  free (state->k1);
  free (state->y0);
  free (state->ytmp);
  free (state->y_onestep);
  free (state);
}

static const gsl_odeiv_step_type rk4_type = { "rk4",    /* name */
  1,                            /* can use dydt_in */
  1,                            /* gives exact dydt_out */
  &rk4_alloc,
  &rk4_apply,
  &rk4_reset,
  &rk4_order,
  &rk4_free
};

const gsl_odeiv_step_type *gsl_odeiv_step_rk4 = &rk4_type;

Follow-Ups:
- Re: LuaJIT2 performance for number crunching, Leo Razoumov
- Re: LuaJIT2 performance for number crunching, Mike Pall

Prev by Date: Re: super-nginx build with LuaJIT-beta6
Next by Date: LuaJIT2 vs. JavaScript
Previous by thread: Fwd: Re: hardware + Lua resources
Next by thread: Re: LuaJIT2 performance for number crunching
Index(es):
- Date
- Thread