lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


Hello Jairo,

On Sat, Dec 05, 2020 at 03:47:05AM -0500, Jairo A. del Rio wrote:
> ...

Thank you for this opportunity to indulge in prime hunting, which, as we
all know, is one of the most healthy and useful activities.

The Xuedong Luo method and your implementation are markedly tight, with
priority to performance in the details. In contrast, I have attempted a
more relaxed implementation of Eratosthenes' sieve (a module Sieve1.lua)
to make it easier to understand what goes on and experiment with
variations. That implementation, in spite of this more relaxed approach,
is, in some cases, faster than yours.

Working on a description of Sieve1.lua, I found, however, that simpler
implementations, developed to support the description, also had merit
and actually led to some significant performance improvements of your
implementation of the Xuedong Luo method. So that's where I'll start.

Before doing that, however, I'll briefly summarize the adjustments to
your luo.lua in the shape of a diff between luo2.lua (an instrumented
version of your luo.lua) and luo4.lua that includes the adjustments:

  $ diff luo2.lua luo4.lua
  24c24
  <         step = step == 2 and 4 or 2
  ---
  >         step = 6 - step
  33c33
  <         j = c
  ---
  >         local j = c
  36,40c36,42
  <         while j <= M do
  <             S[j] = 0
  <             j = j + ij
  <             ij = t - ij
  <             count = count + 1
  ---
  >         if S[i] ~= 0 then
  >             while j <= M do
  >                 S[j] = 0
  >                 j = j + ij
  >                 ij = t - ij
  >                 count = count + 1
  >             end
  46,48c48,50
  <     for _, v in next, S do
  <         if v > 0 then
  <             result[#result + 1] = v
  ---
  >     for i = 1, #S do
  >         if S[i] > 0 then
  >             result[#result + 1] = S[i]

  $

These adjustments cut roughly a third of the run time required for
calculating the primes below 100000. Details follow.

The basic mechanism, attributed to Eratosthenes (about -276 to -195),
for finding primes is the following (see
https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes):

Start by listing all positive integers to some desired limit, here 25:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

The number 1 is not considered prime (and if used for sieving would
create havoc), so take the next number, 2, in the list, that's the first
prime. Now, since any number divisible by a prime is composite (unless
it is the prime itself), we can remove as composite all multiples of 2
from the list, excepting 2, leaving us with:

 1 2 3 5 7 9 11 13 15 17 19 21 23 25

The next number following 2 in the list is 3, so that's the second
prime. As before, we remove multiples of that prime from the list,
excepting the prime itself, to get rid of additional composites, leaving
us with:

 1 2 3 5 7 11 13 17 19 23 25

The next number in the list is 5, the third prime. We haven't said that
before, but it actually suffices, when removing multiples of some prime
p, to start at p^2 (= p*p, p squared). The reason is that composite
numbers below p^2 are divisible by a prime strictly less than p and so,
would have been removed earlier. For example, for p=5, 10=2*5 were
removed when removing multiples of 2. And 15=3*5 were removed when
removing multiples of 3.

The only multiple of 5 in the list, excluding 5 itself, is 25. After
removing 25 from the list, we are left with:

 1 2 3 5 7 11 13 17 19 23

Argueing as before, 7 is the fourth prime and 7*7 = 49 is the next
number targeted for removal. But 49 is beyond the list, so we are done
and all numbers in the list (except 1) are prime.

In Lua, this becomes Eratos1.lua:

  $ cat Eratos1.lua
  -- Eratos1.lua: Inspired by "Jairo A. del Rio" <jairoadelrio6@gmail.com>: Tips and questions concerning a prime sieve in Lua, simple sieve
  -- 2021-Apr-01 18.56 / TN

  print( "Eratos1.lua: 2021-Apr-28 21.21 / TN" )

  -- for k, v in pairs( arg ) do
    -- print( "arg[" .. k .. "] = <" .. v  .. ">" )
  -- end

  local print = print
  local tonumber = tonumber
  local arg = arg

  _ENV = nil

  local function Eratos1( l )
    local list = {}
    for i = 1, l do list[i] = i end
    local x = 1
    local count = 0
    while true do
      local p
      repeat
        x = x + 1
        p = list[x]
      until p ~= 0
      if not p or p*p > l then
        break
      end
      -- print( p )
      for i = p*p, l, p do
        list[i] = 0
        -- count = count + 1
      end
    end
    local prime = {}
    -- do return prime end
    for i = 2, l do
      if list[i] ~= 0 then
        -- print( "prime[" .. i .. "] = " .. list[i] )
        prime[#prime+1] = list[i]
      end
    end
    return prime, count
  end

  local l = tonumber( arg[1] )
  local c = tonumber( arg[2] or 1 )
  local result
  local count
  for i = 1, c do
    result, count = Eratos1( l )
  end
  print( "pi(" .. l .. ") = " .. #result .. ", count=" .. count .. " (calculated " .. c .. " times)" )
  $

(Much code quoted is work in progress, so please apologize for
occasional rough edges, such as remains of debugging activities.)
Briefly, list[i] represents the number i and is initialized with i
itself for a start. Subsequently, when sieving establishes that some i
is composite, list[i] is zeroed. Finally, the list is scanned for
non-zero numbers: These are all the primes and hence transferred to the
result prime list.

Establishing a suitable baseline:

  $ uname -a
  CYGWIN_NT-6.1-WOW Annette-Pc2 3.1.7(0.340/5/3) 2020-08-22 19:03 i686 Cygwin
  $ time lua-5.4.3 Eratos1.lua 1000000 100
  Eratos1.lua: 2021-Apr-28 21.21 / TN
  pi(1000000) = 78498, count=0 (calculated 100 times)

  real    0m27.616s
  user    0m24.585s
  sys     0m2.870s
  $

The calculation is summarized by printing pi(1000000), the number of
primes below the limit 1000000. Specific values of the prime-counting
function pi(x) can be found at

  https://en.wikipedia.org/wiki/Prime-counting_function.

Eratos1.lua with l=1000000 and c=100 performs the same calculation as
your (Jairo's) luo.lua:

  $ time lua-5.4.3 luo.lua

  real    0m21.828s
  user    0m20.373s
  sys     0m1.388s
  $

The situation is thus not entirely hopeless: Your program is faster,
yes, but not dramatically so.

So how can we improve this? A common idea used in the Xuedong Luo method
is making a special case of the smallest primes. To illustrate this,
let's make a special case of 2. This means that we can avoid handling
numbers divisible by 2, so some operations need only be carried out half
the number of times. This is Eratos2.lua:

  $ cat Eratos2.lua
  -- Eratos2.lua: Inspired by "Jairo A. del Rio" <jairoadelrio6@gmail.com>: Tips and questions concerning a prime sieve in Lua, special case 2 sieve
  -- 2021-Apr-01 18.58 / TN

  print( "Eratos2.lua: 2021-Apr-09 19.50 / TN" )

  local print = print
  local tonumber = tonumber
  local arg = arg
  local os = os

  _ENV = nil

  local function Eratos2( l, clockAccum )
    local clockStart
    local clockEnd
    clockStart = os.clock()
    local l0 = ( l + 1 ) // 2
    local list = {}
    for i = 1, l0 do list[i] = 2*i - 1 end
    local x = 1
    local count = 0
    clockEnd = os.clock()
    clockAccum.Init = (clockAccum.Init or 0) + clockEnd - clockStart
    clockStart = clockEnd
    while true do
      local p
      repeat
        x = x + 1
        p = list[x]
      until p ~= 0
      if not p or p*p > l then
        break
      end
      -- print( p )
      for i = ( p*p + 1 ) // 2, l0, p do
        list[i] = 0
        count = count + 1
      end
    end
    clockEnd = os.clock()
    clockAccum.Sieve = (clockAccum.Sieve or 0) + clockEnd - clockStart
    clockStart = clockEnd
    local prime = {}
    -- do return prime end
    if 2 <= l then
      prime[#prime+1] = 2
    end
    for i = 2, l0 do
      if list[i] ~= 0 then
        -- print( "prime[" .. i .. "] = " .. list[i] )
        prime[#prime+1] = list[i]
      end
    end
    clockEnd = os.clock()
    clockAccum.Pack = (clockAccum.Pack or 0) + clockEnd - clockStart
    return prime, count
  end

  local l = tonumber( arg[1] )
  local c = tonumber( arg[2] or 1 )
  local result
  local count
  local clockStart = os.clock()
  local clockAccum = {}
  for i = 1, c do
    result, count = Eratos2( l, clockAccum )
  end
  local clockEnd = os.clock()
  print( "pi(" .. l .. ") = " .. #result .. ", count=" .. count .. " (calculated " .. c .. " times)" )
  print( "clockAccum.Init=" .. clockAccum.Init )
  print( "clockAccum.Sieve=" .. clockAccum.Sieve )
  print( "clockAccum.Pack=" .. clockAccum.Pack )
  print( "clockTotal=" .. clockEnd - clockStart )
  $

In Eratos2.lua, the list only holds the odd numbers, so list[1]=1,
list[2]=3, list[3]=5, and so on.  The number of elements of the list is
calculated as l0 = (l+1)//2 for the limit l. This ensures that the list
is able to hold the largest odd number less than or equal to l.

The i'th odd number is easily calculated as 2*i-1.

Selecting the next prime, p, for sieving can be done in the same manner
as in Eratos1.lua, since the list still contains actual numbers.

To remove (set to zero) the numbers p*p, p*p+p, p*p+p+p, ... from the
list, we need to find the indexes of these numbers. Or some of them, at
least, since we are only dealing with odd numbers now.

In general, to find the index of an odd number, k, we calculate
(k+1)//2. And since p is odd, p*p is likewise odd and (p*p+1)//2 is the
index of p*p in the list. This explains the low limit of the loop that
iterates over the multiples of p.

It is clear that every second number in p*p, p*p+p, p*p+p+p, ... is even
and should not be used. And a bit of consideration shows that the
indexes of two neighboring odd numbers in the list p*p, p*p+p, p*p+p+p,
... differ by exactly p.  So p is used as the step in the sieve loop.

And running this:

  $ time lua-5.4.3 Eratos2.lua 1000000 100
  Eratos2.lua: 2021-Apr-09 19.50 / TN
  pi(1000000) = 78498, count=811068 (calculated 100 times)
  clockAccum.Init=5.577
  clockAccum.Sieve=6.092
  clockAccum.Pack=3.089
  clockTotal=14.758

  real    0m14.846s
  user    0m13.104s
  sys     0m1.684s
  $

This is really interesting, since it beats luo.lua by a significant
margin. I'll deal with this unexpected fact later. There is also some
instrumentation (the count and clock output) intended to help analyze
performance. Before doing that, however, let's take matters a step
further and specialize with respect to the prime 3, as well as the prime
2. This is the same level of specialization used in the Xuedong Luo
method. So here is Eratos3.lua doing that:

  $ cat Eratos3.lua
  -- Eratos3.lua: Inspired by "Jairo A. del Rio" <jairoadelrio6@gmail.com>: Tips and questions concerning a prime sieve in Lua, special case 2, 3 sieve
  -- 2021-Apr-01 18.59 / TN

  print( "Eratos3.lua: 2021-Apr-18 19.25 / TN" )

  local print = print
  local tonumber = tonumber
  local arg = arg
  local os = os
  local error = error

  _ENV = nil

  local function Eratos3( l, clockAccum )
    local clockStart
    local clockEnd
    clockStart = os.clock()
    local l0 = ( l + l%2 + 1 ) // 3
    local list = {}
    local u = 1
    local v = 4
    for i = 1, l0 do
      -- list[i] = 3*(i-1) + (i-1)%2 + 1
      list[i] = u
      u = u + v
      v = 6 - v
      -- print( "list[" .. i .. "] = " .. list[i] )
    end
    local x = 1
    local count = 0
    clockEnd = os.clock()
    clockAccum.Init = (clockAccum.Init or 0) + clockEnd - clockStart
    clockStart = clockEnd
    while true do
      local p
      repeat
        x = x + 1
        p = list[x]
      until p ~= 0
      if not p or p*p > l then
        break
      end
      local i = p*p // 3 + 1
      local q = p // 6
      local d
      if p % 6 == 1 then
        d = 8 * q + 1
      else
        d = 4 * q + 3
      end
      -- print( p )
      local z = 2 * p
      -- local c = 0
      while i <= l0 do
        list[i] = 0
        i = i + d
        d = z - d
        -- c = c + 1
        count = count + 1
      end
      -- print( "p=" .. p .. ", c=" .. c )
    end
    clockEnd = os.clock()
    clockAccum.Sieve = (clockAccum.Sieve or 0) + clockEnd - clockStart
    clockStart = clockEnd
    local prime = {}
    -- do return prime end
    for i = 2, 3 do
      if i <= l then prime[#prime+1] = i end
    end
    for i = 2, l0 do
      if list[i] ~= 0 then
        -- print( "prime[" .. i .. "] = " .. list[i] )
        prime[#prime+1] = list[i]
      end
    end
    clockEnd = os.clock()
    clockAccum.Pack = (clockAccum.Pack or 0) + clockEnd - clockStart
    return prime, count
  end

  local l = tonumber( arg[1] )
  local c = tonumber( arg[2] or 1 )
  local result
  local count
  local clockStart = os.clock()
  local clockAccum = {}
  for i = 1, c do
    result, count = Eratos3( l, clockAccum )
  end
  local clockEnd = os.clock()
  print( "pi(" .. l .. ") = " .. #result .. ", count=" .. count .. " (calculated " .. c .. " times)" )
  print( "clockAccum.Init=" .. clockAccum.Init )
  print( "clockAccum.Sieve=" .. clockAccum.Sieve )
  print( "clockAccum.Pack=" .. clockAccum.Pack )
  print( "clockTotal=" .. clockEnd - clockStart )
  $

The list should hold the numbers that are neither divisible by 2 nor 3,
so list[1]=1, list[2]=5, list[3]=7, list[4]=11, list[5]=13, list[6]=17,
list[7]=19, and so on. These are all the numbers of the form 6*k+1 and
6*k+5. And note that the difference between neighboring numbers
alternates between 4 and 2.

The number of elements in the list is calculated as

  l0 = ( l + l%2 + 1 ) // 3.

A bit of thought and experimenting serve to convince ourselves that the
list thereby becomes able to hold the largest number of the form 6*k+1
or 6*k+5 that are less than or equal to l.

The list itself is initialized by the values of u, initially 1, that are
formed by repeatedly adding v which alternates between 4 and 2. This is
similar to what happens in luo.lua and generates 1, 5, 7, 11, 13 ...

The sieve loop is more intricate: We need to map from the values that
should be excluded (p*p, p*p+p, ...) to their indexes in order to zero
the correct numbers in the list.

Observe initially that for the numbers in the list of the form 6*k+1,
the index is 2*k+1. And for the numbers of the form 6*k+5, the index is
2*k+2. For example, the index of 7=6*1+1 where k=1 is 2*k+1=3. And the
index of 17=6*2+5 with k=2 is 2*k+2=6.

If a number, n, is of one of the forms 6*k+1 or 6*k+5, we can convince
ourselves that its index in the list is n//3+1: If n = 6*k+1, n//3 = 2*k
and n//3+1 is the desired index value 2*k+1. And if n = 6*k+5, n//3 =
2*k+1 and again n//3+1 is the desired index value which is 2*k+2.

The index of p*p, the initial value of i, is therefore calculated as
p*p//3+1.

Now we have to zero the numbers with indexes that correspond to the
subset of the numbers p*p, p*p+p, p*p+2*p, ... which are not divisible
by 2 or 3. Let's begin by considering an example, p=5: For this p, the
numbers we wish to exclude are

  25 30 35 40 45 50 55 60 65 70 75 80 85 ...

>From this list, we should exclude numbers divisible by 2 or 3, that is,
numbers that are not of the form 6*k+1 or 6*k+5. These numbers are
already excluded from consideration because we have specialized with
respect to 2 and 3. Excluding numbers divisible by 2 or 3 leaves us with

  25 35 55 65 85 ...

The corresponding list of indexes (n//3+1 for each n in the list) is

  9 12 19 22 29 ...

So adding 3 and 7 alternatingly generates these indexes.

Another example, p=7: Excluding numbers divisible by 2 or 3 from

  49 56 63 70 77 84 91 98 105 112 119 126 133 ...

leaves

  49 77 91 119 133 ...

so the list of indexes to be zeroed is

  17 26 31 40 45 ...

Similar pattern as for p=5: Adding 9 and 5 alternatingly generates these
indexes.

In general, for a p of the form 6*k+1, adding d = 8*(p//6)+1 and 2*p-d
alternatingly generates the desired indexes. And for a p of the form
6*k+5, adding d = 4*(p//6)+3 and 2*p-d alternatingly does the job.
That's what happens in Eratos3.lua.

Running Eratos3.lua:

  $ time lua-5.4.3 Eratos3.lua 1000000 100
  Eratos3.lua: 2021-Apr-18 19.25 / TN
  pi(1000000) = 78498, count=429626 (calculated 100 times)
  clockAccum.Init=4.588
  clockAccum.Sieve=5.815
  clockAccum.Pack=2.561
  clockTotal=12.964

  real    0m13.024s
  user    0m11.575s
  sys     0m1.435s
  $

So Eratos3.lua is slightly faster than Eratos2.lua: The clockTotal for
Eratos2.lua reported above is 14.586.

Getting back to your (Jairo's) luo.lua, to enable a more detailed
comparison with Eratos3.lua, I have instrumented luo.lua in the same way
as Eratos3.lua. The difference between your original luo.lua and the
instrumented luo2.lua is:

  $ diff luo.lua luo2.lua
  4c4,7
  < function primesieve(N)
  ---
  > function primesieve(N,clockAccum)
  >     local clockStart
  >     local clockEnd
  >     clockStart = os.clock()
  22a26,29
  >     local count = 0
  >     clockEnd = os.clock()
  >     clockAccum.Init = (clockAccum.Init or 0) + clockEnd - clockStart
  >     clockStart = clockEnd
  32a40
  >             count = count + 1
  34a43,45
  >     clockEnd = os.clock()
  >     clockAccum.Sieve = (clockAccum.Sieve or 0) + clockEnd - clockStart
  >     clockStart = clockEnd
  40c51,53
  <     return result
  ---
  >     clockEnd = os.clock()
  >     clockAccum.Pack = (clockAccum.Pack or 0) + clockEnd - clockStart
  >     return result, count
  42a56,58
  > local count
  > local clockStart = os.clock()
  > local clockAccum = {}
  44c60
  <     X = #primesieve(1000000)
  ---
  >     X, count = primesieve(1000000,clockAccum)
  45a62,67
  > local clockEnd = os.clock()
  > print( "pi(" .. 1000000 .. ") = " .. #X .. ", count=" .. count .. " (calculated " .. 100 .. " times)" )
  > print( "clockAccum.Init=" .. clockAccum.Init )
  > print( "clockAccum.Sieve=" .. clockAccum.Sieve )
  > print( "clockAccum.Pack=" .. clockAccum.Pack )
  > print( "clockTotal=" .. clockEnd - clockStart )
  $

And running luo2.lua:

  $ time lua-5.4.3 luo2.lua
  pi(1000000) = 78498, count=580993 (calculated 100 times)
  clockAccum.Init=6.386
  clockAccum.Sieve=10.53
  clockAccum.Pack=4.955
  clockTotal=21.871

  real    0m22.000s
  user    0m20.326s
  sys     0m1.575s
  $

The first thing to notice here is that the figures reported by the time
command are not significantly larger than the corresponding figures
reported for the original luo.lua, repeated here:

  $ time lua-5.4.3 luo.lua

  real    0m21.828s
  user    0m20.373s
  sys     0m1.388s
  $

If there is a small difference, this is most likely caused primarily by
the count = count + 1 which has been added to the inner sieving loop.

Before continuing, some words about the instrumentation used: Resource
usage is measured by the Lua os.clock() function which is based on the
standard C library clock() function. To measure resource usage of some
section of the program, take the difference between the values of
os.clock() before and after that program section executes.

It is usually important to avoid too many calls to os.clock(), since
those calls and the accompanying calculations take time and may disturb
the measurements significantly. In any case, it is preferable to measure
fairly long-running program sections, since the clock() may "tick" with
quite large intervals, that is, be relatively inaccurate.

For both Eratos3.lua and luo2.lua, three os.clock() based measurements
are taken, one for each of the main sections: Initialization, sieving,
and packing of the result. And because the base test consists of 100
repetitions of the calculation of the primes below 1000000, we need to
accumulate the measurements for each section. This is handled, a bit
cumbersome, using the clockAccum table parameter which has been added to
the main sieve function.

In addition to the os.clock() measurements, the number of times the
inner sieve loop is executed is counted. There is no need for
accumulation here, so that value, the count, is simply returned as an
additional result of the main sieve function.

As a final check, the executions are performed under the control of the
time command which provides a simple sanity check for the figures
reported.

Now compare the measurements for luo2.lua and Eratos3.lua:

  $ time lua-5.4.3 luo2.lua
  pi(1000000) = 78498, count=580993 (calculated 100 times)
  clockAccum.Init=6.386
  clockAccum.Sieve=10.53
  clockAccum.Pack=4.955
  clockTotal=21.871

  real    0m22.000s
  user    0m20.326s
  sys     0m1.575s
  $

  $ time lua-5.4.3 Eratos3.lua 1000000 100
  Eratos3.lua: 2021-Apr-18 19.25 / TN
  pi(1000000) = 78498, count=429626 (calculated 100 times)
  clockAccum.Init=4.588
  clockAccum.Sieve=5.815
  clockAccum.Pack=2.561
  clockTotal=12.964

  real    0m13.024s
  user    0m11.575s
  sys     0m1.435s
  $

We notice several things: The initialization by luo2.lua is a couple of
seconds slower than the initialization by Eratos3.lua. Comparing the
code, luo2.lua has

    local step = 2
    local first = 5
    while first <= N do
        S[#S+1] = first
        first = first + step
        step = step == 2 and 4 or 2
    end

whereas Eratos3.lua uses:

  local u = 1
  local v = 4
  for i = 1, l0 do
    list[i] = u
    u = u + v
    v = 6 - v
  end

Adjusting luo2.lua to use step = 6-step should save a little.

Also the packing of final result, the list of primes, is slower in
luo2.lua that uses:

    for _, v in next, S do
        if v > 0 then
            result[#result + 1] = v
        end
    end

Where Eratos3.lua uses:

  for i = 2, l0 do
    if list[i] ~= 0 then
      -- print( "prime[" .. i .. "] = " .. list[i] )
      prime[#prime+1] = list[i]
    end
  end

Actually, the luo2.lua code is not only slower, but actually risks
generating a list of primes out of order: There is no guarantee that the
table keys are processed in numerical order when using next.

More mysterious initially is the fact that luo2.lua uses 580993
iterations of the inner sieve loop where Eratos3.lua can make do with
only 429626. The reason turns out to be that luo2.lua spends time
eliminating multiples of composite numbers: For example, multiples of 25
are eliminated, but this is unnecessary, since these multiples have all
been eliminated when multiples of 5 were eliminated.  Figuring out what
to do about this takes some thinking, but fortunately the fix is easy:
Simply refrain from executing the inner sieve loop if S[i] == 0: S[i] is
the value whose multiples are being zeroed, so if S[i] is zero, it is
composite and, hence, its multiples have already been excluded. (This
improvement is relevant also for the original Xuedon Luo method.)

Similar logic is found in Eratos3.lua in the form of the following loop:

    repeat
      x = x + 1
      p = list[x]
    until p ~= 0

Accordingly, the following adjustments to luo2.lua are made in luo3.lua
to improve performance:

  $ diff luo2.lua luo3.lua
  24c24
  <         step = step == 2 and 4 or 2
  ---
  >         step = 6 - step
  36,40c36,42
  <         while j <= M do
  <             S[j] = 0
  <             j = j + ij
  <             ij = t - ij
  <             count = count + 1
  ---
  >         if S[i] ~= 0 then
  >             while j <= M do
  >                 S[j] = 0
  >                 j = j + ij
  >                 ij = t - ij
  >                 count = count + 1
  >             end
  46,48c48,50
  <     for _, v in next, S do
  <         if v > 0 then
  <             result[#result + 1] = v
  ---
  >     for i = 1, #S do
  >         if S[i] > 0 then
  >             result[#result + 1] = S[i]
  $

Then:

  $ time lua-5.4.3 luo3.lua
  pi(1000000) = 78498, count=429627 (calculated 100 times)
  clockAccum.Init=5.828
  clockAccum.Sieve=9.941
  clockAccum.Pack=2.405
  clockTotal=18.174

  real    0m18.227s
  user    0m16.629s
  sys     0m1.575s
  $

We notice that all sections have been speeded up: The luo3.lua
initialization is still a bit slower than the Eratos3.lua
initialization: Presumably, this is explained by the different loop
contructs used. And the packing is performing similarly in luo3.lua and
Eroatos3.lua and understandably: The packing loops are equivalent.

The sieving section has been speeded up, but there is still some way to
go, which is a bit strange, considering that the luo3.lua sieve loop

            while j <= M do
                S[j] = 0
                j = j + ij
                ij = t - ij
                count = count + 1
            end

is statement for statement equivalent to the Eratos3.lua sieve loop:

    while i <= l0 do
      list[i] = 0
      i = i + d
      d = z - d
      -- c = c + 1
      count = count + 1
    end

The reason turns out to be that the luo.lua variable j is global. Hence,
we declare j local like this in luo4.lua:

  $ diff luo3.lua luo4.lua
  33c33
  <         j = c
  ---
  >         local j = c
  $

Resulting in:

  $ time lua-5.4.3 luo4.lua
  pi(1000000) = 78498, count=429627 (calculated 100 times)
  clockAccum.Init=6.005
  clockAccum.Sieve=5.818
  clockAccum.Pack=2.513
  clockTotal=14.336

  real    0m14.471s
  user    0m12.667s
  sys     0m1.715s
  $

Overall, this essentially eliminates the difference between the
performance of luo2.lua and Eratos3.lua, except for the initialization,
as noted.

Finally, one detail stands out: The count reported by both luo3.lua and
luo4.lua is one greater than the count reported by Eratos3.lua. Looking
into this, it turns out that the original Xuedong Luo method presupposes
that the N parameter is of the form 3*M+2 with odd M. So the odd M is =
2*k+1 for some k, so N = 3*(2*k+1)+2 = 6*k+5, and the N = 1000000 used
in the test is not of this form. As a result, the inner sieving loop of
luo.lua in rare cases sets a S[j] = 0 outside the intended range,
explaining the difference in the counts.

So, with these initial considerations out of the way, the original plan
was to present the very different module that I mentioned earlier,
Sieve1.lua.  However, this mail is already too long, so I'll limit
myself to simply attaching the code (Sieve1_20210507_1322.zip containing
Sieve1.lua and required utility modules Integer1.lua and List1.lua) and
just comment briefly.

To run the code, unpack the files into some directory and:

  $ time lua-5.4.3 -e'local s=require "Sieve1"; for n=1,100 do ps=s.primesTo( 1000000 ) end print( "moduleCount=" .. s.moduleCount .. ", #ps=" .. #ps )'
  moduleCount=3, #ps=78498

  real    0m11.119s
  user    0m10.872s
  sys     0m0.249s
  $

This mimics the luo.lua calculation by calculating the primes to 1000000
a hundred times. (And note: This is slightly faster than both luo4.lua
and Eratos3.lua.)

Sieve1.lua is different from Eratos3.lua in (at least) two significant
ways:

1. The number of primes (called modules) handled specially (like 2 and 3
   in luo.lua and Eratos3.lua) can be varied. This allows easy
   experimentation with this parameter. The module count is hardcoded in
   Sieve1.lua, presently Sieve1.moduleCount = 3, so that the primes
   handled specially are 2, 3, and 5. This particular value of the
   module count gives the best performance for the primes below 1000000
   in that both using 2 and 4 modules is slower. This minimum value may
   change with the number of primes desired, however, so using 3 is
   perhaps not the end of the story.

2. The Sieve1.lua module uses a single bit (with 0 meaning potentially
   prime and 1 denoting composite) to represent an entry in the list of
   numbers on which exclusion is performed. This means that (under
   ordinary circumstances) 64 numbers (bits of a Lua integer) are
   handled in parallel. But beware that for larger primes, there are far
   between the exclusion values, reducing this advantage, perhaps even
   turning it into a disadvantage.

Enough for now. Have fun. I did.

Best
Thorkil Naur

Attachment: Sieve1_20210507_1322.zip
Description: Zip compressed data