Lazy Sort

wiki

Inside quick sort, there is a binary tree waiting to be discovered.

The basic quick sort algorithm looks like this:

To quicksort(Set)
  If Set is empty, then return an empty vector
  Otherwise
    Select a value in Set, Pivot
    Make Left from all values in Set less than Pivot
    Make Right from all values in Set greater than Pivot
    Return the concatenation of Quicksort(Left), Pivot, Quicksort(Right)

The binary tree becomes quite obvious if the concatenation in the last step is replaced by making a binary tree node. The final vector can then be created by flattening the binary tree.

Similarly, we could compute factorial(n) by first making a vector of integers from 1 to n and then multiplying the elements of the vector.

In both cases, we have a kind of "virtual" intermediate structure, which we actually eliminate in practice by incorporating it into program flow. [1] However, it can be useful sometimes to reveal the intermediate structure.

For example, suppose we have a very large vector of family incomes, say of a million families. We want to compute some measure of income inequality; a common one is to find the ratio between the average income of the highest quintile (that is, the 20% of the population with the highest incomes) and the lowest quintile (that is, the 20% of the population with the lowest incomes).

To compute this value, we need to semi-sort the vector of incomes. Ideally, we want to just sort it enough so that we can extract the (unordered) first and last quintiles. One general-purpose way of doing this is to run the quick sort algorithm lazily; first, we stop when we reach the node corresponding to the 20% point; and then we start again, stopping when we reach the 80% point. To do that (in general), we need keep at least part of the implicit binary tree around.

One way of doing this is to change quicksort so that the recursion is replaced by a "promise"; that is, instead of actually doing the recursion, we make a function closure which when called will do the next recursion step.

The Lua distribution includes sort.lua which has a simple implementation of quick sort; slightly simplified, the core is as follows:

function qsort(vec, low, high)
  if low < high then
    local middle = partition(vec, low, high)
    qsort(vec, low, middle-1)
    qsort(vec, middle+1, high)
  end
end

I've separated out the partition function and given variables possibly more meaningful names to make the algorithm a little clearer.

The lazy version looks something like this:

-- Ensure that that the element at i is in the right position,
-- and return a closure which can be used for continuing the sort.
function quicksorter(i, vec, low, high)
  if low >= high then
    return quicksorter
  else -- low < high
    -- partition the vector and initialize the child closures
    local middle = partition(vec, low, high)
    local left, right = quicksorter

    -- Create the promise
    local function self(i, vec, low, high)
      if i < middle then
        left = left(i, vec, low, middle-1)
        return self
      elseif i > middle then
        right = right(i, vec, middle+1, high)
        return self
      end
    end
    
    -- Force the promise until i is in the right position
    return self(i, vec, low, high)
  end
end

function lazysort(vec, low, high)
  local sorter = quicksorter
  return function(i)
    sorter = sorter(i, vec, low, high)
    return vec[i]
  end
end

I haven't actually tested the code above, but I did write a real version of lazysort; you can download it from [here]. The real version includes a couple of efficiencies: first, it uses an insertion sort for small ranges; and, second, it merges the binary tree so that it is always complete.

The sample code includes a test suite and benchmark, and some commented out pieces which make it possible to view the "virtual" binary tree, by recursing down the closures with a sentinel index value. I used that to make the following illustration of how it works:

First, let's test it:

function rangemean(v, low, high)
  local sum = v[low]
  for i = low+1, high do sum = sum + v[i] end
  return sum / (high - low + 1)
end

function disparity(v)
  local getter = lazysort(v)
  local bottom = math.ceil(#v / 5)
  getter(bottom)
  local top = math.floor(#v * 4 / 5)
  getter(top)
  return rangemean(v, top, #v) / rangemean(v, 1, bottom)
end

> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); print(disparity(test)); print(os.clock() - now)
8.9768793870336
0.6796875
>
> -- By comparison:
>
> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); table.sort(test); print (rangemean(test,8e5,1e6)/rangemean(test,1,2e5)); print(os.clock() - now)
8.9768793870336
1.8828125

So even though it's written in Lua and table.sort is written in C, it's three times as fast for this computation. In other words, it's not a toy.

Here's how it works. Let's take a slightly smaller sample:

> test = {}; for i = 1, 500 do test[i] = math.random(1e4) end
> getter = lazysort(test)
> =getter(100)
1954
>
> -- By uncommenting the viewing code, we can look at the tree structure:
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Leaf [250, 501) Unsorted
+-Sorted [501, 500)
> 
> -- "Partitioned" would be a better word than "Unsorted". 
> -- Now, let's ask for another element, some distance away
>
> =getter(400)
7877
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
|   +-Leaf [250, 382) Unsorted
|   +-Sorted [382, 383)
|   +-Node [383, 501)
|     +-Node [383, 441)
|     | +-Leaf [383, 391) Unsorted
|     | +-Sorted [391, 408)
|     | +-Leaf [408, 441) Unsorted
|     +-Sorted [441, 442)
|     +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> -- Leaf [250, 501) has now been expanded down to element 400.
>
> -- Now, let's watch as it merges tree nodes.
>
> =getter(387)
7546
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
|   +-Leaf [250, 382) Unsorted
|   +-Sorted [382, 408)
|   +-Node [408, 501)
|     +-Leaf [408, 441) Unsorted
|     +-Sorted [441, 442)
|     +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> Leaf [383, 391] has been fully sorted, so it can be merged into it's parent, Node [383, 441).
> That in turn creates another merge, leaving the tree as shown above.

[1] For a slightly more academic (but still accessible) discussion of this, read Lex Augusteijn's interesting paper on [Sorting Morphisms.]

RecentChanges · preferences
edit · history
Last edited May 12, 2008 12:13 pm GMT (diff)