# Lazy Sort

Inside quick sort, there is a binary tree waiting to be discovered.

The basic quick sort algorithm looks like this:

```
To quicksort(Set)
    If Set is empty, then return an empty vector
    Otherwise
        Select a value in Set, Pivot
        Make Left from all values in Set less than Pivot
        Make Right from all values in Set greater than Pivot
        Return the concatenation of Quicksort(Left), Pivot, Quicksort(Right)
```

The binary tree becomes quite obvious if the concatenation in the last step is replaced by making a binary tree node. The final vector can then be created by flattening the binary tree.
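As a sketch (the function names are mine, not from the original), the tree-building variant might look like this in Lua:

```lua
-- Quicksort with the intermediate binary tree made explicit:
-- first build the tree, then flatten it back into a vector.
-- (Unlike the pseudocode above, values equal to the pivot are
-- kept, on the right, rather than dropped.)
local function build(set)
  if #set == 0 then return nil end
  local pivot = set[1]
  local left, right = {}, {}
  for i = 2, #set do
    local v = set[i]
    if v < pivot then left[#left+1] = v else right[#right+1] = v end
  end
  return { pivot = pivot, left = build(left), right = build(right) }
end

local function flatten(node, out)
  out = out or {}
  if node then
    flatten(node.left, out)
    out[#out+1] = node.pivot
    flatten(node.right, out)
  end
  return out
end

local function treesort(set) return flatten(build(set)) end
```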

Similarly, we could compute factorial(n) by first making a vector of integers from 1 to n and then multiplying the elements of the vector.
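In Lua, that roundabout factorial might read (my sketch, purely to make the intermediate vector visible):

```lua
-- factorial(n) via an explicit intermediate vector of 1..n;
-- a direct implementation would fuse the two loops into one.
local function factorial(n)
  local v = {}
  for i = 1, n do v[i] = i end              -- build the "virtual" structure
  local product = 1
  for _, x in ipairs(v) do product = product * x end
  return product
end
```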

In both cases, we have a kind of "virtual" intermediate structure, which we actually eliminate in practice by incorporating it into program flow.  However, it can be useful sometimes to reveal the intermediate structure.

For example, suppose we have a very large vector of family incomes, say of a million families. We want to compute some measure of income inequality; a common one is to find the ratio between the average income of the highest quintile (that is, the 20% of the population with the highest incomes) and the lowest quintile (that is, the 20% of the population with the lowest incomes).

To compute this value, we need to semi-sort the vector of incomes. Ideally, we want to sort it just enough that we can extract the (unordered) first and last quintiles. One general-purpose way of doing this is to run the quick sort algorithm lazily: first, we stop when we reach the node corresponding to the 20% point; and then we start again, stopping when we reach the 80% point. To do that (in general), we need to keep at least part of the implicit binary tree around.

One way of doing this is to change quicksort so that the recursion is replaced by a "promise"; that is, instead of actually doing the recursion, we make a function closure which when called will do the next recursion step.
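The delaying trick in isolation looks like this (my generic sketch, not code from the article; the `quicksorter` in the article returns a fresh continuation rather than caching a result, but the idea of packaging deferred work in a closure is the same):

```lua
-- A promise is just a closure; forcing it runs the deferred work
-- once and caches the result.
local unpack = table.unpack or unpack  -- Lua 5.2+ / 5.1 compatibility

local function delay(f, ...)
  local args, done, result = {...}, false, nil
  return function()                    -- forcing the promise
    if not done then
      result = f(unpack(args))
      done = true
    end
    return result
  end
end
```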

The Lua distribution includes `sort.lua` which has a simple implementation of quick sort; slightly simplified, the core is as follows:

```
function qsort(vec, low, high)
  if low < high then
    local middle = partition(vec, low, high)
    qsort(vec, low, middle-1)
    qsort(vec, middle+1, high)
  end
end
```

I've separated out the `partition` function and given variables possibly more meaningful names to make the algorithm a little clearer.
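The `partition` function itself isn't shown; a standard Lomuto-style version (my sketch, which may differ from the one in `sort.lua`, e.g. in how the pivot is chosen) would be:

```lua
-- Partition vec[low..high] around the pivot vec[high], and return
-- the pivot's final index.  Afterwards everything left of that
-- index is <= the pivot and everything right of it is > the pivot.
function partition(vec, low, high)
  local pivot = vec[high]
  local i = low - 1
  for j = low, high - 1 do
    if vec[j] <= pivot then
      i = i + 1
      vec[i], vec[j] = vec[j], vec[i]
    end
  end
  vec[i+1], vec[high] = vec[high], vec[i+1]
  return i + 1
end
```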

The lazy version looks something like this:

```
-- Ensure that the element at i is in the right position,
-- and return a closure which can be used for continuing the sort.
function quicksorter(i, vec, low, high)
  if low >= high then
    return quicksorter
  else -- low < high
    -- partition the vector and initialize the child closures
    local middle = partition(vec, low, high)
    local left, right = quicksorter, quicksorter

    -- Create the promise
    local function self(i, vec, low, high)
      if i < middle then
        left = left(i, vec, low, middle-1)
      elseif i > middle then
        right = right(i, vec, middle+1, high)
      end
      -- if i == middle, the element is already in place
      return self
    end

    -- Force the promise until i is in the right position
    return self(i, vec, low, high)
  end
end

function lazysort(vec, low, high)
  low, high = low or 1, high or #vec
  local sorter = quicksorter
  return function(i)
    sorter = sorter(i, vec, low, high)
    return vec[i]
  end
end
```

I haven't actually tested the code above, but I did write a real version of lazysort; you can download it from [here]. The real version includes a couple of efficiency improvements: first, it uses an insertion sort for small ranges; and second, it merges the binary tree so that it is always complete.

The sample code includes a test suite and benchmark, and some commented-out pieces which make it possible to view the "virtual" binary tree by recursing down the closures with a sentinel index value. I used that to produce the illustrations below of how it works.

First, let's test it:

```
function rangemean(v, low, high)
  local sum = v[low]
  for i = low+1, high do sum = sum + v[i] end
  return sum / (high - low + 1)
end

function disparity(v)
  local getter = lazysort(v)
  local bottom = math.ceil(#v / 5)
  getter(bottom)
  local top = math.floor(#v * 4 / 5)
  getter(top)
  return rangemean(v, top, #v) / rangemean(v, 1, bottom)
end
```

```
> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); print(disparity(test)); print(os.clock() - now)
8.9768793870336
0.6796875
>
> -- By comparison:
>
> math.randomseed(31415926)
> test = {}; for i = 1, 1e6 do test[i] = math.random(1e4) end
> now = os.clock(); table.sort(test); print (rangemean(test,8e5,1e6)/rangemean(test,1,2e5)); print(os.clock() - now)
8.9768793870336
1.8828125
```

So even though it's written in Lua and table.sort is written in C, it's nearly three times as fast for this computation. In other words, it's not a toy.

Here's how it works. Let's take a slightly smaller sample:

```
> test = {}; for i = 1, 500 do test[i] = math.random(1e4) end
> getter = lazysort(test)
> =getter(100)
1954
>
> -- By uncommenting the viewing code, we can look at the tree structure:
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Leaf [250, 501) Unsorted
+-Sorted [501, 500)
>
> -- "Partitioned" would be a better word than "Unsorted".
> -- Now, let's ask for another element, some distance away
>
> =getter(400)
7877
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
|   +-Leaf [250, 382) Unsorted
|   +-Sorted [382, 383)
|   +-Node [383, 501)
|     +-Node [383, 441)
|     | +-Leaf [383, 391) Unsorted
|     | +-Sorted [391, 408)
|     | +-Leaf [408, 441) Unsorted
|     +-Sorted [441, 442)
|     +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> -- Leaf [250, 501) has now been expanded down to element 400.
>
> -- Now, let's watch as it merges tree nodes.
>
> =getter(387)
7546
> getter""
Root
+-Sorted [1, 1)
+-Node [1, 501)
| +-Node [1, 249)
| | +-Node [1, 169)
| | | +-Leaf [1, 47) Unsorted
| | | +-Sorted [47, 48)
| | | +-Node [48, 169)
| | |   +-Leaf [48, 83) Unsorted
| | |   +-Sorted [83, 103)
| | |   +-Leaf [103, 169) Unsorted
| | +-Sorted [169, 170)
| | +-Leaf [170, 249) Unsorted
| +-Sorted [249, 250)
| +-Node [250, 501)
|   +-Leaf [250, 382) Unsorted
|   +-Sorted [382, 408)
|   +-Node [408, 501)
|     +-Leaf [408, 441) Unsorted
|     +-Sorted [441, 442)
|     +-Leaf [442, 501) Unsorted
+-Sorted [501, 500)
>
> -- Leaf [383, 391) has been fully sorted, so it can be merged into its parent, Node [383, 441).
> -- That in turn triggers another merge, leaving the tree as shown above.
```

For a slightly more academic (but still accessible) discussion of this, read Lex Augusteijn's interesting paper on [Sorting Morphisms].

Last edited May 12, 2008 11:13 am GMT