# Generic Input Algorithms  wiki

### Generic Functions and Algorithms in Lua

This describes the facilities of the `func` library, which was designed to make working with input more intuitive. The source can be found at Files:wiki_insecure/func.lua.

There are two special input iterators, `numbers()` and `words()`, which work like the very useful `io.lines()` iterator. To print all words found in the standard input:

```-- words.lua
require 'func'
for w in words() do
print(w)
end
```
To test this out on some text, you would say at the OS command prompt:
```\$ lua words.lua < test.txt
```
Printing out the values generated by an iterator is a very common operation, so `func` supplies the very convenient function `printall()`. It writes out all members of a sequence to standard output. By default, it puts out 7 items per line, separated by spaces, but you can optionally change these. In this case we do want each value on its own line:
```printall(words(),'\n')
```
`numbers()` creates a sequence of all numbers found in its input. For example, to sum all input numbers:
```require 'func'
local s
for x in numbers() do
s = s + x
end
print(s)
```
Summation is a common operation when analyzing data so `func` defines a generic `sum()` function. It returns the sum and the number of fields, so it's easy to calculate the average.
```local s,n = sum(numbers())
print('average value =',s/n)
```
Note that these iterators look for the appropriate patterns, so they don't depend on words or numbers being separated by spaces. `numbers()` will find everything that looks like a number in a file, and will safely ignore anything else. So it's useful for heavily commented data or output files. These iterators take an optional extra parameter, which can be a file or a string. For example, to print the sum and number of items in a file passed as a command-line parameter:
```f = io.open(arg)
print(sum(numbers(f)))
f:close()
```
It is useful to collect the output of an iterator as a table. Since it's so straightforward and instructive, here is a simplified definition of `copy()`:
```function copy(iter)
local res = {}
local k = 1
for v in iter do
res[k] = v
k = k + 1
end
return res
end
```
The next example makes an array of all the numbers found in a string. Obviously this is a trivial case, but it's common to have to extract numbers from strings, which can be tricky. In particular, this makes sure that they actually get converted - I've been bit more than once by the simple fact that arr['1'] and arr are not the same!
```t = copy(numbers '10 20 30') -- will be {10,20,30}
s = sum(list(t))             -- will be 60
```
Note the table sequence adaptor `list()`, which allows tables to be used as sequences. It's common to operate on arrays with these functions, so if you do pass them a table `list()` will automatically be assumed. To print out an array of numbers in a particular format, one can use something like `printall(t,' ',5,'%7.3f')` to format them nicely. Here's an implementation of the system command sort, that uses the `printall()` function to output each value of a sequence. I can't simply say `table.foreach(t,print)`, because that operation passes both the index and the value, so I would in effect get line numbers as well!
```t = copy(io.lines())
table.sort(t)
printall(t,'\n')   -- try table.foreach(t,print) and see!
```
Using the `sort()` function this becomes a one-liner:
```printall(sort(io.lines()),'\n')
```
You can iterate over part of a sequence with `slice()`. This is passed an iterator, a start index, and an item count. For example, this is a simple version of the `head` command; it shows the first ten lines of its input.
```printall(slice(io.lines(),1,10),'\n')
```
Sometimes we just want to count a sequence; for example this is a complete script that counts all the words in a file:
```require 'func'
print(count(words()))
```
In this form, `count()` is not so useful. But it can be given a function to select the items to be counted. For example, this gives me a rough idea of how many public functions there are in a Lua file. (If I didn't tie the match down to the begining, it would pick up local and anonymous functions as well)
```require 'func'
print(count(io.lines(),matching '^%s*function'))
```
where `matching()` is the following simple function. It creates a closure (a function bound to a local context) which is called for each item in the sequence:
```function matching(s)
local strfind = string.find
return function(v)
return strfind(v,s)
end
end
```

You can of course use any sequence in these operations. If you have the very useful lfs (Lua File System) library loaded, then `t = copy_if(lfs.dir(path),matching '%.cpp\$')` will fill a list with all files with extension `.cpp` from `path`.

Another useful way to modify `count()`'s input is to use `unique()`

```-- number of unique words in a file
print(count(unique(words())))
```
`unique()` is not implemented in the usual way, which requires the sequence to be first sorted. Instead, it uses `count_map()` to create a map where the keys are the items and the values are the counts. The rest is easy once we have `keys()`, which is the complementary brother of `list()`:
```function unique(iter)
local t = count_map(iter)
return keys(t)
end
```
The classic 'count occurances of words in a file' is:
```table.foreach(count_map(words()),print)
```
When comparing two sequences, it's useful to `join()` them together. This will print out the differences between two files:
```for x,y in join(numbers(f1),numbers(f2)) do
print(x-y)
end
```

### An AWK programming style in Lua

Before I discovered Lua, AWK was my favourite language for manipulating text files. (I even managed to convert some colleagues using the slogan 'AWK is the command-line equivalent of Excel'.) To give you a taste, here is a complete AWK program to print out the first and third columns of a file, scaling using the fourth column - note that the loop over all lines is implicit:
```{ print \$1/\$4, \$3/\$4 }
```
The `func` library supplies the iterator `fields()` for this purpose. Here is the equivalent Lua code:
```for x,y,z in fields{1,3,4} do
print(x/z,y/z)
end
```
Here is my current favourite one-liner. It counts how many values in the 7th column are greater than 44000, and is about half the speed of the equivalent AWK program (run using MAWK). That's not bad, considering how optimized AWK is for its specialized task!
```print(count(fields{7},greater_than(44000)))
```
```{ if (\$7 > 44000) k++ } END { print(k) }
```
`fields()` can work with any input delimiter. This reads a set of values from a comma-separated file - note that passing n instead of a list of field ids is equivalent to {1,2,...n}:
```for x,y in fields({1,2},',',f) do ...
```
```for x,y in fields(2,',',f) do ...  --equivalent--
```

### Efficiency in performance and expression

I think it's clear that common operations can be very concisely expressed using this generic programming style, but there are two reservations people tend to have at this point, certainly from my C++ experience with the STL. The first objection is that the functional style is more inefficient. That's certainly true in theory, but how inefficient in practice? For instance, here is a transcript of using the sequence `random()` to create a table with random values:
```> tt = copy(random(10000))
> = sum(tt)
5039.542771691  10000
```
These operations happen pretty instantaneously on my elderly laptop, and I'm only starting to notice at 1e5 items. For 1e6 items, the first operation takes 2.14 seconds, compared to 2.08 seconds for the explicit loop! That goes down to 1.92 if I'm careful to use locals, so the optimal explicit version is:
```local t = {}
local random = math.random
for i = 1,1e6 do
t[i] = random()
end
```
This case shows that there's no compelling speed advantage to doing it the long way. (I chose this example precisely because it doesn't involve file i/o, which tends to dominate the `words()` and `numbers()` run-times.) The advantage is that there is less code to go wrong; the generic programming people believe that explicit loops are "tedious and error-prone", as Stroustrup says.

The second objection is that it leads to weird and unnatural code. This can certainly be the case in C++, which (let's face it) is not really suited to the functional style; there are no closures, and rigid static typing constantly gets in the way, requiring everything to be templates. The style suits Lua much better - doing this in C++ would not be half as readable, even using the Boost Lambda library:

```-- sum of squares of input data using an internal iterator
for_each(numbers(),function(v)
s = s + v*v
end)
```
```-- sum of squares of input data using an external iterator
for v in numbers() do
s = s + v*v
end
```
The idea is not to replace all loops, just the common generic patterns. Such code becomes easier to read, because any explicit loops will stand out more. Lua is particularly suited to this style, which often seems forced in C++.

### Writing Custom Input Objects

If `f` is not a string, then `words(f)` will use the file object `f`. In fact, `f` can be any object which has a `read` method. All that the code assumes is that `f:read()` will return the next line of input text. Here is a more involved example, where I've created a class `Files` which allows us to read from a list of files. The obvious application is to mimic AWK's behaviour, where every file on the command-line becomes part of standard input.
```Files = {}

function Files.create(list)
local files = {}
files.list = {}
local n = table.getn(list)
for i = 1,n do
files.list[i] = list[i]
end
files.open_next = Files.open_next
files:open_next()
return files
end

function Files:open_next()
if self.f then self.f:close() end
local nf = table.remove(self.list,1)
if nf then
self.f = io.open(nf)
return true
else
self.f = nil
return false
end
end

local ret = self.f:read()
if not ret then
if not self:open_next() then return nil end
else
return ret
end
end
```
I need to explain an apparent inconsistency. After praising the joys of loopless programming, there's a classic copy-table loop in `Files.create()`. Lua programs are passed a global table called `arg`, which has the command-line arguments, `arg`,`arg`, etc. But there's also `arg`, the script name, and `arg[-1]` the actual program name. The explicit loop in question is to make quite sure we don't copy those fields!
```files = Files.create(arg)
printall(words(files))
```

### Notes on Implementation and Further Development

Most of `func` is a straightforward variation on the same theme; iterators and functions as closures. Section 7.1 of PiL [ref?] explains the issues well, and I've used the `allwords` example as the basis for `words()` and `numbers()`. `fields()` was first implemented in a naive way, grabbing each field in turn, but later was implemented as one call to `string.find()`, by creating a custom regular expression. For example, if fields 1 and 3 are needed, separated by commas, then the regexp looks like this - fields are defined as anything which is not a comma, and we capture using () the required fields.
```'%s*([^,]+),[^,]+,([^,])'
```
The concept of sequences is very general, which means that it's easy to use the `func` operations with any library which provides an iterator. And that often simplifies code tremendously. For instance, here is how it can be with luasql. Consider the canonical way to access all rows of a query result:
```cur = con:execute 'SELECT * FROM [Event Summaries]'
mag = -9
row = cur:fetch({},'a')
while row do
if row.Magnitude > mag then
mag = row.Magnitude
end
row = cur:fetch(row,'n')
end
cur:close()
```

I can simply this by creating an iterator which keeps track of `row`:

```function rows(cursor)
local row = {}
return function()
return cursor:fetch(row,'a')
end
end

for row in rows(cur) do
if row.Magnitude > mag then
mag = row.Magnitude
end
end
```
That's already a better loop, since we don't have to call `cursor:fetch` twice and look after a local `row`. We can also implement an equivalent to `fields`:
```function column(fieldname,cursor)
local row = {}
return function()
row = cur:fetch(row,'a')
if not row then return nil
else return row[fieldname]
end
end
end

local minm,maxm = minmax(column('Magnitude',cur))
```
There's no longer any explicit loops! It is of course often more efficient to use SQL WHERE clauses to constrain the sequence. The following works, but isn't the optimal way to do the job:
```print(count(column('Magnitude',cur),greater_than(2)))
```

RecentChanges · preferences
edit · history
Last edited July 21, 2007 5:49 pm GMT (diff)