Generic Input Algorithms

lua-users home
wiki

Generic Functions and Algorithms in Lua

This describes the facilities of the func library, which was designed to make working with input more intuitive. The source can be found at Files:wiki_insecure/func.lua.

There are two special input iterators, numbers() and words(), which work like the very useful io.lines() iterator. To print all words found in the standard input:

-- words.lua
require 'func'
for w in words() do
  print(w)
end
To test this out on some text, you would say at the OS command prompt:
$ lua words.lua < test.txt
Printing out the values generated by an iterator is a very common operation, so func supplies the very convenient function printall(). It writes out all members of a sequence to standard output. By default, it puts out 7 items per line, separated by spaces, but you can optionally change these. In this case we do want each value on its own line:
printall(words(),'\n')
numbers() creates a sequence of all numbers found in its input. For example, to sum all input numbers:
require 'func'
local s 
for x in numbers() do
  s = s + x
end
print(s)
Summation is a common operation when analyzing data so func defines a generic sum() function. It returns the sum and the number of fields, so it's easy to calculate the average.
local s,n = sum(numbers())
print('average value =',s/n)
Note that these iterators look for the appropriate patterns, so they don't depend on words or numbers being separated by spaces. numbers() will find everything that looks like a number in a file, and will safely ignore anything else. So it's useful for heavily commented data or output files. These iterators take an optional extra parameter, which can be a file or a string. For example, to print the sum and number of items in a file passed as a command-line parameter:
f = io.open(arg[1])
print(sum(numbers(f)))
f:close()
It is useful to collect the output of an iterator as a table. Since it's so straightforward and instructive, here is a simplified definition of copy():
function copy(iter)
  local res = {}
  local k = 1
  for v in iter do
     res[k] = v
     k = k + 1
  end
  return res
end
The next example makes an array of all the numbers found in a string. Obviously this is a trivial case, but it's common to have to extract numbers from strings, which can be tricky. In particular, this makes sure that they actually get converted - I've been bit more than once by the simple fact that arr['1'] and arr[1] are not the same!
t = copy(numbers '10 20 30') -- will be {10,20,30}
s = sum(list(t))             -- will be 60
Note the table sequence adaptor list(), which allows tables to be used as sequences. It's common to operate on arrays with these functions, so if you do pass them a table list() will automatically be assumed. To print out an array of numbers in a particular format, one can use something like printall(t,' ',5,'%7.3f') to format them nicely. Here's an implementation of the system command sort, that uses the printall() function to output each value of a sequence. I can't simply say table.foreach(t,print), because that operation passes both the index and the value, so I would in effect get line numbers as well!
t = copy(io.lines())
table.sort(t)
printall(t,'\n')   -- try table.foreach(t,print) and see!
Using the sort() function this becomes a one-liner:
printall(sort(io.lines()),'\n')
You can iterate over part of a sequence with slice(). This is passed an iterator, a start index, and an item count. For example, this is a simple version of the head command; it shows the first ten lines of its input.
printall(slice(io.lines(),1,10),'\n')
Sometimes we just want to count a sequence; for example this is a complete script that counts all the words in a file:
require 'func'
print(count(words()))
In this form, count() is not so useful. But it can be given a function to select the items to be counted. For example, this gives me a rough idea of how many public functions there are in a Lua file. (If I didn't tie the match down to the begining, it would pick up local and anonymous functions as well)
require 'func'
print(count(io.lines(),matching '^%s*function'))
where matching() is the following simple function. It creates a closure (a function bound to a local context) which is called for each item in the sequence:
function matching(s)
  local strfind = string.find
  return function(v)
    return strfind(v,s)
  end
end

You can of course use any sequence in these operations. If you have the very useful lfs (Lua File System) library loaded, then t = copy_if(lfs.dir(path),matching '%.cpp$') will fill a list with all files with extension .cpp from path.

Another useful way to modify count()'s input is to use unique()

-- number of unique words in a file
print(count(unique(words())))
unique() is not implemented in the usual way, which requires the sequence to be first sorted. Instead, it uses count_map() to create a map where the keys are the items and the values are the counts. The rest is easy once we have keys(), which is the complementary brother of list():
function unique(iter)
  local t = count_map(iter)
  return keys(t)
end
The classic 'count occurances of words in a file' is:
table.foreach(count_map(words()),print)
When comparing two sequences, it's useful to join() them together. This will print out the differences between two files:
for x,y in join(numbers(f1),numbers(f2)) do
  print(x-y)
end

An AWK programming style in Lua

Before I discovered Lua, AWK was my favourite language for manipulating text files. (I even managed to convert some colleagues using the slogan 'AWK is the command-line equivalent of Excel'.) To give you a taste, here is a complete AWK program to print out the first and third columns of a file, scaling using the fourth column - note that the loop over all lines is implicit:
{ print $1/$4, $3/$4 }
The func library supplies the iterator fields() for this purpose. Here is the equivalent Lua code:
for x,y,z in fields{1,3,4} do
   print(x/z,y/z)
end
Here is my current favourite one-liner. It counts how many values in the 7th column are greater than 44000, and is about half the speed of the equivalent AWK program (run using MAWK). That's not bad, considering how optimized AWK is for its specialized task!
print(count(fields{7},greater_than(44000)))
{ if ($7 > 44000) k++ } END { print(k) }
fields() can work with any input delimiter. This reads a set of values from a comma-separated file - note that passing n instead of a list of field ids is equivalent to {1,2,...n}:
for x,y in fields({1,2},',',f) do ...
for x,y in fields(2,',',f) do ...  --equivalent--

Efficiency in performance and expression

I think it's clear that common operations can be very concisely expressed using this generic programming style, but there are two reservations people tend to have at this point, certainly from my C++ experience with the STL. The first objection is that the functional style is more inefficient. That's certainly true in theory, but how inefficient in practice? For instance, here is a transcript of using the sequence random() to create a table with random values:
> tt = copy(random(10000))
> = sum(tt)
5039.542771691  10000
These operations happen pretty instantaneously on my elderly laptop, and I'm only starting to notice at 1e5 items. For 1e6 items, the first operation takes 2.14 seconds, compared to 2.08 seconds for the explicit loop! That goes down to 1.92 if I'm careful to use locals, so the optimal explicit version is:
local t = {}
local random = math.random
for i = 1,1e6 do
   t[i] = random()
end
This case shows that there's no compelling speed advantage to doing it the long way. (I chose this example precisely because it doesn't involve file i/o, which tends to dominate the words() and numbers() run-times.) The advantage is that there is less code to go wrong; the generic programming people believe that explicit loops are "tedious and error-prone", as Stroustrup says.

The second objection is that it leads to weird and unnatural code. This can certainly be the case in C++, which (let's face it) is not really suited to the functional style; there are no closures, and rigid static typing constantly gets in the way, requiring everything to be templates. The style suits Lua much better - doing this in C++ would not be half as readable, even using the Boost Lambda library:

-- sum of squares of input data using an internal iterator
for_each(numbers(),function(v)
    s = s + v*v
end)
-- sum of squares of input data using an external iterator
for v in numbers() do
    s = s + v*v
end
The idea is not to replace all loops, just the common generic patterns. Such code becomes easier to read, because any explicit loops will stand out more. Lua is particularly suited to this style, which often seems forced in C++.

Writing Custom Input Objects

If f is not a string, then words(f) will use the file object f. In fact, f can be any object which has a read method. All that the code assumes is that f:read() will return the next line of input text. Here is a more involved example, where I've created a class Files which allows us to read from a list of files. The obvious application is to mimic AWK's behaviour, where every file on the command-line becomes part of standard input.
Files = {}
 
function Files.create(list)
   local files = {}
   files.list = {}
   local n = table.getn(list)
   for i = 1,n do
      files.list[i] = list[i]
   end
   files.open_next = Files.open_next
   files.read = Files.read
   files:open_next()
   return files
end
 
function Files:open_next()
   if self.f then self.f:close() end
   local nf = table.remove(self.list,1)
   if nf then
      self.f = io.open(nf)
      return true
   else
      self.f = nil
      return false
   end
end
 
function Files:read()
  local ret = self.f:read()
  if not ret then
     if not self:open_next() then return nil end
     return self.f:read()
  else
     return ret
  end
end
I need to explain an apparent inconsistency. After praising the joys of loopless programming, there's a classic copy-table loop in Files.create(). Lua programs are passed a global table called arg, which has the command-line arguments, arg[1],arg[2], etc. But there's also arg[0], the script name, and arg[-1] the actual program name. The explicit loop in question is to make quite sure we don't copy those fields!
files = Files.create(arg)
printall(words(files))

Notes on Implementation and Further Development

Most of func is a straightforward variation on the same theme; iterators and functions as closures. Section 7.1 of PiL [ref?] explains the issues well, and I've used the allwords example as the basis for words() and numbers(). fields() was first implemented in a naive way, grabbing each field in turn, but later was implemented as one call to string.find(), by creating a custom regular expression. For example, if fields 1 and 3 are needed, separated by commas, then the regexp looks like this - fields are defined as anything which is not a comma, and we capture using () the required fields.
'%s*([^,]+),[^,]+,([^,])'
The concept of sequences is very general, which means that it's easy to use the func operations with any library which provides an iterator. And that often simplifies code tremendously. For instance, here is how it can be with luasql. Consider the canonical way to access all rows of a query result:
cur = con:execute 'SELECT * FROM [Event Summaries]'
mag = -9
row = cur:fetch({},'a')
while row do
  if row.Magnitude > mag then 
     mag = row.Magnitude
  end
  row = cur:fetch(row,'n')
end
cur:close()

I can simply this by creating an iterator which keeps track of row:

function rows(cursor)
  local row = {}
  return function()
    return cursor:fetch(row,'a')
  end
end

for row in rows(cur) do
   if row.Magnitude > mag then 
      mag = row.Magnitude    
   end
end
That's already a better loop, since we don't have to call cursor:fetch twice and look after a local row. We can also implement an equivalent to fields:
function column(fieldname,cursor)
  local row = {}
  return function()
    row = cur:fetch(row,'a')
    if not row then return nil 
    else return row[fieldname]
    end
  end
end

local minm,maxm = minmax(column('Magnitude',cur))
There's no longer any explicit loops! It is of course often more efficient to use SQL WHERE clauses to constrain the sequence. The following works, but isn't the optimal way to do the job:
print(count(column('Magnitude',cur),greater_than(2)))

-- SteveDonovan


RecentChanges · preferences
edit · history
Last edited July 21, 2007 6:49 pm GMT (diff)