On Sat, Oct 26, 2013 at 9:23 AM,  <meino.cramer@gmx.de> wrote:
> This code extracts from a file with text in columns,
> use the columns contents of each line as key and count
> their occurence.

Is this real code or pseudo code? I cannot make sense of it.


> for i,v in pairs(fields) do
** Here you have i,v
>   -- "data.txt" the columns with values
>   fh=io.open( "data.txt","r" )
>   index=fields[i]
** And here you get index, which should be in v.
>   first=index[1]
>   last=index[2]
>   for line in fh:lines() do
*** Premature optimization is the root of all evil, but normally when
you have a nested loop you put the heaviest operation in the outer
loop. Here you seem to be rereading the file once for each column.
I've never seen it done this way; the usual approach is to read the
file once and do every operation on each line as it comes. This is
normally much faster and, if you factor the opening of the file out of
the loop, you can also make it work with special files ( like sockets
or stdin ) if the need arises.

>     text=string.sub( line, first, last )
>     if not data[i] then
>         data[i]={}
>     end
>     data[i][text]-data[i][text] and (data[i][text]+1) or 1
*** This is what puzzled me. Shouldn't it have an '=' ?
>   end
>   io.close( fh )
> end
>
> @Tom: I will optimize the code (reading the file) later if all works! :)
Premature and all, but think about acquiring the habit of not reading
files in inner loops.
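
For the counting itself, the single-pass shape could look roughly like
the sketch below. The layout of fields ( label -> {first, last} ) is
taken from your loop; count_columns, the column positions and the
sample lines are made up for illustration, and io.tmpfile() is only
there so the example runs without a real data.txt.

```lua
-- Count how often each column value occurs, reading the input only once.
-- count_columns is a hypothetical helper; fh is any readable file handle.
local function count_columns(fh, fields)
  local counts = {}
  for label in pairs(fields) do
    counts[label] = {}
  end
  for line in fh:lines() do
    for label, index in pairs(fields) do
      local text = string.sub(line, index[1], index[2])
      -- Note the '=': start at 0 when the value has not been seen yet.
      counts[label][text] = (counts[label][text] or 0) + 1
    end
  end
  return counts
end

-- Made-up sample data: two fixed-width columns in a temporary file.
local fields = { AAA = {1, 3}, BBB = {5, 7} }
local fh = assert(io.tmpfile())
fh:write("foo bar\nfoo baz\nqux bar\n")
fh:seek("set") -- go back to the start before reading

local counts = count_columns(fh, fields)
fh:close()
print(counts.AAA["foo"], counts.BBB["bar"], counts.BBB["baz"])
```

The file is opened and closed outside the function, so the same
function would also accept io.stdin.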

> When dumping the resulting table there were many high counts -- that
> is: many column entries were doubled.
> Now I want to reconstruct a new table from the current one and from
> the textual input which links all things together, so one is able to
> "ask the new table" for example "reconstruct all lines, which column
> "AAA" is "17.6955 A".

What you are trying to do in this case is build an index. What you
do for this kind of problem depends on your speed needs and whether
you have enough memory to store all the file lines. I would do
something like...

fh=io.open( "data.txt","r" )

data={}

-- preinitialize data so code below is easier.
for label, index in pairs(fields) do
  data[label]={}
end

for line in fh:lines() do
  for label, index in pairs(fields) do
    -- Note that pairs already gives you the value, no need for fields[label].
    local first, last = index[1], index[2]
    local text = string.sub( line, first, last )
    local coldata = data[label]
    local textdata = coldata[text] or {} -- lines seen so far for this value.
    -- Append the line. You could store a line number or file offset
    -- instead to save space, and then reread the file later.
    textdata[#textdata+1] = line
    coldata[text] = textdata -- In case it was just created above.
  end
end

io.close( fh )

Now you have a list of all the lines with a particular value, stored
under the column and value: data["AAA"]["17.6955 A"] would answer your
query.

If the file is very big you normally store only the line numbers, and
to do the query you first collect all the line numbers for your
search, sort them, and then rewind and reread the file extracting the
needed lines. Or, if you are 100% sure it's a random access file ( a
regular disk file ), you can store the offset of each line within the
file and seek straight to it ( this will be faster if you need to
extract a few lines from a big file ).
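
A rough sketch of the offset variant, assuming the handle is seekable.
The names index_offsets and fetch_lines, the column positions and the
sample file are all invented for the example; only fh:seek and fh:read
are standard Lua.

```lua
-- Build an index of byte offsets instead of whole lines (sketch).
-- Assumes fh is seekable, i.e. a regular disk file.
local function index_offsets(fh, fields)
  local index = {}
  for label in pairs(fields) do
    index[label] = {}
  end
  while true do
    local offset = fh:seek("cur") -- position of the line about to be read
    local line = fh:read("*l")
    if not line then break end
    for label, range in pairs(fields) do
      local text = string.sub(line, range[1], range[2])
      local offsets = index[label][text] or {}
      offsets[#offsets + 1] = offset
      index[label][text] = offsets
    end
  end
  return index
end

-- Fetch the lines for one column value by seeking to each stored offset.
local function fetch_lines(fh, index, label, value)
  local result = {}
  for _, offset in ipairs(index[label][value] or {}) do
    fh:seek("set", offset)
    result[#result + 1] = fh:read("*l")
  end
  return result
end

-- Made-up sample file, written under a temporary name for the demonstration.
local name = os.tmpname()
local out = assert(io.open(name, "w"))
out:write("foo bar\nfoo baz\nqux bar\n")
out:close()

local fields = { AAA = {1, 3}, BBB = {5, 7} }
local fh = assert(io.open(name, "r"))
local index = index_offsets(fh, fields)
local hits = fetch_lines(fh, index, "BBB", "bar")
fh:close()
os.remove(name)
print(#hits, hits[1], hits[2])
```

Because the file is read front to back, the offsets for each value are
already stored in ascending order, so no separate sort is needed before
fetching.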

By the way, this is more or less what databases do with their indexes,
but they store 'tuples' instead of lines.

Note how I put the open and close at the top and bottom; this way I
can split the code like this:

function do_disk_file(filename)
  local fh = assert(io.open(filename,"r"))
  local data = do_handle(fh)
  io.close(fh)
  return data
end

function do_handle(fh)
  -- .... all the code except open / close
  return data
end

In fact, it would be even better to make it work with an iterator, but
this version should be enough to work with a file, a socket or a pipe
( also, think about stdin: it may be an already opened disk file, due
to redirections, for which you do not have a name ).
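
The iterator version might look something like this sketch. build_index
is a made-up name, and the sample columns and lines are invented; the
function only assumes it is handed something that returns one line per
call.

```lua
-- Sketch: build the index from any line iterator, not just a file handle.
local function build_index(lines, fields)
  local data = {}
  for label in pairs(fields) do
    data[label] = {}
  end
  for line in lines do
    for label, index in pairs(fields) do
      local text = string.sub(line, index[1], index[2])
      local textdata = data[label][text] or {}
      textdata[#textdata + 1] = line
      data[label][text] = textdata
    end
  end
  return data
end

-- The same function then serves different sources, for example:
--   build_index(io.lines("data.txt"), fields)   -- named disk file
--   build_index(io.stdin:lines(), fields)       -- stdin, pipe or redirection
-- Here it is fed from a string holding made-up sample data.
local fields = { AAA = {1, 3}, BBB = {5, 7} }
local sample = "foo bar\nfoo baz\nqux bar\n"
local data = build_index(string.gmatch(sample, "[^\n]+"), fields)
print(#data.AAA["foo"], data.BBB["baz"][1])
```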

Also, if you NEED ( not in this case ) to reread a file it's FASTER
and SAFER to rewind ( IIRC this is done with fh:seek("set") in Lua ).
Otherwise someone can do evil things behind the scenes ( like deleting
the file and putting a different one in its place ).
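
A tiny sketch of the rewind, with a temporary file standing in for the
real one ( the sample content is made up ). Both passes go through the
same open handle, so they are guaranteed to see the same file.

```lua
-- Two passes over one open handle, rewinding instead of reopening.
local fh = assert(io.tmpfile())
fh:write("foo bar\nfoo baz\nqux bar\n")

-- First pass: count the lines.
fh:seek("set")
local total = 0
for _ in fh:lines() do
  total = total + 1
end

-- Second pass: rewind the same handle and read again from the start.
fh:seek("set")
local first = fh:read("*l")
fh:close()
print(total, first)
```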

Francisco Olarte.