- Subject: Repeated processing of large datasets
- From: "John Logsdon" <j.logsdon@...>
- Date: Tue, 28 Mar 2017 12:19:31 +0100
Greetings to all
I am processing some large datasets that are currently stored as .csv
files, and I can slurp them all into memory. I only want specific columns.
The datasets are typically a few million records, each with up to 100
columns, of which I am interested in only 20 or so.
So the first thing I do is to slurp it all into memory and discard the
unwanted data thus:
local function readValues(f)
  local Line = f:read("*l")
  if Line ~= nil then
    -- strip CR/LF, collapse runs of spaces, then split on commas
    Line = split(string.gsub(string.gsub(Line, "[\n\r]", ""), " +", " "), ",")
    -- keep only the wanted columns; i1, i2, i3, ... come from the header
    return {Line[i1], Line[i2], Line[i3]}
  end
end
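The split helper isn't shown here; as a minimal sketch of what the code above assumes (a plain single-character separator and no quoted fields), something like this would do:

local function split(s, sep)
  -- naive split: preserves empty fields but does not handle quoted fields
  -- sep is assumed to be a single, non-magic character such as ","
  local fields = {}
  -- append a trailing separator so the last field is captured as well
  for field in string.gmatch(s .. sep, "(.-)" .. sep) do
    fields[#fields + 1] = field
  end
  return fields
end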
where i1, i2, i3, etc. have been pre-calculated from the header line, for example as sketched below.
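Purely for illustration (the column names here are invented), the indices could be looked up from the header line like this:

-- read the header once and find the positions of the wanted columns;
-- "bid", "ask" and "time" are hypothetical names for illustration
local header = split(string.gsub(tickStream:read("*l"), "[\n\r]", ""), ",")
local wanted = { "bid", "ask", "time" }
local idx = {}
for _, name in ipairs(wanted) do
  for j, colName in ipairs(header) do
    if colName == name then
      idx[#idx + 1] = j
      break
    end
  end
end
-- globals, since readValues above reads them as such
i1, i2, i3 = idx[1], idx[2], idx[3]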
Then in the main program I read one line at a time:
local Linez = readValues(tickStream)
while Linez ~= nil and #AllLinez < maxLines do
  table.insert(AllLinez, Linez)
  Linez = readValues(tickStream)
end
So that AllLinez holds the data I need.
The processing is then a matter of looping over AllLinez:
for thisLine = 1, #AllLinez do
  -- locals avoid global-table lookups in the hot loop
  local V1, V2, V3 = unpack(AllLinez[thisLine])
  -- ... and then the data are processed
end
Processing involves a very large number of repeated optimisation steps, so
it is important that the data are handled as efficiently as possible. I
am using LuaJIT, of course.
My question is whether this is an efficient way to process the data, or
whether it would be better to use a database such as SQLite3?
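For comparison, and only as a rough sketch: if the CSV were first imported into an SQLite table (say ticks, with hypothetical columns bid, ask and time), the same in-memory load could be written with the lsqlite3 binding as:

local sqlite3 = require("lsqlite3")
local db = sqlite3.open("ticks.db")

-- let SQLite do the column selection; only the wanted values are returned
local AllLinez = {}
for bid, ask, time in db:urows("SELECT bid, ask, time FROM ticks") do
  AllLinez[#AllLinez + 1] = { bid, ask, time }
end
db:close()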
[apologies that the mail nipped out before completion - finger problem!]
TIA
John
John Logsdon
Quantex Research Ltd
+44 161 445 4951/+44 7717758675