- Subject: Re: Reading CSV
- From: Sean Conner <sean@...>
- Date: Tue, 3 Dec 2013 17:42:51 -0500
It was thus said that the Great Coda Highland once stated:
> On Tue, Dec 3, 2013 at 11:10 AM, Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Geoff Leyland once stated:
> >> Hi,
> >>
> >> What’s the current best option for a CSV (or tab separated, for that
> >> matter) file?
> >>
> >> I’ve had a look at http://lua-users.org/wiki/CsvUtils and
> >> http://lua-users.org/wiki/LuaCsv, searched LuaRocks (nothing came up, but
> >> perhaps I’m using the wrong search term) and looked at Penlight’s
> >> data.read. As far as I can tell, most solutions either:
> >> - read the whole file in one go (constructing a table of all the values
> >> becomes impractical as files get larger)
> >> - read lines with “*l” and so are opinionated about what constitutes a
> >> newline
> >> - don’t handle embedded newlines in quoted fields
> >>
> >> There’s also an LPeg example, but as I understand it, LPeg works on whole
> >> strings, not file streams?
> >
> > Yes, but you can read a line at a time and use LPeg to break the line
> > down. You mentioned that there are issues with what constitutes a newline,
> > but there are ways around that. One method I use is:
>
> You missed the part about handling newlines in quoted fields.
You're right, I did miss that. A similar approach still works, though.
Just define a pattern for a single record (per RFC 4180 [1]):
	local lpeg = require "lpeg"

	TEXTDATA = lpeg.R("\32\33","\35\43","\45\126") -- printable ASCII minus DQUOTE and COMMA
	COMMA    = lpeg.P","
	DQUOTE   = lpeg.P'"'
	CR       = lpeg.P"\r"
	LF       = lpeg.P"\n"
	CRLF     = CR^-1 * LF -- Unix doesn't really use CR

	non_escaped = lpeg.C(TEXTDATA^0)
	escaped     = DQUOTE
	            * lpeg.Cs((TEXTDATA + COMMA + CR + LF + DQUOTE * DQUOTE / '"')^0)
	            * DQUOTE -- Cs collapses an embedded "" down to a single "
	field       = escaped + non_escaped

	record = lpeg.Cf(
	           lpeg.Ct"" * field * (COMMA * field)^0,
	           function(t,v)
	             t[#t + 1] = v
	             return t
	           end
	         )
	         * CRLF -- assuming all lines end with CRLF (or just LF)
	         * lpeg.Cp()
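To check that this handles the problem case, a record with an embedded
newline inside a quoted field should come back as a single field
(assuming the definitions above are in scope):

	local items,newpos = record:match('one,"two\r\nlines",three\r\n')
	-- items[1] == "one"
	-- items[2] == "two\r\nlines"  -- embedded newline kept in one field
	-- items[3] == "three"
	-- newpos is the position just past the record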
Now, you just need to ensure you have at least one line's worth of data.
Something like:
	do
	  local buffer = ""
	  local pos    = 1

	  function next_record(file)
	    local items,newpos = record:match(buffer,pos)
	    if items == nil then
	      -- ----------------------------------------------
	      -- not enough data for a record, replenish buffer:
	      -- first, shift the unprocessed buffer down, then
	      -- append more data.
	      -- ----------------------------------------------
	      buffer = buffer:sub(pos,-1)
	      pos    = 1
	      local data = file:read(65536) -- adjust to taste
	      -- -------------------------------------------------------------------
	      -- if there is no data, just append "\n\n" to (maybe) end a
	      -- partial record and to mark an empty record.  Otherwise, append the
	      -- data we just read.
	      -- -------------------------------------------------------------------
	      if data == nil then
	        buffer = buffer .. "\n\n"
	      else
	        buffer = buffer .. data
	      end
	      -- ---------
	      -- try again
	      -- ---------
	      return next_record(file)
	    end
	    pos = newpos
	    return items
	  end
	end
	while true do
	  local items = next_record(file) -- file is an already-open input file
	  if items[1] == "" then break end
	  -- process items
	end

(The loop variable is named "items" rather than "record" to avoid
shadowing the LPeg pattern of the same name.)
Basically, you read in a large chunk of data, process it record by record
until you can't, then read in more data.  That way, you never load the
entire file into memory, and you can still process it with LPeg.
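Wiring it all together looks something like this (a sketch; "input.csv"
is just a stand-in filename):

	local file = assert(io.open("input.csv","rb"))
	while true do
	  local items = next_record(file)
	  if items[1] == "" then break end     -- empty record marks EOF
	  print(table.concat(items," | "))     -- or whatever processing you need
	end
	file:close()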
-spc (Again, code is MIT licensed)
[1] http://tools.ietf.org/html/rfc4180