- Subject: Re: Reading CSV
- From: Sean Conner <sean@...>
- Date: Tue, 3 Dec 2013 17:42:51 -0500
It was thus said that the Great Coda Highland once stated:
> On Tue, Dec 3, 2013 at 11:10 AM, Sean Conner <sean@conman.org> wrote:
> > It was thus said that the Great Geoff Leyland once stated:
> >> Hi,
> >>
> >> What’s the current best option for a CSV (or tab separated, for that
> >> matter) file?
> >>
> >> I’ve had a look at http://lua-users.org/wiki/CsvUtils and
> >> http://lua-users.org/wiki/LuaCsv, searched LuaRocks (nothing came up, but
> >> perhaps I’m using the wrong search term) and looked at Penlight’s
> >> data.read. As far as I can tell, most solutions either:
> >> - read the whole file in one go (constructing a table of all the values
> >> becomes impractical as files get larger)
> >> - read lines with “*l” and so are opinionated about what constitutes a
> >> newline
> >> - don’t handle embedded newlines in quoted fields
> >>
> >> There’s also an LPeg example, but as I understand it, LPeg works on whole
> >> strings, not file streams?
> >
> > Yes, but you can read a line at a time and use LPeg to break the line
> > down. You mentioned that there are issues with what constitutes a newline,
> > but there are ways around that. One method I use is:
>
> You missed the part about handling newlines in quoted fields.
You're right, I did miss that. A similar approach still works, though.
Just define a pattern for a single record (per RFC 4180 [1]):
	local lpeg = require "lpeg"

	TEXTDATA = lpeg.R("\32\33","\35\43","\45\126") -- printable ASCII minus DQUOTE and COMMA
	COMMA    = lpeg.P","
	DQUOTE   = lpeg.P'"'
	CR       = lpeg.P"\r"
	LF       = lpeg.P"\n"
	CRLF     = CR^-1 * LF -- Unix doesn't really use CR

	non_escaped = lpeg.C(TEXTDATA^0)
	escaped     = DQUOTE
	            * lpeg.Cs((TEXTDATA + COMMA + CR + LF + DQUOTE * DQUOTE / '"')^0)
	            * DQUOTE -- Cs collapses an embedded "" down to a single "
	field       = escaped + non_escaped

	record = lpeg.Cf(
	           lpeg.Ct"" * field * (COMMA * field)^0,
	           function(t,v)
	             t[#t + 1] = v
	             return t
	           end
	         )
	         * CRLF -- assuming all lines end with CRLF (or just LF)
	         * lpeg.Cp()
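To check that this handles the problem case, a record with an embedded
newline inside a quoted field should come back as a single field
(assuming the definitions above are in scope):

	local items,newpos = record:match('one,"two\r\nlines",three\r\n')
	-- items[1] == "one"
	-- items[2] == "two\r\nlines"  -- embedded newline kept in one field
	-- items[3] == "three"
	-- newpos is the position just past the record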
Now, you just need to ensure you have at least one line's worth of data.
Something like:
	do
	  local buffer = ""
	  local pos    = 1

	  function next_record(file)
	    local items,newpos = record:match(buffer,pos)
	    if items == nil then
	      -- ----------------------------------------------
	      -- not enough data for a record, replenish buffer:
	      -- first, shift the unprocessed buffer down, then
	      -- append more data.
	      -- ----------------------------------------------
	      buffer = buffer:sub(pos,-1)
	      pos    = 1
	      local data = file:read(65536) -- adjust to taste
	      -- -------------------------------------------------------------------
	      -- if there is no data, just append "\n\n" to (maybe) end a
	      -- partial record and to mark an empty record.  Otherwise, append the
	      -- data we just read.
	      -- -------------------------------------------------------------------
	      if data == nil then
	        buffer = buffer .. "\n\n"
	      else
	        buffer = buffer .. data
	      end
	      -- ---------
	      -- try again
	      -- ---------
	      return next_record(file)
	    end
	    pos = newpos
	    return items
	  end
	end
	while true do
	  local items = next_record(file) -- file is an already-open input file
	  if items[1] == "" then break end
	  -- process items
	end

(The loop variable is named "items" rather than "record" to avoid
shadowing the LPeg pattern of the same name.)
Basically, you read in a large chunk of data, process it record by record
until you can't, then read in more data.  That way, you never load the
entire file into memory, and you can still process it with LPeg.
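Wiring it all together looks something like this (a sketch; "input.csv"
is just a stand-in filename):

	local file = assert(io.open("input.csv","rb"))
	while true do
	  local items = next_record(file)
	  if items[1] == "" then break end     -- empty record marks EOF
	  print(table.concat(items," | "))     -- or whatever processing you need
	end
	file:close()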
-spc (Again, code is MIT licensed)
[1] http://tools.ietf.org/html/rfc4180