I have been using LPeg happily for quite a while to parse
comma-separated value (CSV) data, and would like to share a few small
but important tweaks to the example grammar given on the LPeg page [1].
The first is a recommendation Duncan Cross made about a year ago [2],
which I'll reiterate: change the definition of record listed at [1] to
this.

local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

Wrapping the fields in lpeg.Ct ensures they are returned as a single
list (a table) rather than as separate values.  To also parse
tab-separated value (TSV) data, change both field and record to this.

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
             lpeg.C((1 - lpeg.S',\t\n"')^0)
local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
             * (lpeg.P'\n' + -1)
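
As a quick check of the two definitions above, here is a minimal sketch
that runs one comma-separated and one tab-separated line through the
same record pattern.  The sample strings and the names csv_row and
tsv_row are made up for illustration; they are not from the LPeg page.

```lua
local lpeg = require('lpeg')

-- Quoted field ("" stands for an escaped quote) or an unquoted field
-- that stops at a comma, tab, newline, or quote.
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
            + lpeg.C((1 - lpeg.S',\t\n"')^0)

-- lpeg.Ct gathers every field capture into one table.
local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
             * (lpeg.P'\n' + -1)

-- Hypothetical sample lines.
local csv_row = lpeg.match(record, 'a,"b,c",d')
local tsv_row = lpeg.match(record, 'x\ty\tz')

print(#csv_row, csv_row[2])  --> 3  b,c
print(#tsv_row, tsv_row[3])  --> 3  z
```

Because the separator is matched as "comma or tab" at every position, a
line that mixes the two is also accepted; if that matters, a separate
record pattern per delimiter may be the safer choice.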

Finally, to allow spaces around quoted fields, change field to this.

local field = lpeg.P(' ')^0
             * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
             * lpeg.P(' ')^0
             + lpeg.C((1 - lpeg.S',\t\n"')^0)
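
A small sketch of what this version of field does with padding.  Spaces
around a quoted field are dropped, but an unquoted field still keeps any
leading or trailing space, since only the quoted alternative skips it.
The trim helper below is a hypothetical addition (a common Lua idiom),
not part of the grammar.

```lua
local lpeg = require('lpeg')

local field = lpeg.P(' ')^0
            * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
            * lpeg.P(' ')^0
            + lpeg.C((1 - lpeg.S',\t\n"')^0)

local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
             * (lpeg.P'\n' + -1)

-- Hypothetical helper: strip leading and trailing whitespace.
local function trim(s)
  return (s:match('^%s*(.-)%s*$'))
end

local row = lpeg.match(record, '1, "x, y" , 3')

assert(row[2] == 'x, y')     -- padding around the quoted field is gone
assert(row[3] == ' 3')       -- the unquoted field keeps its leading space
assert(trim(row[3]) == '3')  -- trim afterwards if that is unwanted
```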

Here's an example that puts it all together, including three sample
strings that the original grammar doesn't recognize.  (Well, it
recognizes the first one, but it doesn't break it into a table.)

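local lpeg = require('lpeg')

-- A field is either a quoted string (optionally padded with spaces,
-- with "" as an escaped quote) or a run of characters containing no
-- comma, tab, newline, or quote.
local field =
  lpeg.P(' ')^0
  * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
  * lpeg.P(' ')^0
  + lpeg.C((1 - lpeg.S',\t\n"')^0)

-- A record is a list of fields separated by commas or tabs, ending at
-- a newline or at the end of the subject.
local record =
  lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
  * (lpeg.P'\n' + -1)

-- Returns a table of fields, or nil if the line does not match.
local function csv (s)
  return lpeg.match(record, s)
end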

local ex1 = '"a field, containing a comma",123,3.14,2.717'
local ex2 = 'val1\tval2\tval3\t'
local ex3 = '1, "a field, containing two commas, surrounded by space" , 3, 4'

local line = csv(ex1)
assert(line[1] == 'a field, containing a comma')

line = csv(ex2)
assert(line[1] == 'val1')

line = csv(ex3)
assert(line[2] == 'a field, containing two commas, surrounded by space')
EOF (tested with Lua 5.1.3)

In the definition of field that I use every day, where the pattern
above has this,

lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0)

I have lpeg.C instead of lpeg.Cs.  I don't remember why I did this but
it seems to work either way.  Feel free to update the LPeg site with
these amendments.
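
For fields without doubled quotes the two variants should behave the
same.  Going by the LPeg manual's description of lpeg.C (nested captures
are returned after the matched substring), they would be expected to
differ once a quoted field contains "".  A minimal sketch, with a
made-up subject string and hypothetical names with_cs and with_c:

```lua
local lpeg = require('lpeg')

local inner = ((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0

local with_cs = '"' * lpeg.Cs(inner) * '"'  -- substitution capture
local with_c  = '"' * lpeg.C(inner)  * '"'  -- simple capture

-- Hypothetical quoted field containing two escaped quotes.
local subject = '"say ""hi"""'

local cs_values = { lpeg.match(with_cs, subject) }
local c_values  = { lpeg.match(with_c, subject) }

-- Cs yields one value with each "" collapsed to a single quote.
assert(#cs_values == 1 and cs_values[1] == 'say "hi"')

-- C yields the raw text, followed by one extra '"' value per "".
assert(#c_values == 3 and c_values[1] == 'say ""hi""')
assert(c_values[2] == '"' and c_values[3] == '"')
```

Inside lpeg.Ct those extra values would land in the record table as
additional entries, so lpeg.Cs looks like the safer choice whenever the
data can contain escaped quotes.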

   Ken

1. http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
2. http://lua-users.org/lists/lua-l/2007-11/msg00358.html