- Subject: LPeg for CSV parsing
- From: "Ken Smith" <kgsmith@...>
- Date: Wed, 29 Oct 2008 15:20:22 -0700
I have been using LPeg happily for quite a while for parsing comma
separated value (CSV) data and would like to share a bit of experience
I've gained which resulted in a few small but important tweaks to the
example grammar given on the LPeg page [1]. The first is a
recommendation Duncan Cross made about a year ago [2], which I'll
reiterate: change the definition of record listed at [1] to this.
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
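A small sketch of what that change buys, assuming the field definition from the LPeg page [1]. The names flat, a, b, c and t are only for illustration and do not appear in the original grammar.

```lua
local lpeg = require('lpeg')

-- Field definition as given in the CSV example on the LPeg page.
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
            + lpeg.C((1 - lpeg.S',\n"')^0)

-- Without lpeg.Ct, every field comes back as a separate return value.
local flat = field * (',' * field)^0 * (lpeg.P'\n' + -1)

-- With lpeg.Ct, the fields are collected into a single table.
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

local a, b, c = lpeg.match(flat, 'x,y,z')   -- three separate strings
local t = lpeg.match(record, 'x,y,z')       -- one table holding three strings
```

With the table form the caller does not need to know the number of fields in advance, which is usually what you want for CSV.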
This ensures the fields are returned as a list (a single Lua table)
rather than as separate values. To enable parsing of tab separated
value (TSV) data, both field and record change to this.
local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
lpeg.C((1 - lpeg.S',\t\n"')^0)
local record = lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) *
field)^0) * (lpeg.P'\n' + -1)
Finally, to allow spaces around quoted fields, change field to this.
local field = lpeg.P(' ')^0
* '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
* lpeg.P(' ')^0
+ lpeg.C((1 - lpeg.S',\t\n"')^0)
Here's an example that puts it all together, including three sample
strings that the original grammar doesn't recognize. (Well, it
recognizes the first one but it doesn't break it into a table.)
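local lpeg = require('lpeg')
-- A field is either a quoted string (optionally padded with spaces)
-- or a run of characters free of separators, newlines and quotes.
local field =
  lpeg.P(' ')^0
  * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
  * lpeg.P(' ')^0
  + lpeg.C((1 - lpeg.S',\t\n"')^0)
-- A record is a comma- or tab-separated list of fields, collected
-- into a table, ending at a newline or at the end of input.
local record =
  lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
  * (lpeg.P'\n' + -1)
function csv (s)
  return lpeg.match(record, s)
end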
local ex1 = '"a field, containing a comma",123,3.14,2.717'
local ex2 = 'val1\tval2\tval3\t'
local ex3 = '1, "a field, containing two commas, surrounded by space" , 3, 4'
local line = csv(ex1)
assert(line[1] == 'a field, containing a comma')
line = csv(ex2)
assert(line[1] == 'val1')
line = csv(ex3)
assert(line[2] == 'a field, containing two commas, surrounded by space')
EOF (tested with Lua 5.1.3)
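Two edge cases of the grammar above are worth seeing in isolation. This is only a sketch that repeats the same field and record definitions; the variable names tsv and spaced are mine. As far as I can tell from the grammar, a trailing separator produces a final empty field, and the space skipping applies only to quoted fields, so unquoted fields keep their leading spaces.

```lua
local lpeg = require('lpeg')

-- Same definitions as in the example above.
local field =
  lpeg.P(' ')^0
  * '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"'
  * lpeg.P(' ')^0
  + lpeg.C((1 - lpeg.S',\t\n"')^0)
local record =
  lpeg.Ct(field * ((lpeg.P(',') + lpeg.P('\t')) * field)^0)
  * (lpeg.P'\n' + -1)

-- A trailing tab is followed by an empty (zero-length) field.
local tsv = lpeg.match(record, 'val1\tval2\tval3\t')

-- Spaces around the quoted field are dropped; the space before the
-- unquoted 3 and 4 is kept as part of the field value.
local spaced = lpeg.match(record, '1, "two, commas" , 3, 4')
```

If the leading spaces on unquoted fields matter to you, trimming them after the match is probably the simplest fix.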
In the definition for field that I use every day, where the value for
field has this,
lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0)
I have lpeg.C instead of lpeg.Cs. I don't remember why I did this but
it seems to work either way. Feel free to update the LPeg site with
these amendments.
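For anyone weighing the two, here is a minimal sketch of where lpeg.C and lpeg.Cs differ, based on my reading of the LPeg manual. The names body, with_cs, with_c and input are made up for the illustration. For quoted fields without doubled quotes the two give the same result; with doubled quotes, lpeg.Cs substitutes each "" with a single ", while lpeg.C returns the raw matched text followed by one extra capture per replacement.

```lua
local lpeg = require('lpeg')

-- Body of a quoted field: any non-quote character, or a doubled quote
-- that is replaced by a single quote.
local body = ((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0

local with_cs = '"' * lpeg.Cs(body) * '"'   -- substitution capture
local with_c  = '"' * lpeg.C(body) * '"'    -- simple capture

local input = '"say ""hi"""'

-- Cs yields one value with the doubled quotes collapsed.
local cs_values = { lpeg.match(with_cs, input) }

-- C yields the raw text first, then the nested '"' captures.
local c_values = { lpeg.match(with_c, input) }

-- A field with no doubled quotes comes out the same either way.
local plain_cs = { lpeg.match(with_cs, '"plain"') }
local plain_c  = { lpeg.match(with_c, '"plain"') }
```

Inside lpeg.Ct those extra captures would show up as additional table entries, so lpeg.Cs looks like the safer choice when the data may contain escaped quotes.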
Ken
1. http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
2. http://lua-users.org/lists/lua-l/2007-11/msg00358.html