lua-l archive

It was thus said that the Great Andrew Gierth once stated:
> >>>>> "Coda" == Coda Highland <chighland@gmail.com> writes:
> 
>  Coda> Your discovery that it can't be done without loops is also fairly
>  Coda> accurate. CSV parsing is one of the classic examples of "you
>  Coda> really shouldn't try to do that with a regexp". If it's possible
>  Coda> for values to CONTAIN quotes (i.e. by escaping) instead of just
>  Coda> being DELIMITED by them, it's actually impossible (unless you use
>  Coda> some Perlisms that go beyond the technical formalism of regular
>  Coda> expressions).
> 
> Nonsense; CSV is clearly a regular language even when allowing quotes
> inside the values.
> 
> Here is the definition from RFC4180 (excluding the obvious terminals):
> 
>   file = [header CRLF] record *(CRLF record) [CRLF]
>   header = name *(COMMA name)
>   record = field *(COMMA field)
>   name = field
>   field = (escaped / non-escaped)
>   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
>   non-escaped = *TEXTDATA

  Well, TEXTDATA wasn't all that obvious to me. It turns out to be printable
ASCII minus the double quote and comma, so it excludes CR and LF as well.

  But the grammar above could be dropped nearly verbatim (with some minimal
translation) into re, the module that ships with LPeg, which would be more
readable than:

> ^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$

  Gesundheit [1].
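
  For anyone who wants to poke at the quoted expression, here is a minimal
sketch that runs it unchanged through Python's standard re module. The names
CSV_RE and is_csv are made up for illustration and are not from the thread.
Python's engine is a backtracking one, but the pattern uses only alternation,
character classes and repetition, so it only shows that the expression accepts
or rejects whole inputs as claimed. It does not split records into fields.

```python
import re

# The expression quoted above, copied verbatim. CSV_RE and is_csv() are
# hypothetical names introduced only for this sketch.
CSV_RE = re.compile(
    r'^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$'
)


def is_csv(text: str) -> bool:
    """Return True if the whole of text matches the quoted expression."""
    # fullmatch() rather than match(): Python's `$` also matches just
    # before a trailing "\n", which would let a bare LF line ending through.
    return CSV_RE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        'a,b,c\r\n1,2,3\r\n',        # plain fields, CRLF-terminated records
        '"he said ""hi""",x',        # doubled quotes inside a quoted field
        '"multi\r\nline",ok\r\n',    # CRLF inside a quoted field
        'a"b,c',                     # stray quote in an unquoted field
        '"unterminated,x',           # quoted field that never closes
    ]
    for sample in samples:
        print(repr(sample), is_csv(sample))
```

  The first three samples are accepted and the last two are rejected, which is
the behaviour the ABNF asks for: a quote may appear inside a field only when
the field is quoted and the quote is doubled.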

  -spc

[1]	For non-US readers, it is customary when someone sneezes to say
	"gesundheit" [2].  No, I don't know why or where that comes from.

[2]	German for "healthiness" with the meaning "good health".