- Subject: Re: Replace specific comma's in a string.
- From: Andrew Gierth <andrew@...>
- Date: Sat, 10 Feb 2018 08:56:30 +0000
>>>>> "Coda" == Coda Highland <chighland@gmail.com> writes:
Coda> Your discovery that it can't be done without loops is also fairly
Coda> accurate. CSV parsing is one of the classic examples of "you
Coda> really shouldn't try to do that with a regexp". If it's possible
Coda> for values to CONTAIN quotes (i.e. by escaping) instead of just
Coda> being DELIMITED by them, it's actually impossible (unless you use
Coda> some Perlisms that go beyond the technical formalism of regular
Coda> expressions).
Nonsense; CSV is clearly a regular language even when allowing quotes
inside the values.
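One way to see this: a language is regular exactly when a finite automaton with a fixed set of states and no stack can recognise it. The following is only a rough sketch of such an automaton in Python. The state names and the `accepts` helper are made up for illustration, and, like the regexp further down, it treats every character other than the quote, comma, CR and LF as TEXTDATA, which is looser than the RFC.

```python
# Hypothetical state names; five states are enough, no stack or counter needed.
FIELD_START, UNQUOTED, QUOTED, QUOTE_SEEN, CR_SEEN = range(5)
ACCEPTING = {FIELD_START, UNQUOTED, QUOTE_SEEN}


def step(state, ch):
    """Return the next state, or None if the input must be rejected."""
    if state == QUOTED:
        # Inside quotes everything is data until the next quote character.
        return QUOTE_SEEN if ch == '"' else QUOTED
    if state == CR_SEEN:
        # Outside quotes a CR must be followed by LF.
        return FIELD_START if ch == "\n" else None
    # Remaining states: FIELD_START, UNQUOTED, QUOTE_SEEN.
    if ch == '"':
        # Opens a quoted field, or is the second half of a doubled quote;
        # a quote in the middle of a bare field is an error.
        return None if state == UNQUOTED else QUOTED
    if ch == ",":
        return FIELD_START
    if ch == "\r":
        return CR_SEEN
    if ch == "\n" or state == QUOTE_SEEN:
        # Bare LF, or stray data after a closing quote.
        return None
    return UNQUOTED


def accepts(text):
    """Run the automaton over text and report whether it ends in an accepting state."""
    state = FIELD_START
    for ch in text:
        state = step(state, ch)
        if state is None:
            return False
    return state in ACCEPTING
```

The doubled quote needs only the one extra QUOTE_SEEN state, which is why escaping by doubling does not take CSV out of the regular languages.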
Here is the definition from RFC4180 (excluding the obvious terminals):
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
which corresponds to this regexp (assuming negated classes such as [^"]
match newlines except where explicitly excluded):
^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$
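For anyone who wants to try it, here is a minimal sketch using Python's `re` module, whose negated character classes do match newlines. The pattern is copied unchanged from above; the `is_csv` helper name is an assumption for illustration, and `fullmatch` is used so that `$` cannot stop short before a trailing bare LF.

```python
import re

# The regexp from the message, verbatim. In a raw string, \r and \n are
# interpreted by the regex engine as CR and LF.
CSV_RE = re.compile(
    r'^(("([^"]|"")*"|[^",\r\n]*)(,"([^"]|"")*"|,[^",\r\n]*)*(\r\n|$))*$'
)


def is_csv(text):
    """Return True if the whole of text matches the CSV regexp."""
    return CSV_RE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        'a,b,c\r\nd,e,f\r\n',        # plain records with trailing CRLF
        '"he said ""hi""",x\r\n',    # doubled quotes inside a quoted field
        '"multi\r\nline",x',         # CRLF inside a quoted field
        'a,"unterminated',           # missing closing quote
        'a"b',                       # quote inside a bare field
    ]
    for s in samples:
        print(repr(s), is_csv(s))
```

Note that this only validates the input; splitting it into fields is a separate step.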
Coda> Meanwhile, gsub is LESS expressive than regexps.
Indeed.
--
Andrew.