[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Semantics of empty matches in string.gmatch and string.gsub
- From: Dirk Laurie <dirk.laurie@...>
- Date: Tue, 30 Apr 2013 08:55:26 +0200
2013/4/29 Dirk Laurie <dirk.laurie@gmail.com>:
> Let's say there is no bug in `gmatch` and `gsub`; the choice of rule
> is an implementation detail.
In order to emphasize the above point, I have forked the thread.
Let's look at the question from a mathematical point of view.
Think of a string as a piecewise constant function defined on a closed
interval [0,n]. Students are taught to draw such functions with little
circles or black disks at the discontinuities. For strings, the
difference is immaterial: the way a string is printed, e.g. "abcd",
cannot show it.
Repeated string matching, as in gsub and gmatch, is a way of
partitioning the interval into subintervals with breakpoints only
allowed at the integers. If empty matches are disallowed, as in
LPEG, the difference between open and closed intervals is still
immaterial.
It only becomes important when patterns that can match empty strings
are allowed. Unfortunately, the usual ways of reporting a match,
as a substring or as a pair, do not make that distinction.
For the present argument, that distinction is needed, so imagine
a gmatch-like routine that reports its results as mathematical
intervals, with the usual meaning of round and square brackets.
I.e. the empty string at the start is [0,0], a string consisting
of the first character could be [0,1) or [0,1], a string consisting
of the second character could be any of [1,2), (1,2), (1,2] or [1,2],
etc.
We demand of a routine that does iterated matching that it returns
non-overlapping intervals. This requirement is enough to prevent
an infinite loop. An empty string can only be [i,i], it uses up
the point at i, the next match cannot be [i,i] again.
The following rule may be a good way of prescribing the behaviour of
iterated matching (phrasing but not substance borrowed from the Perl
manual):
The higher-level loops preserve an additional state between iterations:
whether the last match included its endpoint. If so, a zero-length
match is prohibited at that point.
Now look at the four pieces returned by the current `string.gmatch`
for the source ";a;" and pattern "a*".
"" [0,0]
"a" [2,3)
"" [3,3]
"" [4,4]
And the three pieces returned by `sed` or Python's re.subn:
"" [0,0]
"a" [2,3]
"" [4,4]
With this distinction available, the question reduces to:
should "a*" return [2,3) or [2,3]?
Adam made a good point here: a greedy quantifier should make
the longest possible match. One could generalize that as follows:
1. A zero-length match always includes its endpoint.
2. A greedy quantifier always includes its endpoint.
3. A non-greedy quantifier does not include its endpoint, except when
the match is empty.