- Subject: Re: Patterns: Why are anchors not character classes?
- From: Dirk Laurie <dirk.laurie@...>
- Date: Sat, 18 Jul 2015 19:32:06 +0200
2015-07-18 15:59 GMT+02:00 John Hind <john.hind@zen.co.uk>:
> Date: Fri, 17 Jul 2015 14:21:37 +0200 Dirk Laurie <dirk.laurie@gmail.com>
>>The last few posts have completely ignored Roberto's comment.
>
>>2015-07-16 14:53 GMT+02:00 Roberto Ierusalimschy <roberto@inf.puc-rio.br>:
>>>> `[set]` is hardcoded to match one character of the subject or to
>>>> report no match.
>>>
>>> This is not hardcoded only in the code. It is "hardcoded" in the
>>> definition of a character class:
>>>
>>> Lua 5.3 Reference Manual, 6.4.1:
>>> A character class is used to represent a set of characters.
>>>
>>> Set of characters cannot contain empty strings...
>
>>Debating what notation to use for something that cannot be a character
>>class seems related to exterior-designing the velocipede enclosure :-)
>
> This is the kind of theological nit-picking that really winds me up on this
> list ...
It's not merely theological; it is mathematical: it establishes a definition.
Definitions are practical; they are needed to make sure that programs
work.
Look at the actual code from lstrlib.c where the bracket set is tested for.
  static int matchbracketclass (int c, const char *p, const char *ec) {
    int sig = 1;
    if (*(p+1) == '^') {
      sig = 0;
      p++;  /* skip the '^' */
    }
    while (++p < ec) {
      if (*p == L_ESC) {
        p++;
        if (match_class(c, uchar(*p)))
          return sig;
      }
      else if ((*(p+1) == '-') && (p+2 < ec)) {
        p+=2;
        if (uchar(*(p-2)) <= c && c <= uchar(*p))
          return sig;
      }
      else if (uchar(*p) == c) return sig;
    }
    return !sig;
  }
Suppose you use %i or any other %-prefaced notation:
match_class(c,'i') will be invoked. Suppose you use $
or any other magic-ish character: c=='$' will be tested for.
In neither case are you in a position to test for begin or
end of subject. The function sees only the character c and the
pattern; the information about where you are in the subject
is not available.
Therefore you also need to do something in the places where
matchbracketclass is called from. Such as pass the match
state, which does know where you are, as a parameter.
And have a way of returning the information that you are
actually matching a length-zero string, not just one character.
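
As a rough illustration of those two requirements, here is a minimal
standalone sketch. It is not code from lstrlib.c: the names in_set,
match_set_at, NO_MATCH and the flags accepts_begin/accepts_end are
hypothetical, and a set is simplified to a plain list of characters.
The position-aware routine has to receive the subject position and
report a matched length (0 or 1) instead of a yes/no answer.

```c
#include <stddef.h>
#include <string.h>

#define NO_MATCH (-1)  /* hypothetical "no match" result */

/* Position-blind test, same shape as matchbracketclass:
   it sees one character and nothing else. */
static int in_set(int c, const char *set) {
  return c != '\0' && strchr(set, c) != NULL;
}

/* Hypothetical position-aware variant. Returns the matched length
   (1 for a character, 0 for an anchor) or NO_MATCH.
   accepts_begin / accepts_end stand in for anchor members of the set. */
static int match_set_at(const char *subj, size_t len, size_t pos,
                        const char *set,
                        int accepts_begin, int accepts_end) {
  if (pos < len && in_set((unsigned char)subj[pos], set))
    return 1;                      /* ordinary one-character match */
  if (accepts_begin && pos == 0)
    return 0;                      /* empty match at begin of subject */
  if (accepts_end && pos == len)
    return 0;                      /* empty match at end of subject */
  return NO_MATCH;
}
```

Every caller would then have to cope with a result of 0, which is
where the redesign starts to spread.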
But it's not that simple. matchbracketclass is inter alia called
by singlematch. This routine in turn is called in three places.
One of those is max_expand. Let's think about what
string.match("foo","[%iabcde]+") is going to mean. We don't
want an infinite loop, do we? So the second time round it
should not
I could go on. I did go on when I tried implementing this notion
because there are other pattern-processing schemes that can
handle it and I agree it would be useful. But my point is that
implementing the suggestion via the character class mechanism
would require deep understanding and considerable redesigning
of the string library.
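
To make the max_expand concern concrete, here is a small standalone
sketch of a greedy "+" loop over an item that may match zero
characters. It is only an assumption about how such a loop could be
guarded, not the lstrlib.c code; step and greedy_plus are hypothetical
names, and the set is simplified to a list of characters plus an
implicit "begin of subject" member.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical single step: 1 if subj[pos] is in the set,
   0 for an empty match at the beginning of the subject, -1 otherwise. */
static int step(const char *subj, size_t len, size_t pos,
                const char *set) {
  if (pos < len && strchr(set, subj[pos]) != NULL) return 1;
  if (pos == 0) return 0;
  return -1;
}

/* Greedy one-or-more repetition starting at pos.
   Returns the end position of the match, or -1 if nothing matched. */
static long greedy_plus(const char *subj, size_t len, size_t pos,
                        const char *set) {
  size_t i = pos;
  int matched_any = 0;
  for (;;) {
    int n = step(subj, len, i, set);
    if (n < 0) break;
    matched_any = 1;
    if (n == 0) break;  /* guard: an empty match consumes nothing, so
                           repeating it would never terminate */
    i += (size_t)n;
  }
  return matched_any ? (long)i : -1;
}
```

Without the n == 0 guard, the loop would sit at position 0 of "foo"
forever, which is the situation the "[%iabcde]+" example points at.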
Let's look again at what you need these for. It's to force
a pattern item that can't match an empty string to match it
if necessary when at the beginning or end of the subject.
So how about introducing another couple of suffixes?
[%b,]< could mean: [%b,]* at the beginning and [%b,]+ elsewhere.
[%b,]> could mean: [%b,]* at the end and [%b,]+ elsewhere.
This can be implemented in the C function `match` with just a few
lines.
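
The intended semantics of the two suffixes can be sketched as follows.
This is a standalone illustration under stated assumptions, not a
patch to `match`: match_anchored_rep is a hypothetical name, the set
is simplified to a plain list of characters, and "at the beginning" /
"at the end" are taken to mean that the item starts matching at
position 0 or at the subject length respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: match set+suffix greedily at subj[pos].
   suffix '<' : like '*' at the beginning of the subject, '+' elsewhere.
   suffix '>' : like '*' at the end of the subject, '+' elsewhere.
   Returns the end position of the match, or -1 on failure. */
static long match_anchored_rep(const char *subj, size_t pos,
                               const char *set, char suffix) {
  size_t len = strlen(subj);
  size_t i = pos;
  int allow_empty;
  assert(pos <= len);
  assert(suffix == '<' || suffix == '>');
  /* Decide once, from the starting position, whether zero
     repetitions are acceptable. */
  allow_empty = (suffix == '<') ? (pos == 0) : (pos == len);
  /* Greedy expansion over characters in the set. */
  while (i < len && strchr(set, subj[i]) != NULL)
    i++;
  if (i == pos && !allow_empty)
    return -1;  /* behaves like '+': at least one character needed */
  return (long)i;
}
```

For example, with the set "," the call
match_anchored_rep("a,b", 0, ",", '<') succeeds with an empty match,
while the same call at position 2 fails because no comma is present
and the position is not the beginning of the subject.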