Re: Patterns: Why are anchors not character classes?

lua-l archive

Subject: Re: Patterns: Why are anchors not character classes?
From: Rena <hyperhacker@...>
Date: Sat, 18 Jul 2015 14:45:18 -0400

On Jul 18, 2015 1:32 PM, "Dirk Laurie" <dirk.laurie@gmail.com> wrote:
>
> 2015-07-18 15:59 GMT+02:00 John Hind <john.hind@zen.co.uk>:
> > Date: Fri, 17 Jul 2015 14:21:37 +0200 Dirk Laurie <dirk.laurie@gmail.com>
> >>The last few posts have completely ignored Roberto's comment.
> >
> >>2015-07-16 14:53 GMT+02:00 Roberto Ierusalimschy <roberto@inf.puc-rio.br>:
> >>>> `[set]` is hardcoded to match one character of the subject or to
> >>>> report no match.
> >>>
> >>> This is not hardcoded only in the code. It is "hardcoded" in the
> >>> definition of a character class:
> >>>
> >>> Lua 5.3 Reference Manual, 6.4.1:
> >>> A character class is used to represent a set of characters.
> >>>
> >>> Set of characters cannot contain empty strings...
> >
> >>Debating what notation to use for something that cannot be a character
> >>class seems related to exterior-designing the velocipede enclosure :-)
> >
> > This is the kind of theological nit-picking that really winds me up on this
> > list ...
>
> It's not merely theological, it is mathematical, it establishes a definition.
> Definitions are practical; they are needed to make sure that programs
> work.
>
> Look at the actual code from lstrlib.c where the bracket set is tested for.
>
> static int matchbracketclass (int c, const char *p, const char *ec) {
>   int sig = 1;
>   if (*(p+1) == '^') {
>     sig = 0;
>     p++;  /* skip the '^' */
>   }
>   while (++p < ec) {
>     if (*p == L_ESC) {
>       p++;
>       if (match_class(c, uchar(*p)))
>         return sig;
>     }
>     else if ((*(p+1) == '-') && (p+2 < ec)) {
>       p+=2;
>       if (uchar(*(p-2)) <= c && c <= uchar(*p))
>         return sig;
>     }
>     else if (uchar(*p) == c) return sig;
>   }
>   return !sig;
> }
>
> Suppose you use %i or any other %-prefaced notation.
> match_class(c,'i') will be invoked. Suppose you use $
> or any other magic-ish character. c=='$' will be tested for.
> In neither case are you in a position to test for begin or
> end of subject. The information about where you are
> in the pattern is not available.
>
> Therefore you also need to do something in the places where
> matchbracketclass is called from. Such as pass the match
> state, which does know where you are, as a parameter.
> And have a way of returning the information that you are
> actually matching a length-zero string, not just one character.
>
> But it's not that simple. matchbracketclass is inter alia called
> by singlematch. This routine in turn is called in three places.
> One of those is max_expand. Let's think about what
> string.match("foo","[%iabcde]+") is going to mean. We don't
> want an infinite loop, do we? So the second time round it
> should not
>
> I could go on. I did go on when I tried implementing this notion
> because there are other pattern-processing schemes that can
> handle it and I agree it would be useful. But my point is that
> implementing the suggestion via the character class mechanism
> would require deep understanding and considerable redesigning
> of the string library.
>
> Let's look again at what you need these for. It's to force
> a pattern item that can't match an empty string to match it
> if necessary when at the beginning or end of the subject.
> So how about introducing another couple of suffixes?
> [%b,]< could mean: [%b,]* at the beginning and [%b,]+ elsewhere.
> [%b,]> could mean: [%b,]* at the end and [%b,]+ elsewhere.
> This can be implemented in C function `match` with just a few
> lines.
>
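
To make the quoted point concrete, here is a cut-down, standalone sketch of the bracket test. It is not the code from lstrlib.c: it handles only literal characters and ranges (no %-classes), and both names, in_bracket_set and set_matches_char, are made up for illustration. The thing to notice is the argument list: the function receives one character and the bounds of the set, and nothing that says where in the subject that character came from.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for matchbracketclass (hypothetical name).
   Literals and ranges only; %-classes are deliberately left out.
   p points at the opening '[', ec at the closing ']'. */
static int in_bracket_set(int c, const char *p, const char *ec) {
  int sig = 1;
  if (*(p + 1) == '^') {
    sig = 0;
    p++;  /* skip the '^' */
  }
  while (++p < ec) {
    if (*(p + 1) == '-' && p + 2 < ec) {
      p += 2;  /* range such as a-z */
      if ((unsigned char)*(p - 2) <= c && c <= (unsigned char)*p)
        return sig;
    }
    else if ((unsigned char)*p == c) return sig;
  }
  return !sig;
}

/* Hypothetical convenience wrapper: finds the first ']' after the
   opening '[' and runs the test. Unlike real Lua patterns, it does not
   treat a ']' in first position as a literal. */
static int set_matches_char(const char *set, int c) {
  const char *ec = strchr(set + 1, ']');
  if (set[0] != '[' || ec == NULL) return 0;
  return in_bracket_set(c, set, ec);
}
```

With this sketch, set_matches_char("[ $]", '$') is true only because the character itself is a dollar sign; there is no input that could mean "end of subject".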

Well, regardless of whether it's classified as a character class or a set or a special case or a fruit or a vegetable, I've often wished I could write a pattern such as "[ ^]%w+[ $]", meaning: match one or more word characters bounded on each side by either a space or the start/end of the string. Though ^ already has a meaning there...
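
For what it's worth, the behaviour I have in mind can be written down independently of any pattern syntax. The following is only a rough sketch in plain C of the intended semantics, not a proposal for lstrlib.c; find_bounded_word is a made-up name, and I'm assuming %w corresponds to isalnum. It returns just the word, without the bounding spaces that the pattern itself would also consume.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the first run of alphanumeric characters
   that is bounded on the left by a space or the start of s, and on the
   right by a space or the end of s. Returns a pointer to the run and
   stores its length in *len, or returns NULL if there is none. */
static const char *find_bounded_word(const char *s, size_t *len) {
  const char *p = s;
  while (*p != '\0') {
    /* left boundary: start of subject or a preceding space */
    int left_ok = (p == s) || (p[-1] == ' ');
    if (left_ok && isalnum((unsigned char)*p)) {
      const char *q = p;
      while (isalnum((unsigned char)*q)) q++;
      /* right boundary: end of subject or a following space */
      if (*q == '\0' || *q == ' ') {
        *len = (size_t)(q - p);
        return p;
      }
      /* No position inside this run can start a match, because its
         left neighbour is a word character; skip the whole run. */
      p = q;
      continue;
    }
    p++;
  }
  return NULL;
}
```

On "foo, bar baz" this skips "foo" (it is followed by a comma) and reports "bar"; on "a-b" it finds nothing.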

Of course it's important to decide on an implementation (assuming it's going to be implemented at all), but I'm starting to feel like we're losing sight of the forest for the trees. The real feature request here is "be able to include 'beginning/end of string' in a character set"; exactly how to implement it is a separate question.
