Re: Patterns

lua-l archive
Subject: Re: Patterns
From: Sean Conner <sean@...>
Date: Wed, 14 Dec 2022 20:08:30 -0500
It was thus said that the Great Tristan Kohl once stated:
> Hi folks,
> 
> I am implementing a module that mimics a MQTT broker. For this I am
> transforming topics passed to the register function so I can run
> string.match() on incoming messages.
> 
> Right now I am stuck on how to correctly handle the multilevel wildcard:
> http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/os/mqtt-v3.1.1-os.html#_Toc398718107
> 
> Suppose the topic filter "/sport/#" and these published message topics:
> /sport
> /sport/racing
> /sport/racing/champion
> /sporting
> 
> Only the first three are allowed to match, so "/sport.*" is not an
> option as that would match the fourth one as well. On the other hand
> "/sport/.*" would not match the first one.
> 
> Any ideas?

  Yes.  LPEG.

  Yes, there is a steep learning curve, but it can do this type of parsing. 
Here's the annotated code for this, using your examples, plus several from
the page linked.  This will support both the multilevel wildcard, plus the
single level wildcard.  It may not handle every case, but it does handle all
the cases I found.

-- [ code ]-----------------------------------------------------------------
-- First, we load LPEG, and pull the functions used into local variables so
-- we don't have lpeg.P() and lpeg.R() all over the place.
-- -------------------------------------------------------------------------

local lpeg = require "lpeg"
local Cc   = lpeg.Cc	-- constant capture
local Cf   = lpeg.Cf	-- folding capture
local P    = lpeg.P	-- match literal text
local R    = lpeg.R	-- match a range of characters

-- -----------------------------------------------------------------------
-- Declare the LPEG pattern filter.  I'm using a local scope to hide some
-- definitions we'll be using.  This is just a particular quirk of mine.
-- ------------------------------------------------------------------------

local filter do

  -- ---------------------
  -- Match a literal '/'
  -- --------------------
   
  local separator = P'/'
  
  -- ------------------------------------------------------------------------
  -- Match a topic name.  This assumes the topic consists of alphanumeric
  -- characters.  If there are others allowed (like space), you can extend
  -- this to include the other characters.  The bit
  --
  --	* (#separator + P(-1))
  --
  -- is there to ensure that the following character is either a slash
  -- (without consuming it) or no more input in the string.  You can
  -- read '*' as AND and '+' as OR, with the parentheses grouping the
  -- subexpression.
  -- -------------------------------------------------------------------------
  
  local topic = R("AZ","az","09")^1 * (#separator + P(-1))
  
  -- -----------------------------------------------------------------------
  -- Match a single topic wildcard character.  Much like the topic, we're
  -- also checking to see if the following character is a slash (without
  -- consuming it) or the end of the string.
  -- -----------------------------------------------------------------------
  
  local single = P'+' * (#separator + P(-1))
  
  -- ---------------------------------------------------------------------
  -- Match the multiple topic wildcard character.  We're checking for either
  -- a '/#' or a '#'.  This will handle some special cases later on in the
  -- code.
  -- -----------------------------------------------------------------------
 
  local multi = (P"/#" + P'#') * P(-1)
  
  -- ---------------------------------------------------------------------
  -- Now for the mind bending stuff.  We'll be parsing the string with the
  -- wildcard characters and constructing an LPEG expression to parse a
  -- topic string to tell us if it's a match or not.  So we'll take a string
  -- like:
  --
  --		/sport/#
  --
  -- and by parsing it with the following, construct an LPEG pattern that
  -- can be used to see if '/sport' and '/sporting' will match the pattern.
  -- There will be an example of this below.
  --
  -- The first expression will match the separator in the input string, and
  -- return an LPEG pattern that will match the separator.
  -- ----------------------------------------------------------------------

  local csep = separator / function() return separator end
  
  -- ---------------------------------------------------------------------
  -- This expression will match a single topic in the input string, and
  -- return an LPEG pattern that will match that exact string.
  -- ---------------------------------------------------------------------
  
  local ctopic = topic / function(c) return P(c) end
  
  -- ---------------------------------------------------------------------
  -- This expression will match the single topic wildcard character, and
  -- return an LPEG expression that matches at most one generic topic.
  -- ---------------------------------------------------------------------
  
  local csingle = single / function() return topic^-1 end
  
  -- ---------------------------------------------------------------------
  -- This expression matches the multitopic wildcard character and will
  -- return an LPEG expression that will match zero or more repeated
  -- sequences of separator character and topic, followed by the end of
  -- the string.
  -- ----------------------------------------------------------------------
  
  local cmulti  = multi
                / function() return (separator * topic)^0 * P(-1) end

  -- ----------------------------------------------------------------------
  -- This expression will construct our parser from the input string.  We
  -- first check for just a plain '#' and return an LPEG expression that
  -- will check for separators and topics.  We special case this here
  -- because it's easier to do this.
  -- -----------------------------------------------------------------------

  filter = (P"#" * P(-1))
         / function() return (separator^-1 * topic)^0 * P(-1) end
         
  -- -----------------------------------------------------------------------
  -- If our input isn't just a '#' then we parse the entire string using a
  -- folding capture.  This will capture multiple patterns and it's up to
  -- the given function (second parameter) to combine them.  So here we
  -- check that the next two characters don't match '/#' (which indicates
  -- the rest of the string is a multitopic wildcard), then generate either
  -- an expression to match a topic, a single wildcard topic, or a separator
  -- and combine it with our slowly accumulating pattern.  We then check to
  -- see if we have an optional multitopic wildcard character and finally
  -- end our LPEG expression with a check for the end of the string (the
  -- 'Cc(P(-1))' bit).  Thus, we now have a parser based upon our input
  -- string.
  -- ------------------------------------------------------------------------
         
         + Cf(  
               (-P"/#" * (ctopic + csingle + csep))^0
               * cmulti^-1
               * Cc(P(-1)),
               function(a,r)
                 return a * r
               end
             ) * P(-1)
end

-- [ end of code ]--------------------------
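
  To make the folding step less abstract, here is a minimal sketch, separate
from the code above, of the same trick on a toy input.  The names 'piece' and
'build' and the a/b to x/y mapping are made up for illustration; only
lpeg.Cf, lpeg.P and function captures are taken from the real code.

```lua
local lpeg = require "lpeg"
local Cf   = lpeg.Cf	-- folding capture
local P    = lpeg.P	-- match literal text

-- Each input character yields an LPEG pattern as its capture value
-- (hypothetical mapping: 'a' -> match "x", 'b' -> match "y").
local piece = P"a" / function() return P"x" end
            + P"b" / function() return P"y" end

-- Cf takes the first capture as the accumulator and calls the function
-- once for each later capture, so the result is one combined pattern.
local build = Cf(piece^1,function(acc,nxt) return acc * nxt end) * P(-1)

local patt = build:match "ab"	-- behaves like P"x" * P"y"
print(patt:match "xy")		-- 3
print(patt:match "xx")		-- nil
```

  The filter above does the same thing, except that the pieces are
separators, topics and wildcards instead of single letters.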

  So, for your example above, if you run:
  
	local m1 = filter:match "/sport/#"
	print(m1:match "/sport")
	print(m1:match "/sport/racing")
	print(m1:match "/sport/racing/champion")
	print(m1:match "/sporting")
	
you should see:

	7
	14
	23
	nil

  The first three match, the fourth one doesn't, because it returned nil. 
Another example:

	local m1 = filter:match("sport/tennis/+")
	print(m1:match "sport/tennis/player1")
	print(m1:match "sport/tennis/player2")
	print(m1:match "sport/tennis/player1/ranking")

	21
	21
	nil
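
  Reading the code above, the pattern compiled from "sport/tennis/+" should
behave like the hand-written one below.  This is a sketch of that reading,
not output taken from the filter itself, and 'patt' is a name made up here.
It also shows that, because of the topic^-1, an empty last level is
accepted, which as far as I can tell is what the MQTT spec expects of '+'.

```lua
local lpeg = require "lpeg"
local P    = lpeg.P
local R    = lpeg.R

local separator = P'/'
local topic     = R("AZ","az","09")^1 * (#separator + P(-1))

-- Hand-built stand-in for the result of filter:match "sport/tennis/+"
local patt = P"sport" * separator * P"tennis" * separator
           * topic^-1 * P(-1)

print(patt:match "sport/tennis/player1")		-- 21
print(patt:match "sport/tennis/")			-- 14
print(patt:match "sport/tennis/player1/ranking")	-- nil
```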

Yet one more:

	local m1 = filter:match("+/tennis/#")
	print(m1:match "news/tennis")
	print(m1:match "news/tennis/mcenroe")
	print(m1:match "sports/tennis/williams")
	print(m1:match "sports/tennis/williams/ranking")

	12
	20
	23
	31
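
  The numbers printed in these examples are positions: when an LPEG pattern
has no captures, match returns the index of the first character after the
matched text, or nil on failure.  If all you need is a yes/no answer, a
small wrapper will do.  A minimal sketch; 'matches' and 'exact' are
hypothetical names, and 'exact' merely stands in for a compiled filter:

```lua
local lpeg = require "lpeg"
local P    = lpeg.P

-- Stand-in for a compiled filter: the literal topic, then end of input.
local exact = P"sport/tennis" * P(-1)

-- Turn "position or nil" into a plain boolean.
local function matches(patt,topic)
  return patt:match(topic) ~= nil
end

print(exact:match "sport/tennis")	-- 13 (12 characters matched, plus 1)
print(matches(exact,"sport/tennis"))	-- true
print(matches(exact,"sport/golf"))	-- false
```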

And if you try to compile an invalid filter (here the '#' is not the last
level), you'll get nil instead of an LPEG expression:

	assert(not filter:match("sport/tennis/#/player1"))

  The filter expression can also split the input topic string and return an
array of topics instead of the position where the parsing stopped, but I'll
leave that as an exercise for the reader.
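
  One possible starting point for that exercise, as a sketch only: a
standalone pattern (not wired into the filter above) that splits a topic
into its levels with lpeg.C and lpeg.Ct.  The names 'level' and 'split' are
made up, and it assumes the same alphanumeric-only levels as the code above.

```lua
local lpeg = require "lpeg"
local C    = lpeg.C	-- simple capture
local Ct   = lpeg.Ct	-- table capture
local P    = lpeg.P
local R    = lpeg.R

local separator = P'/'

-- A level may be empty, so a leading '/' yields an empty first entry.
local level = C(R("AZ","az","09")^0)

-- Collect every level into one table; fail unless all input is consumed.
local split = Ct(level * (separator * level)^0) * P(-1)

local levels = split:match "sport/tennis/player1"
print(#levels)				-- 3
print(table.concat(levels,","))		-- sport,tennis,player1
```

  A topic containing characters outside the assumed set, such as a space,
makes split:match return nil.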

  -spc (It's been a while since an LPEG topic popped up)
References:
- Patterns, Tristan Kohl