Frontier Pattern

The "frontier" pattern %f was for a long time an undocumented Lua pattern feature (for the reasons why, see LuaList:2006-12/msg00536.html); it has been described in the reference manual since Lua 5.2. %f[set] matches the empty string at any position where the previous character is not in the set and the next character is in the set.

Functionally, it serves a purpose similar to the \b escape sequence in regular expressions: it "matches" the transition from one set of characters to another without consuming any characters itself.
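
As a small illustration of that zero-width behaviour, the sketch below records where each word in a string starts. The helper name word_starts is made up for this example, and the empty position capture () is used only to report where the frontier matched.

```lua
-- Hypothetical helper: return the 1-based positions at which a run of
-- letters begins. %f[%a] matches the empty string wherever the previous
-- character is not a letter and the next character is one.
local function word_starts(s)
  local positions = {}
  for pos in s:gmatch("%f[%a]()") do
    positions[#positions + 1] = pos
  end
  return positions
end

print(table.concat(word_starts("the QUICK fox"), " "))
--> 1 5 11
```

Unlike \b, a frontier is directional: %f[%a] only finds the start of a word, and the end of a word needs the complementary set, %f[%A].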

Let's consider a fairly straightforward task: find all the words in a string that are entirely in upper-case.

First attempt: %u+

string.gsub ("the QUICK brown fox", "%u+", print)

QUICK

That looks OK: it found the word in all caps. But look at this:

string.gsub ("the QUICK BROwn fox", "%u+", print)

QUICK
BRO

We also found the upper-case part of a word which was only partially capitalized.


Second attempt: %u+%A

string.gsub ("the QUICK BROwn fox", "%u+%A", print)

QUICK

Requiring a non-letter after the upper-case run correctly excluded the partially capitalized word. But wait! How about this:

string.gsub ("the QUICK brOWN fox", "%u+%A", print)

QUICK 
OWN 

We also have a second problem:

string.gsub ("the QUICK. brown fox", "%u+%A", print)

QUICK.

The punctuation after the word is now part of the captured string, which is not wanted.

Third attempt: %A%u+%A

string.gsub ("the QUICK brOWN FOx jumps", "%A%u+%A", print)

 QUICK

This correctly excludes the two partially capitalised words, but still leaves the punctuation in, like this:

string.gsub ("the (QUICK) brOWN FOx jumps", "%A%u+%A", print)

(QUICK)

Also, there is another problem, apart from capturing the non-letters at the sides. Look at this:

string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%A%u+%A", print)

(QUICK)

The correctly capitalised words at the start and end of the string are not detected, because there is no character before THE or after JUMPS for %A to match.

The solution: The Frontier pattern: %f

string.gsub ("THE (QUICK) brOWN FOx JUMPS", "%f[%a]%u+%f[%A]", print)

THE
QUICK
JUMPS

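The frontier pattern %f followed by a set detects the transition from "not in set" to "in set". The beginning of the source string is handled as if it were the character "\0", which is not a letter and so qualifies as "not in set"; %f[%a] therefore also matches in front of a word at the very start of the string.

The end of the string is handled as "\0" in the same way. Since "\0" belongs to %A, the second frontier pattern %f[%A] also matches at the end of the string, so our final word is captured too.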
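
The calls above pass print as the replacement function only to display the matches. To collect the words instead, a minimal sketch using string.gmatch could look like the following; the function name upper_words is an assumption made for this example.

```lua
-- Hypothetical helper: collect every word made up only of upper-case
-- letters. The frontiers consume nothing, so the returned words carry no
-- surrounding spaces or punctuation.
local function upper_words(s)
  local words = {}
  for word in s:gmatch("%f[%a]%u+%f[%A]") do
    words[#words + 1] = word
  end
  return words
end

print(table.concat(upper_words("THE (QUICK) brOWN FOx JUMPS"), " "))
--> THE QUICK JUMPS
```

Note that digits are not letters, so they act as word boundaries here as well: in "ABC1" the frontier %f[%A] matches in front of the digit and ABC is reported.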

Alternatives without the frontier pattern

Without the frontier pattern, one might resort to things like this:

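s = "THE (QUICK) brOWN FOx JUMPS"
-- put a "\0" marker on every boundary between a non-letter and an
-- upper-case letter, and on both ends of the string
s = "\0" .. s:gsub("(%A)(%u)", "%1\0%2")
             :gsub("(%u)(%A)", "%1\0%2") .. "\0"
-- an all-caps word is now an upper-case run between two markers
-- (%z matches "\0"; it is deprecated since Lua 5.2)
s = s:gsub("%z(%u+)%z", print)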


--NickGammon

One might better resort to this:

('_'..s..'_'):gsub('%A(%u+)%A', print)

--DmitryGaivoronsky

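Not quite. Each match consumes the non-letter that follows the word, so that character is no longer available to begin the next match, and a word separated from the previous match by a single character is skipped: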

s = "THE QUICK brOWN FOx JUMPS"
('_'..s..'_'):gsub('%A(%u+)%A', print)
--> THE JUMPS
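
To see why QUICK is lost, it may help to look at where the first match ends. The sketch below uses string.find on a shortened string; the variable names are made up for illustration.

```lua
local padded = "_THE QUICK_"

-- With %A on both sides, the first match is "_THE " (positions 1 to 5):
-- the space after THE is consumed by it.
local s1, e1 = padded:find("%A%u+%A")
print(s1, e1)                                  --> 1 5

-- Searching on from position 6 starts at "QUICK_", where no non-letter
-- is left in front of QUICK, so nothing more is found.
print(padded:find("%A%u+%A", e1 + 1))          --> nil

-- The frontier pattern consumes no delimiter: THE ends at position 4,
-- and the space is still there to form the frontier in front of QUICK.
local fs, fe = padded:find("%f[%a]%u+%f[%A]")
print(fs, fe)                                  --> 2 4
print(padded:find("%f[%a]%u+%f[%A]", fe + 1))  --> 6 10
```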

You can do this:

(' '..s..' '):gsub('%A+', '  '):gsub(' (%u+) ', print)
--> THE QUICK JUMPS

or this:

s:gsub('%a+', ' %1 '):gsub(' (%u+) ', print)
--> THE QUICK JUMPS

This approach can be extended to more general requirements, such as "find all words that are at least four characters long and are either all lowercase or all uppercase":

s = "THE QUICK brOWN FOx JUMPS over"
s:gsub('%a+', ' %1 ')      -- identify words with ' (%a+) '
                           -- (all following patterns match a subset of this)
 :gsub(' %u+%l+%a* ', '')  -- subtract mixed case words starting with upper
 :gsub(' %l+%u+%a* ', '')  -- subtract mixed case words starting with lower
 :gsub(' %a%a?%a? ', '')   -- subtract words with 1-3 characters
 :gsub(' (%a+) ', print)   -- extract words
--> QUICK JUMPS over
--DavidManura
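
For comparison, the same task can be sketched with frontier patterns. This is only one possible formulation: the function name long_uniform_words is invented here, and the two passes are merged by start position so that the words come out in their original order.

```lua
-- Hypothetical helper: words of at least four letters that are either all
-- upper-case or all lower-case, in order of appearance.
local function long_uniform_words(s)
  local found = {}
  -- %u%u%u%u+ is three upper-case letters followed by one or more, that
  -- is, at least four; the empty capture () records where the word starts.
  for pos, word in s:gmatch("()%f[%a](%u%u%u%u+)%f[%A]") do
    found[#found + 1] = { pos = pos, word = word }
  end
  -- Same idea for all-lower-case words.
  for pos, word in s:gmatch("()%f[%a](%l%l%l%l+)%f[%A]") do
    found[#found + 1] = { pos = pos, word = word }
  end
  -- Merge the two passes back into source order.
  table.sort(found, function(a, b) return a.pos < b.pos end)
  local words = {}
  for i, item in ipairs(found) do
    words[i] = item.word
  end
  return words
end

print(table.concat(long_uniform_words("THE QUICK brOWN FOx JUMPS over"), " "))
--> QUICK JUMPS over
```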

I think the above example is faster and more readable with the LPeg re module:

s = "THE QUICK brOWN FOx JUMPS over"

= re.match(s, "(%A* ( {%u^+4 / %l^+4} (%A/!.) / %a+ ) )+")
QUICK   JUMPS   over
-- Albert Chan

Last edited January 25, 2024 8:51 pm GMT