Re: FQDN Pattern/regex LUA help

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: FQDN Pattern/regex LUA help
From: Lars Müller <appgurulars@...>
Date: Tue, 27 Dec 2022 19:46:56 +0100

This particular pattern will presumably match in linear time, assuming
it matches:

The first item greedily captures characters until the first minus sign.
Since the character class does not contain the minus sign, this will
reach the first minus sign in linear time: If a character is a minus
sign, it needs to match the minus sign, otherwise it can't be in the
character class. Same for the next pattern item which goes until the
first dot from there. Finally, everything else will just be greedily
matched until the end.

I assume that the space complexity will usually (perhaps always?) be
linear in the size of the string: All captures are (distinct,
non-overlapping) substrings of the string and can thus be stored using
(roughly) at most as much memory as the string. The size of "state
information" is bounded by the size of the stack. The size of the stack
is usually bounded by the string length assuming only characters of the
pattern trigger a recursive call that needs to go on the stack. In
particular this should be the case for all quantifiers that match at
least one item.

The reason why it's this efficient in this particular case is that there
are no "choices": A pattern item either matches at a location in the
string or it doesn't match, in which case you have to advance to the
next item; there are no two paths for Lua to explore.

In the worst case, these "choices" provide Lua with exponentially many
paths to explore, making the matching performance is abysmal
(exponential time complexity). This is because pattern matching is (for
simplicity, I assume, and probably because it's easier to bolt on
extensions like frontier- or bracket pattern items) naively implemented
recursively rather than using a proper regex engine, which would convert
regular expressions (note that not all patterns are regular expressions,
and not all regular expressions are patterns however) to NFAs using
Thompson's algorithm to then convert them to DFAs using the powerset
construction, which can then be used to do string matching using linear
time & space in the length of the string.

On 27.12.22 17:37, bil til wrote:

Am Di., 27. Dez. 2022 um 11:38 Uhr schrieb Hartmut Henkel
<hartmut_henkel@gmx.de>:

local h1, h2, h3 = string.match(s, "([^%-]+)%-([^%.]+)%.(.+)")

... just curiosity: Do you have any idea, how much time / RAM you need
for such a match / gsub construction with such more complicated
pattern search as in this example?

... or any "fast way" to get the "approximate time/RAM" consumption by
such a pattern search/match?

(you could also use something like s:sub( (s:find( '-', true))+1,
(s:find( '.', true))) - I wonder whether this would be faster or
slower...).

References:
- FQDN Pattern/regex LUA help, vale.gua
- Re: FQDN Pattern/regex LUA help, Hartmut Henkel
- Re: FQDN Pattern/regex LUA help, bil til

Prev by Date: Re: FQDN Pattern/regex LUA help
Next by Date: Is there a order to running finalizers of garbage collected objects?
Previous by thread: Re: FQDN Pattern/regex LUA help
Next by thread: Is there a order to running finalizers of garbage collected objects?
Index(es):
- Date
- Thread