Re: How to prevent catastrophic regex?

True! If you use repeated subpatterns without maximum bounds, you should not allow the two repetitions unless they don't have ANY common prefix. In addition, if both can have a null lower bound of repetitions, one of them must be greedy and the other must not (so "[a]*[b]*" or "[a]-[b]-" should not be valid, independently of the fact that [a] and [b] have possible common prefixes or not)

Parsing and validating the regexp can help ensure that they are not causing excessive backtracking.

And even if Lua patterns don't have alternates (with '|'), there's a limited alternation allowed to the place where the two successive repetitions are to be separated, and of course the pattern should be reasonably anchored at start (with '^' or with a distinctive contextual prefix that cannot be part of what you want to capture) and at end (with '$' or with a distinctive contextual suffix that cannot be part of what you want to capture).

Unfortunately, Lua sets dont allow placing '^' or '$' anchors in them. One way to avoid this problem is to add some null byte at start and end of the text to scan, so that you can use this null byte in the set used for anchoring the start or end (this works provided that the text to scan is warrantied to not contain any null byte, or that text has first been escaped). But other contextual anchors than '^' or '$' can be useful: word boundaries (that would also need to match the start and end of text: here also the null byte trick can be used).

Ideally Lua should have some syntax like [...%q...] to match a boundary as if it was a character, part of the set but matching actually an empty string in the scanned text and that satisfies the boundary condition: start/end of text, start/end of word, i.e. the change of membership in a subclass of characters: Let's assume this is %q, it can be followed by ^ or $ to indicated the BEF/EOF condition (like "-P(1)" in LPeg)., or by a subclass:

- [A-C%q^] would match letters A to C, or at start of file '%q^'

- [A-C%q$] would match letters A to C, or at end of file '%q$'

- [A-C%q%a] would match letters A to C, or at start of sequence of letters ('%a' = alpha)

- [A-C%q%A] would match letters A to C, or at end of sequence of letters ('%A' = non-alpha)

- [A-C%q[%(%q^]] would match letters A to C, or after an opening parenthese '%(', or at start of file '%q^.

This is of course an extension to the set, where boundaries are treated as if they were additional characters (but their actual match length is 0 in any captured text).

Given that sets in Lua are limited to set of bytes they are just 256-bit bitsets. But adding boundaries to them would extend the input character set to more than 8 bit; to match no characters are compared but an boundary condition indexed by an integer 256 or higher that can be tested, several conditions being predefined: BOF, EOF, start/end of word (for the predefined class %a and its complement %A, these conditions are checked by looing at the current character and the character just before or just after it, without needing complex backtracking); Optionally some %q() would allow to use a callback to the function getting captures, so that the condition would be evaluated at the given position (the function could use its own closure definition context to keep some internal state, such as counters).

...

Le jeu. 11 juin 2020 à 17:26, Oliver Kroth <oliver.kroth@nec-i.de> a écrit :

Hi,

try to enforce use of anchors, ^ and/ or $

--
Oliver

Am 11.06.20 um 17:03 schrieb xavier.decoret@gmail.com:
> The string package has regular _expression_ matching, and they can be used to leverage https://www.regular-expressions.info/catastrophic.html. Check the following example, which takes 25 seconds to run.
>
> start = os.time()
> print('running...')
> s = string.gsub('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'a*a*a*a*a*a*a*b', 'X')
> print('ran in ', os.difftime(os.time(), start), ' second(s)')
>
> I'd like to find a way to prevent from such "attack" on a system that would execute user provided lua scripts. The system uses a custom allocator that limits the memory that can be allocated, and a hook that calls lua_error at next instruction executed if script evaluation takes too long.
>
> Unfortunately, none of those help, because the above code doesn't need a lot of memory allocated to run, and the debug hook sees the whole gsub call as one instructions.
>
> So the only solution I see is to not expose string.gsub() to users, or to run in a sandbox that can be terminated violently after some time. Has the Lua community figured out a better solution? Have I missed something in our hooks should be used, or potential native Lua ways of "bounding" execution?
>