If you care about micro-optimizations, note that using %p to escape all punctuation, instead of only the characters that are special inside character classes, escapes too many characters. The result is a longer pattern that takes more time and memory to build and to process when it is then used as the final pattern for substitutions.

In fact, very few characters (only 4) need escaping inside a character class: '%', '-', ']' and '^'. Note that '^' is special only at the start of the class, and '-' is special only between two other (possibly escaped) characters. If you write the pattern by hand, you can avoid escaping them by placing them at non-special positions in the class; for generated patterns, you can skip this complication and simply always escape these 4:

`string.escPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end`

So to generate a generic character class from an arbitrary set of (ASCII) characters given in a string, you only need to escape these four characters and surround the result with '[]'. You may optionally want to avoid including the same character several times in the generated pattern, but the code is more complex and, in my opinion, not worth it: a duplicate inside a set is harmless to the native pattern matcher in the Lua library.
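The steps just described can be sketched as follows. The names `escapeForSet` and `makeCharClass` are made up for this illustration and are not part of Lua or of this thread; duplicate removal is included only to show the optional step, and the sketch assumes a non-empty input string (an empty one would produce the malformed pattern "[]").

```lua
-- Escape only the four characters that are special inside a set.
-- (escapeForSet and makeCharClass are illustrative names.)
local function escapeForSet(chars)
  -- The parentheses drop gsub's second result (the substitution count).
  return (chars:gsub('[%%%-%]%^]', '%%%0'))
end

-- Build "[...]" from the characters in `chars`, skipping duplicates.
-- Assumes `chars` is not empty.
local function makeCharClass(chars)
  local seen, unique = {}, {}
  for c in chars:gmatch('.') do
    if not seen[c] then
      seen[c] = true
      unique[#unique + 1] = c
    end
  end
  return '[' .. escapeForSet(table.concat(unique)) .. ']'
end

local class = makeCharClass('.+-?+')
print(class)                            -- [.+%-?]
print(('a.b+c-d?e'):gsub(class, '_'))   -- a_b_c_d_e   4
```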

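Note that the suggestion to use "%p" does work in the default "C" locale: there "%p" matches every printable ASCII character that is neither alphanumeric nor a space, which includes "^" and "%" as well as "-" and "]". Its drawbacks are that it escapes many more characters than needed and that its meaning depends on the current locale.
So this works, but over-escapes:
`string.escPattern=function(str) return str:gsub('%p', '%%%0') end`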
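A quick way to see what each variant actually escapes is to run both on the same sample. This sketch assumes the default "C" locale, where "%p" follows C's ispunct(); the helper names `escPunct` and `escSet` are made up for the comparison.

```lua
-- Assumes the default "C" locale; "%p" is implemented with C's ispunct(),
-- so the first result may differ under another locale.
local function escPunct(s) return (s:gsub('%p', '%%%0')) end
local function escSet(s)   return (s:gsub('[%%%-%]%^]', '%%%0')) end

local sample = '.+-?^%]'
print(escPunct(sample))  -- %.%+%-%?%^%%%]
print(escSet(sample))    -- .+%-?%^%%%]
```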

Keep in mind that the reference implementation of Lua patterns (lstrlib.c) has no separate compilation step: the pattern string is interpreted directly while matching. Built-in classes such as "%p" or "%P", and ".", cost a single test per character of the text, but a set written as "[...]" is scanned item by item for every character tested. The cost of a set therefore depends on the length of the set as written, not on the number of characters it matches: "[A]" and "[A-QTZ]" are cheap, and the range "[A-Z]" is cheaper than "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", which matches exactly the same characters.

So you should keep the generated pattern as short as possible. Transforming the set of characters in the string ".+-?" into the pattern "[.+%-?]" (escaping only the 4 special characters) is better than producing the character class pattern "[%.+%-%?]", which has two unnecessary escapes for "." and "?".
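Performance claims like these are easier to measure than to argue about. Below is a rough timing sketch using os.clock; the sample text, the repeat count and the name `timeIt` are arbitrary choices for illustration. Absolute numbers depend on the machine and the Lua version, so no timings are shown here.

```lua
-- Rough timing harness; the text size and repeat count are arbitrary.
local text = string.rep('The Quick Brown Fox Jumps Over The Lazy Dog. ', 2000)

-- Runs gsub `reps` times and returns the elapsed CPU time and match count.
local function timeIt(pattern, reps)
  local count
  local start = os.clock()
  for _ = 1, reps do
    -- gsub's second result is the number of matches.
    count = select(2, text:gsub(pattern, ''))
  end
  return os.clock() - start, count
end

-- Three ways to match the same characters (in the "C" locale).
local patterns = {
  '[A-Z]',
  '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]',
  '%u',
}

for _, p in ipairs(patterns) do
  local elapsed, count = timeIt(p, 20)
  print(string.format('%-30s %6d matches  %.3f s', p, count, elapsed))
end
```

All three patterns should report the same number of matches; only the elapsed times are of interest.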


On Tue, 27 Aug 2019 at 20:58, szbnwer@gmail.com <szbnwer@gmail.com> wrote:
hi there! :)


Roberto Ierusalimschy:

> The following line should do the trick:
>  s = string.gsub(s, "%W", "%%%0")


my solution was:
`string.escPattern=function(str) return str:gsub('%p', '%%%0') end`

(((im lying, it was actually:
`string.escPattern=function(str) return str:gsub('(%p)', '%%%1') end`
but the other variant looks better, and i think it should be more
optimal and micro optimization is important when the ice caps are
already melting! just think about those poor penguins and linux! :D
)))

that means less chars to escape (white spaces and control chars), and
a non-inverted char class is maybe (i guess) faster as it is more
straightforward

and this should be legit, as the manual explicitly allows this:

"%x: (where x is any non-alphanumeric character) represents the
character x. This is the standard way to escape the magic characters.
Any punctuation character (even the non magic) can be preceded by a
'%' when used to represent itself in a pattern."

(so it says `'%W'` is legit, but the last sentence is the point, and
every magic chars are in `'%p'`, and i dont even believe that anything
else will ever become a magic char than punctuation chars...)


all the bests to all of u! :)