On Tue, 27 Aug 2019 at 22:43, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
If you care about micro-optimizations, note that using %p to escape all punctuation, instead of just the characters that are special in character classes, will in fact escape too many characters. The result is a longer pattern that takes more time to process and uses more memory than needed when you then build the final pattern for substitutions.
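To make the size difference concrete, here is a small sketch that escapes the same set of characters both ways. The helper names `escape_all_punct` and `escape_class_only` are made up for this illustration, and the sample string is arbitrary; the output comments assume the default C locale.

```lua
-- Hypothetical helpers, for comparison only.
-- The extra parentheses drop gsub's second return value (the match count).
local function escape_all_punct(s)
  return (s:gsub('%p', '%%%0'))
end

local function escape_class_only(s)
  return (s:gsub('[%%%-%]%^]', '%%%0'))
end

local chars = 'a-z,;:!?'
local broad = escape_all_punct(chars)
local narrow = escape_class_only(chars)

print(broad, #broad)    -- a%-z%,%;%:%!%?   14
print(narrow, #narrow)  -- a%-z,;:!?        9
```

Both results describe the same set once wrapped in '[' and ']'; only the '-' actually needed the escape here.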

In fact, very few characters (only 4) need escaping in character classes: '%', '-', ']' and '^'. Note also that '^' is special only at the start of the character class, and '-' is special only between two other characters (possibly escaped). If you write the pattern by hand, you can avoid escaping them by placing them at non-special positions in the class. For generated patterns, you can skip this complication and just always escape these 4:

string.escClassPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end

You can also shorten this function, because '-' and '^' are not special at every position in a character class (here '-' comes first and '^' comes last, so neither needs a '%'):
string.escClassPattern=function(str) return str:gsub('[-%%%]^]', '%%%0') end
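As a quick sanity check, the two variants can be compared on every single-byte input. This is only a sketch: the local names `esc_class_a` and `esc_class_b` are invented here to keep both versions side by side, and the parentheses around `gsub` are an addition that drops its second return value (the substitution count).

```lua
-- Variant with every special character escaped inside the class.
local function esc_class_a(s) return (s:gsub('[%%%-%]%^]', '%%%0')) end
-- Shorter variant relying on the positions of '-' and '^'.
local function esc_class_b(s) return (s:gsub('[-%%%]^]', '%%%0')) end

-- Compare the two variants on every single-byte string.
local mismatches = 0
for byte = 0, 255 do
  local c = string.char(byte)
  if esc_class_a(c) ~= esc_class_b(c) then
    mismatches = mismatches + 1
  end
end
assert(mismatches == 0)

-- Both escape exactly the four class-special characters.
assert(esc_class_b('a-z]^%') == 'a%-z%]%^%%')

-- A hand-written class can use the same positional trick:
-- '^' is not first and '-' is last, so both are literal here.
local by_position = '[a^-]'
for _, c in ipairs({ 'a', '^', '-' }) do
  assert(c:match(by_position) == c)
end
assert(('b'):match(by_position) == nil)
```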

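A generic pattern-escaping function that can be used to match arbitrary literal strings is similar, but has to escape eight more characters: "$", ".", "?", "*", "+", "(", ")" and "[", in addition to the 4 previous ones. The version below also escapes "\", which is not special in Lua patterns; this is harmless, since "%" followed by any non-alphanumeric character stands for that character. In the standard C locale "%p" matches all of these, but it also matches many other characters that need no escaping.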

string.escLiteralPattern=function(str) return str:gsub('[-$%%()*+.?[\\%]^]', '%%%0') end
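A typical use is a plain-text search and replace. The sketch below is illustrative only: `esc_literal` is a local copy of the function above (with parentheses added to drop the count that `gsub` also returns), and `replace_plain` is a hypothetical helper, not something from this thread.

```lua
-- Local copy of the literal-escaping function; the parentheses keep only
-- the escaped string and discard gsub's second return value.
local function esc_literal(s)
  return (s:gsub('[-$%%()*+.?[\\%]^]', '%%%0'))
end

-- Hypothetical helper: replace every literal occurrence of `needle`.
-- In a replacement string only '%' is special, so it is doubled.
local function replace_plain(subject, needle, replacement)
  local safe_replacement = replacement:gsub('%%', '%%%%')
  return (subject:gsub(esc_literal(needle), safe_replacement))
end

print(esc_literal('$5.00'))
-- %$5%.00
print(replace_plain('price: $5.00 (approx.)', '$5.00', '100%'))
-- price: 100% (approx.)
```

Without the escaping step, the '.' in the needle would match any character, so "$5x00" would be replaced too.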

Note that in the class pattern above, there is no need to escape any characters other than "%" and "]", because all other characters in the class are interpreted as literals. Note also that \ is escaped not at the pattern level with a '%' but at the Lua string syntax level with a '\' (this '\' escape disappears completely once the string literal has been parsed, unlike '%', which stays in the string and is interpreted by the pattern matcher at runtime).
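The difference between the two escaping levels can be observed directly. This is a minimal sketch using only the standard string library:

```lua
-- '\\' in source code is a single backslash byte once the chunk is loaded.
assert(#'\\' == 1 and ('\\'):byte() == 92)

-- '%%' is still two bytes at runtime; the pattern matcher resolves it.
assert(#'%%' == 2)
assert(('50%'):match('%%') == '%')

-- A lone backslash in a pattern simply matches a backslash.
assert(('a\\b'):match('\\') == '\\')
```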

The two functions above cannot be unified into a single one, because they do not serve the same purpose at all: the first is used to build a character class matching a single character, while the second is used to build a pattern matching a multi-character string literal.
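The following sketch contrasts the two uses on the same input. The local names `esc_class` and `esc_literal` are stand-ins for the two functions above (with parentheses added to drop the second return value of `gsub`), and the sample strings are arbitrary.

```lua
local function esc_class(s) return (s:gsub('[%%%-%]%^]', '%%%0')) end
local function esc_literal(s) return (s:gsub('[-$%%()*+.?[\\%]^]', '%%%0')) end

local input = 'a.b-c'

-- Character class: matches ONE character out of 'a', '.', 'b', '-', 'c'.
local any_of = '[' .. esc_class(input) .. ']'   -- "[a.b%-c]"
-- Literal pattern: matches the whole five-character string "a.b-c".
local exact = esc_literal(input)                -- "a%.b%-c"

assert(('x-y'):find(any_of) == 2)        -- the '-' alone is enough
assert(('x-y'):find(exact) == nil)       -- but the full literal is absent
assert(('see a.b-c'):find(exact) == 5)   -- found as a whole
assert(('axb-c'):find(exact) == nil)     -- '.' no longer means "any char"
```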