hi Philippe! :)


> In fact, there are very few (only 4) characters that need escaping in character classes: only '%', '-', ']' and '^' (note also that '^' is special only at start of the character class, and '-' is special only between two other characters possibly escaped, you can avoid escaping them by placing them at non-special positions in the class pattern, if you write the pattern manually, but for generated patterns, you can avoid this complication and just stick to these 4):

> string.escPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end

that's right about escaping char sets, except that 5.1 (but not the
later versions) also needs `'\0'` --> `'%z'`. i was actually replying
to Roberto's stuff, though, not exactly to Scott (the op) :D otherwise
all the magic chars should be escaped...
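a minimal sketch of the char set escaping discussed above. the name `esc_class` is made up for this example, and the extra parentheses are only there to drop the substitution count that `gsub()` returns as a second value:

```lua
-- hypothetical helper: escapes the four chars that are special inside [...]
local function esc_class(str)
  -- the parentheses keep only the first return value of gsub
  return (str:gsub('[%%%-%]%^]', '%%%0'))
end

-- build a char set from arbitrary chars, e.g. for a generated pattern
local set = '[' .. esc_class('^-]%') .. ']'
print(set)                               --> [%^%-%]%%]
print((('a^b-c]d%e'):gsub(set, '_')))    --> a_b_c_d_e
```

this does not handle the `'\0'` case of 5.1 mentioned above.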

[side note, but it may still be interesting :D ]
my use case was to match `'...veryLongTrimmedPath/whateverFile.lua'`
(without the `'...'`) from the error messages against the files that
my monkeypatched `require()` registers when they are loaded (also
`loadfile()` and `dofile()`, but i don't use them, if i'm right). it
finds them by looking up the possible paths from `package.path`;
`package.cpath` isn't important here, though a nasty source code
lookup for function definitions could make it interesting. this way i
can print a few source lines with the tracebacks. the escaping was
needed because simple things like `'.'` in the paths could cause side
effects. however, later i discovered that the 4th arg of
`string.find()` (`plain`) can do the trick. it gives me the position
instead of the substring, but i think that's still cheaper and more
straightforward. i recently made a function that can search in
strings/files (i mentioned it earlier somewhere with a few more
details) and i was happy that i already had this simple gem around.
then i realized that the `string.find()` approach was fine there too,
and even better, because it plays better with sub-patterns, as i
needed something like `'()('.. term.. ')()'` to get the position...
so this invalidated my use cases, but i still kept that nice one-liner
gem. :D
[/side note]
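a small sketch of the two approaches from the side note. the error message text after the path is invented for the example, and escaping via `'%p'` is just one possible way to make the path literal:

```lua
-- hypothetical error message; only the path part comes from the note above
local msg  = '...veryLongTrimmedPath/whateverFile.lua:12: some error'
local path = 'veryLongTrimmedPath/whateverFile.lua'

-- as a pattern, the '.' in the needle matches any char (false positive):
print(('whateverFileXlua'):find('whateverFile.lua'))          --> 1  16
-- with plain = true (the 4th arg) the needle is taken literally:
print(('whateverFileXlua'):find('whateverFile.lua', 1, true)) --> nil

-- plain find gives the positions directly
local s, e = msg:find(path, 1, true)
print(s, msg:sub(s, e))

-- pattern variant: escape the path, then use position captures
local escaped = path:gsub('%p', '%%%0')
local _, _, from, found, after = msg:find('()(' .. escaped .. ')()')
print(from, found, after)
```

the plain variant skips both the escaping step and the captures, which is probably why it feels cheaper.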

branching for `'^'`, `'-'` and so on could cost some performance, i
think, but i didn't test it


> Note that the suggestion to use "%p" does not work as it doesn't escape the "^" and "%" (which are not punctuations like "-" and "]", they are symbols) !!!

from the point of view of lua (inherited from c (`ispunct()`)), they are :D
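a quick check of this, assuming the default "C" locale (which is what the standalone interpreter runs with unless `os.setlocale()` is called):

```lua
-- all four chars that matter inside a char set are matched by %p
for _, ch in ipairs({ '^', '%', '-', ']' }) do
  print(ch, ch:match('%p') ~= nil)
end
```

in that locale every line should end with `true`, since `ispunct()` accepts any printable char that is neither a space nor alphanumeric.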


> Once this generated character class will be compiled (fast), it will also have the same runtime independantly of the text on which you'll search the pattern and optionally make replacements: the time does not depend at all on the "length" of the character class (i.e. the number of characters it matches) as the compiled regexp will become an internal array/vector/bitmap where each character of the text to parse will be looked up by a single direct access (after an initial range check for the index).

that sounds good, and it's also possible; maybe luajit (and/or lpeg
and/or 5.4) does that, but it's not the case with 5.1-5.3, see:
https://www.lua.org/source/5.1/lstrlib.c.html#matchbracketclass and
`match_class()` right above it. `isalpha()` and the like may do that
for the individual char classes, but i think they just do some range
checks. i have some notes about this topic, but i forgot to record the
source, maybe wikipedia gave it:
"#define isdigit(x) ((x) >= '0' && (x) <= '9')
This can cause problems if x has a side effect---for instance, if one
calls isdigit(x++) or isdigit(run_some_program())"
and i think that the current ones work the same way, but they aren't
implemented as macros anymore

btw i like the idea, a bool array from 0 to 255 could really do the
trick, but it would require 256 bytes for every char set, so, on the
other hand, that's not memory efficient...
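just to illustrate the lookup table idea in plain lua (this is not how `lstrlib.c` works, and the names `build_set` and `count_matching` are made up): the char set is evaluated once for each of the 256 byte values, and after that every char of the text costs a single table access.

```lua
-- hypothetical helper: precompute a 0..255 lookup table for a char set
local function build_set(class)
  local set = {}
  for b = 0, 255 do
    set[b] = string.char(b):find(class) ~= nil
  end
  return set
end

-- count the chars of text that belong to the precomputed set
local function count_matching(text, set)
  local n = 0
  for i = 1, #text do
    if set[text:byte(i)] then n = n + 1 end
  end
  return n
end

local is_ident = build_set('[%w_]')
print(count_matching('foo_bar 42!', is_ident))  --> 9
```

a lua table costs far more than 256 bytes, of course; in c the same thing could be a 256 byte array.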


> string.escLiteralPattern=function(str) return str:gsub('[-$%%()*+.?[\\%]^]', '%%%0') end

you could still use `'%p'` here, but it covers more than just the
magic chars, and that has side effects, so this one is just another
variation :D using `'%p'` is still the simplest and most
straightforward in most cases, but there are cases where precision
requires this latter version that you made, so it has every right to
exist! :D
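a side by side sketch of the two variants. the helper names are invented, and the explicit list here leaves out the backslash, which is not magic in lua patterns:

```lua
-- hypothetical helpers; parentheses drop gsub's substitution count
local function esc_punct(str) return (str:gsub('%p', '%%%0')) end
local function esc_magic(str) return (str:gsub('[-$%%()*+.?[%]^]', '%%%0')) end

local name = 'my_file-v1.0.lua'
print(esc_punct(name))  --> my%_file%-v1%.0%.lua
print(esc_magic(name))  --> my_file%-v1%.0%.lua

-- both results match the original string literally
print(name:find(esc_punct(name)))  --> 1  16
print(name:find(esc_magic(name)))  --> 1  16
```

one visible difference is that the `'%p'` version also escapes harmless punctuation like `'_'`, which still matches fine but makes the result longer and harder to read.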


bests! :)