Learning sed is essentially learning the more common "regexps" (used in so many Unix/Linux tools and in many text editors, including vi(m), emacs, and many other visual editors for X-based graphical desktops).

Many people (and, I think, all programmers) can't "live" without the more conventional (and more powerful) regexps, but most can live without the (severely limited) Lua patterns!

On open-source versions of sed, and in all versions for Linux, sed can use Perl-style regexps or one of several regexp dialects, including the older BSD style and the newer PCRE. These are very similar, and in fact equivalent for your goal, since the differences only concern advanced greediness options or variable substitution; sed also supports shell-style file patterns. If Lua resists long enough, sed/ed/vi/vim may eventually integrate the Lua pattern style.

Note however that basic sed syntax does not let you replace some field with a computed value in the replacement string (that requires at least minimal scripting support for a custom computing function, such as a simple counter value, or a custom text transform that is not just a basic one-to-one remapping of letter case).
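For comparison, this kind of computed replacement is exactly what Lua's gsub gives you, since the replacement can be a function. A minimal sketch (the "id=" field and the counter are just an illustration, not something from this thread):

  -- replace every "id=<number>" with a fresh counter value
  local counter = 0
  local function renumber(line)
    return (line:gsub("id=%d+", function()
      counter = counter + 1
      return "id=" .. counter
    end))
  end

  print(renumber("id=7 id=7 id=7"))  --> id=1 id=2 id=3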

But you do get that scripting support in Bash, which also supports the same regexps (so much so that sed has effectively been inlined into Bash and Busybox)!



On Wed, Jun 26, 2019 at 11:09 PM Russell Haley <russ.haley@gmail.com> wrote:


On Wed, Jun 26, 2019 at 1:58 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
You did not need a Lua program then; Notepad++ would have done that directly by loading the 90 files and using its regexp search-and-replace, or you could have used "sed" on Linux.
I considered Geany and sed but Lua is my preferred hammer. I really should learn sed...

:-)
Russ

On Wed, Jun 26, 2019 at 10:56 PM Russell Haley <russ.haley@gmail.com> wrote:


On Wed, Jun 26, 2019 at 1:25 PM Philippe Verdy <verdy_p@wanadoo.fr> wrote:
On Wed, Jun 26, 2019 at 8:50 PM Russell Haley <russ.haley@gmail.com> wrote:

%S didn't work because the file uses Windows line endings and I'm working in Ubuntu (thank you for the hint v). As soon as I converted the line endings it worked as expected. So I am now using:

local l = line:gsub("%S+\r$", "newval"..i)

Things would have worked as expected if you had used line:gsub('%S+(%s*)$', 'newval' .. i .. '%1'),
because it would have preserved the line endings (and any other optional whitespace at the end of lines).
So you would not even have had to convert the line endings between ISO/MIME/DOS/legacy-Windows and Linux.
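A quick check of this, assuming i = 1 (the sample line is made up):

  local i = 1
  local line = "some fields oldval\r"
  local fixed = line:gsub("%S+(%s*)$", "newval" .. i .. "%1")
  assert(fixed == "some fields newval1\r")  -- the trailing "\r" (DOS line ending) is preserved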
Thanks Philippe, this did work as well. I've already re-processed the 90 files and sent them back to the client, but I'll know to check my email for this next time!
Russ  

But if you want performance when processing a whole file, your code should avoid splitting lines individually and should use large buffers (you'll split your buffer just before the last occurrence of '[\r\n]', and keep the remainder in a small cache that will be prepended to the buffer you fill for the next block, until you reach the end of file, which may not always be terminated by a newline but will always match the "$").

In that case, you can use patterns written to work at line-end boundaries or at the end of the buffer (which occurs only at end of file). For the second case, if the buffer to process (including the prepended rest of the previous block) does not end with a [\r\n], just append one "virtually" (i.e. only in the source of the substitution, and remove it again from the result before writing it to the new file).

Your code will then be much faster: it should be able to process buffers of about 256 KB (plus the prepended data from the previous block not terminated by a newline) without problem, with less memory overhead, and at a speed very close to the I/O limits of the disk or of many networks (RAID disks typically reach their maximum read/write speed with block sizes of about 64 KB).

Maybe even 256 KB is excessive if you have memory constraints; then use 64 KB, which will still be much faster and more memory-efficient than processing large files line by line, creating many temporary short strings that hammer the Lua garbage collector. Experiment in your environment to find the fastest size, then look at the Lua memory overhead in the garbage collector statistics: the smaller the block size, the more memory overhead you'll have and the slower your code will be.
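A rough sketch of that block-based loop (the file names, the 64 KB block size and the 'newval' substitution are assumptions carried over from this thread; here the unterminated tail is handled with '$' at end of file instead of a virtual newline):

  local BLOCK = 64 * 1024
  local i = 1                            -- whatever counter/value your script uses
  local inp = assert(io.open("input.txt", "rb"))
  local out = assert(io.open("output.txt", "wb"))
  local tail = ""                        -- unterminated last line of the previous block
  while true do
    local block = inp:read(BLOCK)
    if not block then break end
    block = tail .. block
    local cut = block:find("[^\r\n]*$")  -- start of the last (possibly incomplete) line
    tail = block:sub(cut)
    block = block:sub(1, cut - 1)
    -- replace the last word of every complete line, keeping its line ending
    out:write((block:gsub("%S+(%s*\n)", "newval" .. i .. "%1")))
  end
  if tail ~= "" then                     -- the file did not end with a newline
    out:write((tail:gsub("%S+(%s*)$", "newval" .. i .. "%1")))
  end
  inp:close()
  out:close()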

----

Note that the expression '%S+(%s*)$' may be very greedy in '%S+' and in '%s*': it can collect arbitrarily long "words" (non-spaces), which could take significant space in memory and would produce nonsense output from what was actually incorrect input (not the expected text format).

You may want to impose a reasonable maximum size on the final word, and raise an error if some file happens to have too-long "garbage" at the end of its lines. The same is true of (%s*), which may be arbitrarily long. So you may want to detect files (most probably garbage) that exceed this maximum, by:

- detecting files that cannot be valid UTF-8 text files and are most probably binary files, if they contain '[\0\240-\255]' (you may extend this set to other undesired ASCII controls, such as '[\0-\8\14-\31\127\240-\255]'), then
- detecting '%s{129}$' as invalid text and then
- detecting '%S{129}(%s{,128})$' as invalid text, before
- doing the actual substitution with '%S{,128}(%s{,128})$'

(change 128 and 129 above to whatever limits you find reasonable for trailing whitespace or for the last word to replace on these lines; note that Lua patterns have no counted quantifiers like '{129}', which is regexp syntax, so in Lua you would check the lengths of the captures instead, as in the sketch below)
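Since Lua patterns have no counted quantifiers, a sketch of those checks in Lua would test the lengths of the captures instead (the limits and the byte set come from the suggestions above; the helper names and the code itself are only an illustration):

  local MAX_WORD, MAX_SPACE = 128, 128

  local function looks_binary(block)
    -- bytes that should not appear in a plain UTF-8 text file
    -- (Lua 5.1 would need %z instead of a literal \0 inside the pattern)
    return block:find("[\0-\8\14-\31\127\240-\255]") ~= nil
  end

  local function replace_last_word(line, i)
    local word, spaces = line:match("(%S+)(%s*)$")
    if not word then return line end     -- blank or whitespace-only line
    if #word > MAX_WORD or #spaces > MAX_SPACE then
      error("line tail too long, probably garbage")
    end
    return (line:gsub("%S+(%s*)$", "newval" .. i .. "%1"))
  end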