On Wed, 26 June 2019 at 20:50, Russell Haley <russ.haley@gmail.com> wrote:

%S didn't work because the file uses Windows line endings and I'm working in Ubuntu (thank you for the hint v).  As soon as I converted the line endings it works as expected. So I am now using:

local l = line:gsub("%S+\r$", "newval"..i)

Things would have worked as expected if you had used line:gsub('%S+(%s*)$', 'newval' .. i .. '%1'),
because the capture preserves the line ending (and any other optional whitespace at the end of the line).
So you would not even have to convert the line endings between ISO/MIME/DOS/legacy Windows and Linux conventions.
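
A minimal sketch of that substitution, wrapped in a helper whose name (replace_last_word) is only an illustration. It assumes the new value contains no "%" character, since "%" is special in gsub replacement strings:

```lua
-- Replace the last word of a line and keep whatever whitespace follows it,
-- including a trailing "\r" left over from CRLF files.
-- Assumption: newval contains no "%" (it would need escaping as "%%").
local function replace_last_word(line, newval)
  -- %S+   : the last run of non-space characters
  -- (%s*) : trailing whitespace, captured and re-emitted as %1
  -- The outer parentheses drop gsub's second result (the match count).
  return (line:gsub('%S+(%s*)$', newval .. '%1'))
end

local i = 1
local l = replace_last_word('key = oldval\r', 'newval' .. i)
-- l is now 'key = newval1\r': the "\r" survives untouched
```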

But if you want performance when processing a whole file, your code should avoid splitting lines individually and should use large buffers instead: split each buffer at the last occurrence of '[\r\n]' and keep what follows it in a small cache that is prepended to the next block you read, until you reach the end of the file, which may not be terminated by a newline but will always match the "$".

In that case, you can use patterns written to match at line-end boundaries or at the end of the buffer (which occurs only at the end of the file). For the second case, if the buffer to process (including the prepended rest of the previous block) does not end with a [\r\n], just append one "virtually" (i.e. only to the source of the substitution, and remove it from the result before writing it to the new file).
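
Here is a rough sketch of that block-wise loop, under some assumptions of mine: lines end in LF or CRLF (a lone "\r" is not treated as a terminator), the counter numbers the replaced lines, and process, substitute, read and write are hypothetical names. The pattern matches the last word in front of each "\n" instead of relying on "$", because "$" only anchors at the end of the whole subject string.

```lua
-- Last word of a line, followed by optional blanks and an LF or CRLF.
local LINE_END_WORD = '%S+([ \t]*\r?\n)'

-- Replace the last word of every complete line in text.
-- state.n counts the replacements made so far.
local function substitute(text, state)
  return (text:gsub(LINE_END_WORD, function(tail)
    state.n = state.n + 1
    return 'newval' .. state.n .. tail
  end))
end

-- read(size) returns the next block, or nil at end of input;
-- write(s) emits a piece of output.
local function process(read, write, block_size)
  local state = { n = 0 }
  local carry = ''                 -- unterminated tail of the previous block
  while true do
    local block = read(block_size)
    if not block then break end
    local buffer = carry .. block
    local cut = buffer:match('.*()\n')   -- position of the last "\n", if any
    if cut then
      write(substitute(buffer:sub(1, cut), state))
      carry = buffer:sub(cut + 1)
    else
      carry = buffer               -- no complete line yet, keep accumulating
    end
  end
  if carry ~= '' then
    -- The last line has no terminator: add a virtual "\n" for matching
    -- only, then strip it again before writing.
    local out = substitute(carry .. '\n', state)
    write(out:sub(1, -2))
  end
  return state.n
end
```

With real files this could be called as process(function(n) return src:read(n) end, function(s) dst:write(s) end, 256 * 1024), where src and dst are handles opened in binary mode ('rb' and 'wb') so that no line-ending translation happens.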

Your code will then be much faster: it should be able to process buffers of about 256 KB (plus the data prepended from the previous block that was not terminated by a newline) without problem, with less memory overhead, and at a speed very close to the I/O limits of disks or many networks (RAID disks typically reach their maximum read/write speed with block sizes of about 64 KB).

Maybe even 256 KB is excessive if you have memory constraints; in that case use 64 KB. It will still be much faster and more memory-efficient than processing large files line by line with many short temporary strings that stress the Lua garbage collector. Experiment in your environment to find the fastest size, then look at the Lua memory overhead in the garbage collector statistics: the smaller the block size, the more memory overhead you'll have and the slower your code will be.
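
To experiment, something like the following could be a starting point. It is only a sketch: measure, per_line and whole_buffer are hypothetical names, the input is synthetic, and the difference in collectgarbage('count') is a rough indicator at best, because the collector may run in the middle of a measurement.

```lua
-- Run fn() and report its result, the CPU time used, and the growth of
-- the Lua heap in KB.
local function measure(fn)
  collectgarbage('collect')        -- start from a settled heap
  local kb0, t0 = collectgarbage('count'), os.clock()
  local result = fn()
  return result, os.clock() - t0, collectgarbage('count') - kb0
end

-- Synthetic input (an assumption for illustration): 20000 short CRLF lines.
local input = string.rep('key = value\r\n', 20000)

-- Variant A: one gsub call per line, creating many short temporary strings.
local function per_line()
  local out = {}
  for line in input:gmatch('[^\n]*\n') do
    out[#out + 1] = line:gsub('%S+(%s*)$', 'X%1')
  end
  return table.concat(out)
end

-- Variant B: a single gsub call over the whole buffer.
local function whole_buffer()
  return (input:gsub('%S+([ \t]*\r?\n)', 'X%1'))
end

local a, ta, ma = measure(per_line)
local b, tb, mb = measure(whole_buffer)
assert(a == b)                     -- both variants must produce the same text
print(string.format('per line:     %.3f s, %+.0f KB', ta, ma))
print(string.format('whole buffer: %.3f s, %+.0f KB', tb, mb))
```

The same measure helper can wrap a real file-processing run with different block sizes (64 KB, 256 KB, ...) to see which one suits your disk and memory budget.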

----

Note that the expression '%S+(%s*)$' may be very greedy in both '%S+' and '%s*': it can collect arbitrarily long "words" (runs of non-spaces), which could take significant space in memory and would produce nonsensical output from what was actually incorrect input (not the expected text format).

You may want to set a reasonable maximum size for the final word, and then raise an error if a file happens to have overly long "garbage" at the end of its lines. The same is true of (%s*), which may be arbitrarily long. So you may want to detect files (most probably garbage) that exceed this maximum, by

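- detecting files that are most probably binary rather than valid UTF-8 text, if they contain '[\0\245-\255]' (bytes 245-255 never occur in UTF-8; bytes 240-244 occur only as lead bytes of characters beyond U+FFFF, so add them only if you do not expect such characters; you may also extend the set to other undesired ASCII controls, such as '[\0-\8\14-\31\127\245-\255]'), then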
- detecting '%s{129}$' as invalid text and then
- detecting '%S{129}(%s{,128})$' as invalid text, before
- doing the actual substitution with '%S{,128}(%s{,128})$'

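(replace 128 and 129 above with the limits you consider reasonable for the whitespace at the end of lines and for the last word to replace in these lines; note that the {n} and {,n} counts are regex-style shorthand: Lua patterns have no counted repetition, so the limits have to be enforced by other means, for example by comparing the lengths of captures)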
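
One way to apply these limits in plain Lua is to look only at the end of each line and compare lengths. The sketch below makes a few assumptions of its own: the helper name checked_replace and the error messages are hypothetical, and the binary check uses NUL and bytes 245-255 only (a narrower set than '\240-\255', because 240-244 are legal lead bytes of 4-byte UTF-8 sequences).

```lua
local MAX_WORD, MAX_SPACE = 128, 128     -- adjust to the limits you accept

-- Returns the rewritten line, or nil plus an error message.
local function checked_replace(line, newval)
  -- 1. Reject likely binary data: a NUL byte, or a byte that never
  --    occurs in UTF-8. The plain find avoids "\0" inside a pattern.
  if line:find('\0', 1, true) or line:find('[\245-\255]') then
    return nil, 'line looks binary'
  end
  -- 2. Only the last MAX_WORD + MAX_SPACE + 1 bytes are needed to check
  --    both limits, so the pattern never scans more than that.
  local tail = line:sub(-(MAX_WORD + MAX_SPACE + 1))
  local word, space = tail:match('(%S*)(%s*)$')
  if #space > MAX_SPACE then
    return nil, 'too much trailing whitespace'
  end
  if #word > MAX_WORD then
    return nil, 'last word too long'
  end
  if word == '' then
    return line                          -- blank line: nothing to replace
  end
  -- 3. Rebuild the line by hand, so newval needs no "%" escaping.
  return line:sub(1, #line - #word - #space) .. newval .. space
end
```

For example, checked_replace('key = old\r', 'newval1') returns 'key = newval1\r', while a line ending in 129 or more whitespace characters returns nil and a message that the caller can turn into an error.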