Wim Couwenberg wrote:
Anyway, here's a simplistic script to test binary-ness. Adjust the pattern in "find" to something more sensible, if you like.

Usage: lua isbin.lua <file-name>

--------------- file isbin.lua: ---------------
local now = os.clock()

local input, err = io.open(arg[1], "rb")
assert(input, err)

local isbin = false
local chunk_size = 2^12

local find = string.find
local read = input.read

repeat
    local chunk = read(input, chunk_size)
    if not chunk then break end

    if find(chunk, "[^\f\n\r\t\032-\128]") then
        isbin = true
        break
    end
until false

input:close()

now = os.clock() - now

if isbin then
    print "this file is binary..."
else
    print "this is a text file..."
end

print(string.format("this took %.3f seconds", now))
----------- end of file -----------
Whoa, so non-English text files are binary? ;-) (Perhaps by old FTP and Mail standards...)
Also, I believe \127 should be seen as binary (it is the DEL code), and \128 is already in the high-ASCII area...
So perhaps I would rewrite your pattern as: [^\f\n\r\t\032-\126\192-\255]
Note I excluded the \128-\191 area, whose lower part (\128-\159) is seen in ISO (8859-1 for example) as control characters; but if you consider the quite common Windows ANSI encoding (CP1252), this area contains many valid characters, including the euro symbol, (c), (R), etc.
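To make the effect of such a pattern concrete, here is a small sketch. The names binary_byte and looks_binary are mine, not from the original script, and the upper bound is written as \255 because Lua decimal escapes stop at 255.

```lua
-- Sketch only: a byte outside the allowed set marks the chunk as binary.
-- Allowed: \f \n \r \t, printable ASCII (32-126) and high bytes 192-255.
-- binary_byte and looks_binary are illustrative names.
local binary_byte = "[^\f\n\r\t\032-\126\192-\255]"

local function looks_binary(chunk)
    return string.find(chunk, binary_byte) ~= nil
end

print(looks_binary("plain ASCII text\n"))   -- false
print(looks_binary("caf\233 cr\232me\n"))   -- false (ISO 8859-1 accents)
print(looks_binary("del\127"))              -- true  (DEL)
print(looks_binary("10 \128"))              -- true  (euro sign in CP1252)
```

The last line shows the trade-off mentioned above: CP1252 text using the euro sign is reported as binary.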
And, of course, your test doesn't work for UTF-8 and most other Unicode encodings. But that's another can of worms...
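To illustrate the UTF-8 problem: an e-acute is encoded as the two bytes \195 \169, and the second one falls in the \128-\191 range, so a byte-class test that rejects that range flags the text as binary. Below is a rough sketch of a shape check for UTF-8, assuming Lua 5.1 or later. The name is_utf8_shaped is hypothetical, and this is not a full validator (it ignores overlong forms and surrogates, and it accepts control characters).

```lua
-- Rough UTF-8 shape check (assumption: lead byte followed by the right
-- number of continuation bytes; no overlong or surrogate checks).
local function is_utf8_shaped(s)
    local i, n = 1, #s
    while i <= n do
        local b = string.byte(s, i)
        local extra
        if b < 128 then extra = 0                      -- ASCII
        elseif b >= 194 and b <= 223 then extra = 1    -- 2-byte sequence
        elseif b >= 224 and b <= 239 then extra = 2    -- 3-byte sequence
        elseif b >= 240 and b <= 244 then extra = 3    -- 4-byte sequence
        else return false end                          -- stray byte
        for j = i + 1, i + extra do
            local c = string.byte(s, j)                -- nil past the end
            if not c or c < 128 or c > 191 then return false end
        end
        i = i + extra + 1
    end
    return true
end

print(is_utf8_shaped("caf\195\169"))  -- true  (UTF-8 e-acute)
print(is_utf8_shaped("caf\233"))      -- false (ISO 8859-1 e-acute)
```

Such a check would have to be combined with a control-character test to say anything about binary-ness.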
Additional note: many implementations of this kind of test agree that testing the first bytes (256, 512...) of a file is enough to see if it is binary or not. Perhaps it is too simplistic for some file formats, but it can work most of the time.
And I doubt there are so many text files of over 1GB... Except perhaps some exceptional log files or XML data files.
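The "first bytes only" approach could be sketched as follows. The function name is_binary_file, the sample_size parameter and its default of 512 are my own choices for illustration, and the character class is written with \255 as its upper bound since Lua decimal escapes stop at 255.

```lua
-- Sketch: classify a file by looking at its first bytes only.
-- is_binary_file and sample_size are illustrative names.
local binary_byte = "[^\f\n\r\t\032-\126\192-\255]"

local function is_binary_file(path, sample_size)
    local input, err = io.open(path, "rb")
    if not input then return nil, err end
    local head = input:read(sample_size or 512)
    input:close()
    -- read returns nil on an empty file; treat that as text.
    if not head then return false end
    return string.find(head, binary_byte) ~= nil
end

-- Usage: lua isbin.lua <file-name>
if arg and arg[1] then
    local isbin, err = is_binary_file(arg[1])
    if isbin == nil then
        print(err)
    elseif isbin then
        print "this file is binary..."
    else
        print "this is a text file..."
    end
end
```

Since only one small read is done, the size of the file no longer matters, at the price of missing binary data that starts after the sampled bytes.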
-- Philippe Lhoste -- (near) Paris -- France -- http://Phi.Lho.free.fr -- -- -- -- -- -- -- -- -- -- -- -- -- --