lua-l archive



Hi:

On Fri, Feb 21, 2014 at 11:51 PM, Tim Hill <drtimhill@gmail.com> wrote:
> (my) thread summary:
> -- Using the Lua library to read text files yields unpredictable/unexpected results if the file contains embedded NUL characters.

Actually the library gives predictable results, although highly
unexpected ones. It's not only the single-line, chop-newlines stuff (
where "hello\0world\n" is read as "hello" instead of "hello\0world" ).
It can also stitch lines together and make some of them disappear.
Try just this ( on a *ix machine, windows shell syntax is too
difficult for me ):

folarte@paqueton:~/tmp$ cat lines.lua
for l in io.input():lines('*L') do
   io.output():write(l)
end
folarte@paqueton:~/tmp$ echo -ne 'Hello\0world\n \0No, not really\nFrancisco\0Olarte\n\n' | lua lines.lua
Hello Francisco

It is predictable, I built the echo in one go ( ok, a little lie, two,
I forgot the cosmetic space before the second null on the first try ).
It will work that way on every libc which does not treat null on stdin
specially.
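
For anyone who wants to see why the lines get stitched, here is a small
model in Lua. It is only a sketch, assuming the line reader fills a
buffer with fgets() and measures what it got with strlen(), which is my
reading of liolib and not the actual code. model_read_line and the
chunks table are made-up names for this example; each entry stands in
for what one fgets() call would leave in the buffer.

```lua
-- Model of a line reader built on fgets() + strlen().
-- model_read_line is a hypothetical name, not part of any Lua library.
-- 'chunks' lists what successive fgets() calls would store in the buffer.
local function model_read_line(chunks, start, keep_newline)
  local parts = {}
  local i = start
  while chunks[i] do
    local chunk = chunks[i]
    i = i + 1
    -- strlen() stops at the first NUL, so the rest of the chunk is lost.
    local nul = chunk:find("\0", 1, true)
    local seen = nul and chunk:sub(1, nul - 1) or chunk
    if seen:sub(-1) == "\n" then
      -- A newline is visible: the line is complete.
      if not keep_newline then seen = seen:sub(1, -2) end
      parts[#parts + 1] = seen
      return table.concat(parts), i
    end
    -- No newline visible: keep what was seen and fetch another chunk.
    parts[#parts + 1] = seen
  end
  -- End of input: return whatever was collected, or nil if nothing was.
  local rest = table.concat(parts)
  if #rest > 0 then return rest, i end
  return nil, i
end

local chunks = {
  "Hello\0world\n",
  " \0No, not really\n",
  "Francisco\0Olarte\n",
  "\n",
}
local line = model_read_line(chunks, 1, true)
io.write(line)  -- prints: Hello Francisco
```

The newline after each NUL is never seen, so the reader keeps
appending until it reaches the final bare "\n".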

> -- A patch has been suggested that fixes this, at the expense of some subtle behavior changes that only occur if you rely on the old NUL behavior

Several patches have been suggested for what some people, myself
included in the past, considered a bug. Subtle behaviour changes due
to bug fixes are not normally considered worth preserving for
backwards compatibility.

Now I have the mail open, and as I've been cited twice, I'd like to
restate that I withdrew my patch after finding the speed penalty, and I
do not consider it, or any of the others, adequate ( my first one is
slow; my second, using fscanf, is just library gymnastics, fun to share
to show what fscanf can do but not that useful; the one by Rene I
think depends on undocumented behaviour, and I find it ( sorry
Rene, just an opinion ) fugly ).

........
> Auxiliary argument:
> -- There are plenty of ways text file reading can fail (e.g. absurdly long lines) , this is just one of them. We can't fix them all so we should not fix this one.

I haven't tested it, but I think the concrete example you state (
absurdly long lines ) is handled correctly by Lua. It can read lines
as long as they fit in memory ( where 'fit in memory' does not mean
'4G on a 4G machine', just that a 32MB line may fail on a 32MB
machine, but work happily as soon as you do it on a, say, 256MB one
).


> All arguments come down to preference and philosophy; there is nothing LOGICALLY wrong with any of them. The auxiliary argument I personally feel is bogus; I'm surprised it was suggested here to be honest.

I do not think it's been raised. It's just been stated that a file
with nulls is not a text file, so the behaviour is undefined, of which
unexpected is a subclass.

> But there is a PRACTICAL issue here. Text files are EXTERNAL data, and are therefore outside the control of Lua and the developer. Arguing that it's not the programs fault it exploded because "you should not have fed it a non-text file" is bogus. Taken to the extreme, you might as well omit ALL error checking in code and just crash with "user error -- aborted" panics.

You normally need some error checking to crash with that message. The
problem I see with the null issue, as demonstrated partially by my
previous example, is that io does some subtle transformations to
non-text input files which may lead to problems for some people. If
you take my sample and use trusty old cat instead of lines.lua, on a
normal xterm, you get:
>>
folarte@paqueton:~/tmp$ echo -ne 'Hello\0world\n \0No, not really\nFrancisco\0Olarte\n\n' | cat
Helloworld
 No, not really
FranciscoOlarte
<<
This may be a source of hard to trace problems for some people.

> So how do you handle malformed text files?
> With Lua as written:
> 1. Open file in binary mode and scan it for embedded NUL characters. Fail if any found.
> 2. Reopen the file in text mode
> 3. Read lines, parsing and validating them as needed

This is only valid for files you have the luxury of opening yourself (
think of io.input(); also, some of us use systems where you can open
named pipes, char special devices, ... ), and there is nothing which
guarantees that opening the file in text mode will give you the same
results.

What you would need to do is open the file in binary mode, open a
temporary file on a rewindable medium ( I think you cannot use
tmpfile() as it does not state whether it opens text or binary ), use
the 'safe' read(number) functions ( *a is not ok as the input may be
larger than RAM ), read all the input, spooling it to the temp file
after checking, rewind the temp file and then text-read from it.
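
The spooling just described could look roughly like the following. It
is a minimal sketch, not a tested solution: spool_checked and BLOCK are
made-up names, the temp file name comes from os.tmpname(), and instead
of rewinding I close the temp file and reopen it in text mode, which is
my own variation. Write errors are not checked here.

```lua
-- Sketch only: copy 'src' (a handle someone else already opened, ideally
-- in binary mode) to a temporary file, refusing input that contains NUL.
-- spool_checked and BLOCK are hypothetical names for this example.
local BLOCK = 4096

local function spool_checked(src)
  local path = os.tmpname()
  local tmp, err = io.open(path, "wb")
  if not tmp then return nil, err end
  while true do
    local block = src:read(BLOCK)  -- counted reads keep embedded NULs
    if not block then break end    -- nil means end of input
    if block:find("\0", 1, true) then
      tmp:close()
      os.remove(path)
      return nil, "input contains NUL"
    end
    tmp:write(block)
  end
  tmp:close()
  -- Reopen in text mode for line reading; the caller removes 'path' later.
  local checked, open_err = io.open(path, "r")
  if not checked then
    os.remove(path)
    return nil, open_err
  end
  return checked, path
end
```

Called as spool_checked(io.input()), it returns a handle you can loop
over with lines() plus the temp file name, or nil and a message when a
NUL turns up. The whole input is copied once, which is the price of
checking a stream you cannot reopen.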

....

> So there are only two questions to answer:
> (1) Is the patch a significant improvement?
> (2) Is it going to be adopted?
>
> I think the answer to (1) is yes, and the answer to (2) is no. I've not seen any good, unbiased arguments as to why the answer to (1) would be no.

I do not think the patch would be a significant improvement. At least
for me, after the time spent reading liolib for this thread, I would
not use it for serious file processing, just for very controlled
environments where I think the current one is ok, especially after the
inclusion of the note about null safety in *l.

Francisco Olarte.