Re: io:lines() and \0

On Feb 20, 2014, at 22:51 , Sean Conner wrote:

It was thus said that the Great Francisco Olarte once stated:

I should have made it more explicit. If you read 'ab\000cd\012' on
unix, you end up with a buffer containing 'ab\0cd\n'. In C you are not
going to be able to distinguish it with str*, ok, that's a C problem,
that's why you never use fgets when you need to be null resistant in
C.

And what I (and a few others) are arguing is that C makes a distinction
between a text file and a binary file. Using functions meant for text
files on binary files is not specified to return meaningful results.

It makes the distinction that the internal and external storage may be
different, which is mostly due to line endings, and does not specify it
further.

The reason for the distinction is that different systems used different
methods to mark the end of a line, and different methods to mark the end of
a text file. The distinction let the standard settle on a known set of
behaviors across wildly different systems [2] without breaking too much
existing code.

These implementation details are supposed to be handled by the system
"kernel" and C library.

The fact that it *almost* works in C is irrelevant. Lua is targeted
towards C89, and expecting functions that work for text files to work for
binary files is expecting too much. And granted, the Lua documentation
should probably mention this.

It works perfectly well in C.

I could tolerate if it interpreted '\0' as '\n', heck, I did tolerate
MSC discarding \015 ( which is not the same as mapping '\015\012' to
'\n' ), but reading past the null and then discarding the chars is too
much.

How is discarding '\015' any different from mapping "\015\012" to "\012"?

Isn't it obvious? The simplest example: '\015' => ''. Some more:
'ab\015cd\015\012' => 'abcd\n' (discarding) vs. 'ab\rcd\n' (mapping)

I think this is where we differ---if I know I'm reading a binary file, I
open as a binary file, and avoid fgets(), since it's a binary file---either
it has no structure so using fgets() is silly (and I use fread() or
fgetc()), or it has a (to me) known structure, so using fgets() is still
silly (and I use fread() or fgetc()).

The standard does NOT say fgets is only for streams in text mode.

The problem with that is if the file in question has multiple NUL byte
runs (enough to fill a buffer, or even an unfortunate alignment where the
last byte read in the buffer is NUL).

Not an issue. If C guaranteed me fgets would not touch the buffer
after the null, I could fill it with ones, and as I know it MUST have a
null at the end I could scan backwards; the first null found is the
terminating null.

Sigh. That *still* wouldn't work. Assume (for sake of argument) a buffer
size of 8 bytes. You fill it with all ones (0xFF):

FF FF FF FF FF FF FF FF

And you read the following binary file using your version of fgets():

34 89 00 FF 23 08 FF FF

So the buffer now contains:

34 89 00 FF 23 08 FF FF

and thus you return:

34 89 00 FF 23 08

which is *NOT* the correct data (it's truncated).

Your example is again wrong; the standard explicitly says:

The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s.

and

A null character is written immediately after the last character read into the array.

So the buffer would be:

34 89 00 FF 23 08 FF 00

and one would continue reading until \n or EOF. The current Lua code already does it.

Actually, the current Lua code would likely crash with your hypothetical buffer, because it already scans for \0 and thus would overrun your buffer if it is not \0 terminated.

A binary file can be expected to have any value, so any value you use as a
"filler" can lead to data truncation (I'm not saying it always will lead to
data truncation, but that it can).

A text file can likewise be expected to have any value; the only difference is what the standard says here:

The external representations in a text file need not be identical to the internal representations, ...

But I would bet that one day after putting this in the wild someone feeds
it to a library which, say, helpfully zeroes the whole buf before
reading to aid in debugging.

Nah, for debugging purposes, you fill memory (via malloc() or
on the stack) with a non-0 pattern [1].

<I> do, and possibly <you> and <a lot of people> do, using your quoted
0xCC, the typical 0xdeadbeef or 0xa5a5a5a5, but I'm nearly sure there
is one which zeroes it.

As I mentioned, I pick the value depending on the CPU architecture, with
an eye towards crashing if the value(s) is(are) executed, used as an index,
as a pointer, or printed (not likely for printing, but seeing odd results is
still helpful).

-spc

[1] NOT USED HERE

[2] For some real fun, check out old computer related magazines [3]
prior to 1989 (ratification of the ANSI C Standard).

[3] https://archive.org/details/computermagazines A good one would be
Byte Magazine [4].

[4] https://archive.org/details/byte-magazine