Re: io:lines() and \0

lua-l archive

Subject: Re: io:lines() and \0
From: Sean Conner <sean@...>
Date: Wed, 19 Feb 2014 12:26:00 -0500

It was thus said that the Great René Rebe once stated:
> 
> On Feb 19, 2014, at 14:59 , steve donovan wrote:
> 
> > On Wed, Feb 19, 2014 at 3:47 PM, Enrico Colombini <erix@erix.it> wrote:
> >> As has been noted, text functions (regardless of what their behaviour on
> >> observed systems may be) only guarantee that printable characters, leading
> >> spaces and the few control characters listed by the standard are preserved:
> >> the printable information.
> > 
> > 
> > +1.  I would not expect any other interpretation, and we understand
> > that Lua is built on standard C and its limitations.  Otherwise it
> > will grow its own libc and then it will no longer be the little beast
> > we know and love.
> 
> I do not really understand how there can be so much argument for something
> as simple as this. This bug only exists because this old-fashioned C function
> does not return how much data was read. If this C function would return a
> pointer to the end of the data (like the new getline(3)) we would not even have
> a discussion now. Amazing how much energy goes into preventing other
> people from improving something, from fixing an issue.

  First off, you are attempting to read a binary file (one containing data
other than printable characters and the few control characters defined in
the C Standard) as a text file, and the C Standard does not define what
happens in that case.  Just because you are experiencing some particular
behavior with one C runtime library doesn't mean all C runtime libraries
behave the same. [1]

> Actually the C function behaves properly, just that Lua makes no attempt to
> determine how much was actually read. 

  Not really.  Given the following file:

00000000: 54 68 65 20 71 75 69 63 6B 20 00 72 6F 77 6E 20 The quick .rown 
00000010: 66 6F 78 20 6A 75 6D 70 73 20 6F 76 65 72 20 74 fox jumps over t
00000020: 68 65 20 6C 61 7A 79 20 64 6F 67 2E 0A 54 68 65 he lazy dog..The
00000030: 20 71 75 69 63 6B 20 00 72 6F 77 6E 20 66 6F 78  quick .rown fox
00000040: 20 6A 75 6D 70 73 20 6F 76 65 72 20 74 68 65 20  jumps over the 

  Note that this file *could* be said to have two lines, each with a NUL
byte in the 11th position.  I say "could" because if you look closely, you
will note that the last "line" does *not* contain an end-of-line marker,
which is important for this test.  

  Okay, so let's read in the first line with fgets():

00000000: 54 68 65 20 71 75 69 63 6B 20 00 72 6F 77 6E 20 The quick .rown 
00000010: 66 6F 78 20 6A 75 6D 70 73 20 6F 76 65 72 20 74 fox jumps over t
00000020: 68 65 20 6C 61 7A 79 20 64 6F 67 2E 0A 00 00 00 he lazy dog.....
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

  Okay, so fgets() stops only at '\n' and EOF, not at the NUL byte, *on my
particular C runtime*.  And we can do the memchr() for the trailing '\n' and
get the length.  Now let's read in the next line:

00000000: 54 68 65 20 71 75 69 63 6B 20 00 72 6F 77 6E 20 The quick .rown 
00000010: 66 6F 78 20 6A 75 6D 70 73 20 6F 76 65 72 20 74 fox jumps over t
00000020: 68 65 20 00 61 7A 79 20 64 6F 67 2E 0A 00 00 00 he .azy dog.....
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

  Hmm ... okay ... fgets() read the "line" (even though it wasn't terminated
as a line).  But how do we determine the actual end of the line?  The last
byte of the line is at offset 0x22 of the buffer, and fgets() wrote its
terminating NUL at 0x23, but the tail of the previous line is still sitting
after that.  If we check for the '\n', we'll find the stale one at offset
0x2C and get garbage at the end of the "line".
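
  The stale-newline problem can be reproduced with a short C sketch.  This
is only an illustration under assumptions: guess_length() and
second_line_guess() are hypothetical names, the buffer is 64 bytes as in the
dumps, and the result depends on fgets() leaving the bytes past its
terminating NUL untouched, which the C Standard does not promise.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: guess how many bytes fgets() stored by searching
   the whole buffer for '\n', falling back to strlen() if there is none. */
size_t guess_length(const char *buf, size_t size)
{
  const char *nl;

  assert(buf != NULL && size > 0);
  nl = memchr(buf, '\n', size);
  return (nl != NULL) ? (size_t)(nl - buf) + 1 : strlen(buf);
}

/* Write the two test "lines" to a temporary file, read both into the same
   buffer with fgets(), and return the guessed length of the second one.
   Returns 0 if the temporary file cannot be set up or read. */
size_t second_line_guess(void)
{
  static const char data[] =
    "The quick \0rown fox jumps over the lazy dog.\n"
    "The quick \0rown fox jumps over the ";
  char   buf[64];
  size_t len = 0;
  FILE  *fp  = tmpfile();

  if (fp == NULL)
    return 0;

  memset(buf, 0, sizeof buf);  /* zeroed once, not between reads */
  if (fwrite(data, 1, sizeof data - 1, fp) == sizeof data - 1) {
    rewind(fp);
    if (fgets(buf, sizeof buf, fp) != NULL      /* first line, 45 bytes */
        && fgets(buf, sizeof buf, fp) != NULL)  /* second, 35 bytes, no '\n' */
      len = guess_length(buf, sizeof buf);
  }
  fclose(fp);
  return len;
}
```

  On a runtime that behaves like the one used for these tests,
second_line_guess() should come back as 45 even though the second fgets()
only delivered 35 bytes, because memchr() lands on the '\n' left over from
the first line.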

  Okay, but what about zeroing out the buffer prior to use?  Okay, we can do
that, but:

00000000: 54 68 65 20 71 75 69 63 6B 20 00 72 6F 77 6E 20 The quick .rown 
00000010: 66 6F 78 20 6A 75 6D 70 73 20 6F 76 65 72 20 74 fox jumps over t
00000020: 68 65 20 00 00 00 00 00 00 00 00 00 00 00 00 00 he .............
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

  Without prior knowledge (knowing that this is the last line, for
instance), how do we "know" that the "line" only has one NUL byte and not 30
NUL bytes?  Or was it 2 NUL bytes?
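
  For contrast, here is a minimal sketch of a reader that counts bytes as
it goes instead of reconstructing the length afterwards.  It is not what
Lua does and not a proposed patch; read_line_counted() is a hypothetical
name, and it uses nothing beyond getc() from standard C.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy bytes from fp into buf until '\n', EOF, or
   the buffer is full.  Returns the number of bytes stored, including the
   '\n' if one was read.  NUL bytes are stored like any other byte and the
   result is NOT NUL-terminated; the return value is the length. */
size_t read_line_counted(FILE *fp, char *buf, size_t size)
{
  size_t n = 0;
  int    c;

  assert(fp != NULL && buf != NULL && size > 0);
  while (n < size && (c = getc(fp)) != EOF) {
    buf[n++] = (char)c;
    if (c == '\n')
      break;
  }
  return n;
}
```

  Because the count comes from the reads themselves, leftover buffer
contents and trailing NUL bytes are no longer ambiguous.  It does nothing
about the text mode versus binary mode question, though: on a system that
translates or truncates data on text streams, getc() only ever sees what the
runtime chooses to hand over.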

> And instead of just fixing this bug we
> end up arguing -Steve Jobs like- that I am holding the binary text file wrong.
> 
> And again, most systems including Linux and Mac ignore this text mode
> nonsense to start with and simply treat all files as binary files no matter
> how long you want to call some files text files.

  There is one system, which arguably has a larger share of the market than
both Linux and Mac combined, that does *NOT* ignore the difference between
text and binary modes [3].  So this isn't quite as theoretical as it
appears.

  -spc (I was actually bitten by a text mode/binary mode difference 
	rather recently, because of that third system ... )

[1]	But to be fair, most implementations would probably work similarly,
	operating system details aside. I just checked the implementation of
	fgets() for the Small-C [2] runtime library and it looks like it
	would behave the same as the GNU libc I'm using for these tests.

[2]	I'm not sure of the age of the actual implementation I have, but the
	original Small C compiler was released in 1980.

[3]	And if you think a NUL byte shouldn't be problematic in text mode,
	try reading SUB (character 26) with that third system which shall
	remain nameless, but whose initials are Microsoft Windows.
