Re: io:lines() and \0

lua-l archive

Subject: Re: io:lines() and \0
From: Francisco Olarte <folarte@...>
Date: Thu, 20 Feb 2014 21:24:47 +0100

Hi Sean:

On Thu, Feb 20, 2014 at 7:29 PM, Sean Conner <sean@conman.org> wrote:
> It was thus said that the Great Francisco Olarte once stated:
>> understood, but not forgiven. For me fgets(buf, size, file) should be
>> equivalent to a getc loop with some checks for size, \n and EOF.
>> And, from what we've seen on this thread, it seems the libc
>> implementations do it that way. It is the lua lib which does not.
>   No, the Lua lib uses fgets().

I should have been more explicit. If you read 'ab\000cd\012' on unix
with fgets, you end up with a buffer containing 'ab\0cd\n'. In C you
are not going to be able to tell the embedded nul from the terminating
one with the str* functions. OK, that's a C problem, and it is why you
never use fgets in C when you need to be nul-safe.
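
A minimal sketch of the effect described above, assuming a hosted C
environment where tmpfile() works. The helper name
fgets_visible_length is made up for illustration; it is not part of
Lua or of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the 6 bytes 'a','b','\0','c','d','\n' to a
   temporary file, reads them back with fgets(), and returns what strlen()
   reports for the buffer. Returns (size_t)-1 if the file cannot be used. */
size_t fgets_visible_length(char *buf, int size)
{
    static const char data[] = { 'a', 'b', '\0', 'c', 'd', '\n' };
    FILE *f = tmpfile();
    size_t n = (size_t)-1;

    if (f == NULL)
        return n;
    if (fwrite(data, 1, sizeof data, f) == sizeof data) {
        rewind(f);
        if (fgets(buf, size, f) != NULL)
            n = strlen(buf);   /* stops at the embedded nul */
    }
    fclose(f);
    return n;
}
```

With a 16-byte buffer the whole line is in buf, but strlen() only sees
the 'ab' prefix, and fgets() gives no other way to learn the length.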

The lua io lib uses fgets, but does not need to. Personally, I would
expect a language like lua, which supports strings with embedded nuls,
to handle this and give me back 'ab\0cd\n', not 'ab' with everything
else missing ( you could even think you were at end of file if you
relied on this kind of hint ). This is why I proposed my trivial loop,
which I thought would be better. I still think it is, as I doubt it
would make a difference in real use cases, seeing that my modest
2.4GHz Core 2 was able to process 83 Mb of cached data in 0.8 secs.
But, as maintaining the performance seems to be more important than
correctly processing nuls ( which may not be so important for a lot of
people ), I backpedaled.
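
For reference, a getc loop of the kind mentioned above could look like
this. It is only a sketch: getc_line is a hypothetical name, and this
is not the loop posted earlier in the thread nor the code in liolib.c.

```c
#include <stdio.h>

/* Hypothetical helper: reads at most size bytes of one line into buf.
   Stops after '\n', at EOF, or when buf is full. Returns the number of
   bytes stored, so an embedded nul counts like any other byte.
   The buffer is NOT nul-terminated; the returned length is the only
   way to know where the data ends. */
size_t getc_line(FILE *f, char *buf, size_t size)
{
    size_t n = 0;
    int c;

    while (n < size && (c = getc(f)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;
    }
    return n;
}
```

A return value of 0 ( with size > 0 ) means EOF or error; a full buffer
whose last byte is not '\n' means the line continues on the next call.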

> The issue is that fgets() returns a
> pointer, not a size, and thus, any embedded '\0' in the data are
> problematic, because in C, strings are terminated by '\0'.

As I said before, the issue is that lua uses fgets because it prefers
speed over nul safety. We all know by now how to do it with getc.
There are other alternatives, like managing the buffering inside lua
( if I did not misread the source, the team already has more than half
the infrastructure in place in lzio* for feeding the lexer ), but this
would complicate things, introduce potential bugs, and not be thread
safe. Then again, neither is the current implementation, which is
logical when targeting thread-less ANSI C: read_number seems thread
safe ( single fscanf call ) and read_chars seems so too ( single fread
call ), but in both read_all ( multiple fread ) and read_line
( multiple fgets ) someone could squeeze in between calls, leading to
missing blocks in the middle. I do not think doing this is worth the
hassle.
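
To illustrate the "someone could squeeze in between calls" point: on
POSIX systems several stdio calls can be made atomic with respect to
other threads by holding the stream lock around them. The sketch below
assumes POSIX flockfile() / getc_unlocked() and a caller-supplied fixed
buffer; locked_getc_line is a hypothetical name, not Lua code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for flockfile, getc_unlocked */
#endif
#include <stdio.h>

/* Hypothetical helper: same contract as a plain getc loop (returns the
   number of bytes stored, nuls included, no terminator), but the stream
   lock is held for the whole line, so other threads using the same
   FILE cannot take bytes out of the middle of it. */
size_t locked_getc_line(FILE *f, char *buf, size_t size)
{
    size_t n = 0;
    int c;

    flockfile(f);
    while (n < size && (c = getc_unlocked(f)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;
    }
    funlockfile(f);   /* nothing inside the loop can jump past this */
    return n;
}
```

The lock only protects one call; a line longer than the buffer still
needs several calls, and another thread could read between them.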

The only thing I could propose, given that lua already sidesteps ANSI
a bit ( LUA_USE_POPEN / LUA_WIN ) and we have a nice macro to test for
POSIX ( LUA_USE_POSIX ), is to write POSIX versions of the affected
functions using getline ( although it seems to need a recent POSIX,
according to my man pages ), and document it. I could try to use
unlocked_stdio for another version, but this is not an easy task,
given that luaL_prepbuffer can longjmp out on error and fsck up the
locking.
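
A rough idea of what getline buys, assuming POSIX.1-2008 getline() is
available. count_line_bytes is a hypothetical name used only for this
sketch; a real binding would hand each ( line, len ) pair to
lua_pushlstring instead of summing lengths.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for getline */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: reads every remaining line of f with getline(),
   stores the number of lines in *lines and returns the total number of
   bytes seen. getline() returns the length of each line, so embedded
   nuls are counted instead of cutting the line short. */
size_t count_line_bytes(FILE *f, size_t *lines)
{
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;
    size_t total = 0;
    ssize_t len;

    *lines = 0;
    while ((len = getline(&line, &cap, f)) != -1) {
        total += (size_t)len;
        ++*lines;
    }
    free(line);
    return total;
}
```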

Another possibility may be trusty fscanf. If we do something like
'fscanf(f, "%N[^\n]%n", p, &len)', with N being bufsize-1 written into
the format ( scanf has no '*' width argument as printf does; there '*'
suppresses the assignment ), it may do a similar job to getline() or
fgets() while also returning the consumed length, but I'm not sure
ANSI demands supporting this. It has some problems, but I think I've
got a solution which could be nicely packed. Stay tuned, as it will
make for a long message.
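
One way the fscanf idea could be sketched, NOT the solution announced
above ( which is not part of this message ). Assumptions: the width is
hard-coded to 255, the scanset accepts nul bytes on the C library in
use ( this is exactly the point that may not be guaranteed ), and
scanf_line is a hypothetical name.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to 255 bytes of the current line into buf
   ( which must hold at least 256 bytes ), not including the '\n'.
   Stores the byte count in *len. Returns 1 if anything was consumed
   ( data or a newline ), 0 at EOF. */
int scanf_line(FILE *f, char *buf, size_t *len)
{
    int n = 0;   /* stays 0 if the scanset matches nothing: %n is skipped */
    int c;

    *len = 0;
    /* %[^\n] fails on an empty line, so its result must be checked. */
    if (fscanf(f, "%255[^\n]%n", buf, &n) == 1)
        *len = (size_t)n;

    c = getc(f);
    if (c == '\n')
        return 1;        /* end of line reached, newline consumed */
    if (c != EOF)
        ungetc(c, f);    /* buffer was full, rest of line still pending */
    return *len > 0;
}
```

Two of the problems are visible here: an empty line needs the extra
getc, and the width cannot be passed as an argument.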

>> I could tolerate if it interpreted '\0' as '\n', heck, I did tolerate
>> MSC discarding \015 ( which is not the same as mapping '\015\012' to
>> '\n' ), but reading past the null and then discarding the chars is too
>> much.
>
>   How is discarding '\015' any different from mapping "\015\012" to "\012"?

Isn't it obvious? The simplest example: '\015' => ''. Another one:
'ab\015cd\015\012' gives 'abcd\n' when discarding vs. 'ab\rcd\n' when
mapping.
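
The two behaviours side by side, as a small sketch. Both function
names are made up for illustration and take an explicit length so that
nothing depends on nul termination.

```c
#include <stddef.h>

/* Drops every '\r' byte ( the "discarding \015" behaviour ).
   Works in place and returns the new length. */
size_t discard_cr(char *s, size_t len)
{
    size_t i, out = 0;

    for (i = 0; i < len; i++)
        if (s[i] != '\r')
            s[out++] = s[i];
    return out;
}

/* Maps only the pair "\r\n" to "\n"; a lone '\r' is kept.
   Works in place and returns the new length. */
size_t map_crlf(char *s, size_t len)
{
    size_t i, out = 0;

    for (i = 0; i < len; i++) {
        if (s[i] == '\r' && i + 1 < len && s[i + 1] == '\n')
            continue;    /* skip the '\r'; the '\n' is copied next */
        s[out++] = s[i];
    }
    return out;
}
```

On the 7 bytes 'ab\rcd\r\n', discard_cr leaves 5 bytes ( 'abcd\n' )
and map_crlf leaves 6 ( 'ab\rcd\n' ).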

>> I think the main problem with lua now would be it does not clearly
>> specify that files with embedded nuls are not safe to read by lines.
>   I'm not even sure the C Standard covers that.

Well, getting a corrigendum, or whatever it is called, to the C
standard seems way more difficult than getting Roberto and his team to
include a 'beware of nuls in input' note in the lua reference manual.

>> And it is a shame the C library does not say anything about whether
>> fgets() modifies any part of the buf PAST the null it inserted,
>> otherwise we could use memset(anything) and then search for the nul
>> from the end of the string:
>   The problem with that is if the file in question has multiple NUL byte
> runs (enough to fill a buffer, or even an unfortunate alignment where the
> last byte read in the buffer is NUL).

Not an issue. If C guaranteed that fgets would not touch the buffer
after the nul, I could fill it with ones and, as I know it MUST have a
nul at the end of the data, scan backwards: the first nul found is the
terminating one.
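
A sketch of that backwards scan. It rests entirely on the assumption
stated above, that fgets() leaves the bytes after its terminator
untouched, which the C standard does not promise. fgets_len is a
hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: calls fgets() and returns the number of bytes it
   stored before its terminating nul, or (size_t)-1 if fgets() returned
   NULL ( EOF or error ) or size is not positive.
   ASSUMPTION, not guaranteed by the C standard: fgets() does not write
   past the terminating nul. */
size_t fgets_len(char *buf, int size, FILE *f)
{
    size_t i;

    if (size < 1)
        return (size_t)-1;
    memset(buf, 1, (size_t)size);      /* prefill with non-zero bytes */
    if (fgets(buf, size, f) == NULL)
        return (size_t)-1;
    for (i = (size_t)size; i-- > 0; )  /* scan from the end */
        if (buf[i] == '\0')
            return i;                  /* last nul is the terminator */
    return (size_t)-1;                 /* not reached if fgets behaved */
}
```

Nuls inside the data, even as the last byte read, do not matter: they
always sit before the terminator, and everything after it is still 1.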

>> But i would bet one day after putting this on the wild someone fires
>> it to a library which, say, helpfully zeroes the whole buf before
>> reading to aid in debug.
>   Nah, for debugging purposes, you fill memory (via malloc() or
> on the stack) with a non-0 pattern [1].

<I> do, and possibly <you> and <a lot of people> do, using your quoted
0xCC, the typical 0xdeadbeef or 0xa5a5a5a5, but I'm nearly sure there
is one which zeroes it.

> [1]     0xCC on x86; 0xA5 on just about anything else.  Why?  On x86, 0xCC
>         is INT 3 (single byte instruction), which will be caught by the OS.

Only if executed, but it's good anyway.

Francisco Olarte.
