Re: io:lines() and \0

On Feb 21, 2014, at 24:29 , Philipp Janda wrote:

Hi!

This seems to be a fun discussion ... :-)

yes, a LOT of fun ;-!

On 20.02.2014 21:03, Dirk Laurie wrote:
2014-02-20 21:44 GMT+02:00 René Rebe <rene@exactcode.de>:

The discussion is about lines(); that it uses fgets is just an
implementation detail.

If Roberto had not more or less implied, with his Bible test case, that a
performance loss is not acceptable, then an fgetc() loop without all this
trouble would have been perfectly fine for me, too.
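
To make the difference concrete, here is a minimal sketch in Python that only simulates the two C approaches; it is not the Lua source. It assumes the fgets() caller determines the line length with strlen(), which stops at the first NUL. The function names are hypothetical.

```python
import io

def read_line_c_string(stream):
    """Simulate fgets() + strlen(): the length ends at the first NUL."""
    raw = stream.readline()            # bytes up to and including "\n"
    if not raw:
        return None                    # end of file
    visible = raw.split(b"\0", 1)[0]   # all that strlen() would report
    return visible.rstrip(b"\n")

def read_line_bytewise(stream):
    """Simulate an fgetc() loop: every byte is kept and counted."""
    out = bytearray()
    while True:
        ch = stream.read(1)
        if not ch:                     # end of file
            return bytes(out) if out else None
        if ch == b"\n":
            return bytes(out)
        out += ch

data = b"name\0value\nnext\n"
print(read_line_c_string(io.BytesIO(data)))   # b'name'
print(read_line_bytewise(io.BytesIO(data)))   # b'name\x00value'
```

In the first variant everything between the NUL and the newline is silently dropped; the byte-wise loop keeps it because it tracks the length itself.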

I can certainly give up improving vanilla Lua and convincing some that
random data loss is usually considered a bug, and live very happily with the
fix that works for me just fine.

Have fun parsing MIME, CGI data, or financial program exports that use \0
as a field delimiter. Or wherever else a zero comes along.

It is useful to look again at the start of the post where it all started.

I just noticed that io:lines() does not cope with \0 in the lines

Allow me to summarize the facts.

1. io.lines operates on text files.

`io.lines` operates on any file you throw its way. It *opens* the files as text streams, but that is something you will only find out if you read the source code. The manual does not specify this (except for `io.lines()` without arguments, which uses `io.input`). Apparently `file:lines` does not raise an error when used on a binary file object either (which is a good thing for anybody using non-ASCII characters) ...

I may be wrong, but isn't there some 16-bit encoding where every other byte is zero for ASCII characters (UCS-2, UTF-16, or something)?

I think that is UTF-16, and I would assume that for the ASCII subset one half of each 16-bit word is 0.
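
This is easy to check. A small sketch, using Python's built-in codecs purely for illustration (the sample string is arbitrary):

```python
text = "Lua"

# Little-endian UTF-16 without a byte order mark: the low byte comes first.
utf16_le = text.encode("utf-16-le")
# Big-endian UTF-16: the zero high byte comes first.
utf16_be = text.encode("utf-16-be")

print(utf16_le)              # b'L\x00u\x00a\x00'
print(utf16_be)              # b'\x00L\x00u\x00a'
print(utf16_le.count(b"\0")) # 3
```

So a plain-ASCII text stored as UTF-16 has a zero in every other byte, and any reader that stops at \0 loses most of each line.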

2. Text files may not contain any control character except whitespace.

That is your definition. The Lua manual does not contain that (or any other) definition. AFAIK even ISO C does not have a definition. I found one for POSIX[1], but that one is different from yours.
It is true that Lua cannot do better than the underlying C library, but ISO C does not forbid the C library to do better than the lowest common denominator specified by ISO C for text streams.

I also think that defining text files for Lua won't help much, because you can only verify that a file is a proper text file by opening it in binary mode and checking every character. And the alternative (silent data loss) may be difficult to detect from within a program …
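
For what it is worth, such a check could look roughly like the sketch below (Python, for illustration only). It assumes the definition from point 2, i.e. control characters other than whitespace are not allowed; the helper name and the exact set of allowed bytes are my assumptions.

```python
import os
import tempfile

ALLOWED_CONTROLS = b"\t\n\r\f\v"   # whitespace controls, assumed to be allowed

def first_suspicious_byte(path):
    """Return (offset, byte) of the first non-whitespace control byte, or None."""
    offset = 0
    with open(path, "rb") as f:            # binary mode: nothing gets translated
        while True:
            chunk = f.read(65536)
            if not chunk:
                return None                # reached end of file, nothing found
            for i, b in enumerate(chunk):
                if (b < 0x20 or b == 0x7F) and b not in ALLOWED_CONTROLS:
                    return offset + i, b
            offset += len(chunk)

# Usage: a file with an embedded NUL in its second line.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ok\nbad\0line\n")
print(first_suspicious_byte(tmp.name))     # (6, 0)
os.remove(tmp.name)
```

Note that this reads the whole file once before the real work starts, which is exactly the extra pass nobody wants to pay for.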

The ISO C draft I found explicitly lists only:

The external representations in a text file need not be identical to the internal representations, …

It does not further restrict the storage of \0 or other control characters. My understanding is that this was mostly intended for line endings.

3. \0 is not whitespace.

That one we agree on.

In other words, the behaviour complained of is that a standard library
routine when given data that does not conform to specification gives
undefined results.

Currently there is no relevant specification other than the source code or a collection of mailing list posts.

Regarding performance: If I needed maximum read performance I would bind `mmap`. I think `lua-apr`[2] contains a file-like binding; does anyone know of any other? But I suspect that all this performance would be wasted anyway: text files[*] usually don't get that big (unless you mis-configure `logrotate`).
I don't have the Bible installed, but the largest text file I could find on my computer is the `ngerman` dictionary with 4.3 MB. My largest logfile in `/var/log/*` is 1 MB (`kern.log.1`) …
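
To illustrate the `mmap` idea without a Lua binding at hand, here is a rough sketch using Python's standard `mmap` module. It is not a benchmark and says nothing about `lua-apr`; the function name is hypothetical.

```python
import mmap
import os
import tempfile

def mapped_lines(path):
    """Yield the lines of a file through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                yield line.rstrip(b"\n")

# Usage: the embedded NUL survives because lengths are explicit.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\0b\nc\n")
print(list(mapped_lines(tmp.name)))   # [b'a\x00b', b'c']
os.remove(tmp.name)
```

Since the mapping is just memory with a known size, a \0 in the data is a byte like any other here.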

My latest patch does not even alter the performance much. The difference is hard to measure on a Turbo Boost Intel Core with the famous Bible test case.

Philipp

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_397
[2]: http://peterodding.com/code/lua/apr/docs/#shared_memory

[*]: Files primarily intended to be read by humans.

--
ExactCODE GmbH, Jaegerstr. 67, DE-10117 Berlin
http://exactcode.com | http://exactscan.com | http://ocrkit.com | http://t2-project.org | http://rene.rebe.de