- Subject: Re: io:lines() and \0
- From: Tom N Harris <telliamed@...>
- Date: Fri, 21 Feb 2014 03:06:49 -0500
On Friday, February 21, 2014 12:29:37 AM Philipp Janda wrote:
> I may be wrong, but isn't there some 16-bit encoding where every other
> byte is zero for ASCII characters (UCS-2, UTF-16, or something)?
>
That may be the case, but trying to read such an encoding will get you in
trouble because the CR+LF is represented in 16-bit characters also. So a line
would be terminated with the bytes (in little-endian mode) 0D 00 0A 00. fgets
will stop right after the 0A, in the middle of a character, then the next read
will start with 00 and the rest of your text is garbled.
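To make that concrete, here is a rough sketch in C. It assumes the text "A", CR, LF, "B" encoded as UTF-16LE and written to a temporary file; the function name fgets_utf16_demo is made up for illustration. It reports how many bytes fgets consumed and which byte the next read would begin with.

```c
#include <stdio.h>

/* Hypothetical demo: write "A\r\nB" as UTF-16LE to a temporary file,
   read one "line" with fgets, then report how many bytes were consumed
   and which byte the following read would start with.
   Returns 0 on success, -1 on any I/O error. */
int fgets_utf16_demo(long *consumed, int *next_byte)
{
    static const unsigned char text[] = {
        0x41, 0x00,   /* 'A' */
        0x0D, 0x00,   /* CR  */
        0x0A, 0x00,   /* LF  */
        0x42, 0x00    /* 'B' */
    };
    char buf[64];
    FILE *f = tmpfile();   /* binary update mode, so ftell counts bytes */

    if (f == NULL)
        return -1;
    if (fwrite(text, 1, sizeof text, f) != sizeof text) {
        fclose(f);
        return -1;
    }
    rewind(f);
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    *consumed = ftell(f);    /* position just past the 0A byte */
    *next_byte = fgetc(f);   /* second half of the 16-bit LF character */
    fclose(f);
    return 0;
}
```

With this input fgets should consume five bytes (41 00 0D 00 0A), so the next read starts on the 00 that belongs to the LF character and everything after it is off by one byte.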
In no text encoding (that I know of) where an end-of-line is just 0A, and thus
can be read by fgets, does a valid string contain 00. Anything else must be
treated as not-text even if it is an encoding of text. Otherwise you'd break
the encoding as shown above.
There isn't a reliable way to recover a complete "line" that may contain any
byte using fgets. Any trick with padding will always run into the corner case
of a file ending with the padding character and no EOL. The replacements such
as getline are not part of C89. So to make the io library read those lines
(not only io.lines but file:read"*l") it would have to forgo fgets and read
the file a byte at a time. Reading individual characters from a stream is
notoriously slow. So if you don't know that your file contains only text in a
simple encoding, you should treat it like arbitrary data and read large chunks
into Lua then split them.
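The same "read a big chunk, then split it" idea can be sketched on the C side. This is only an illustration, not how the io library does anything: next_line is a hypothetical helper, and it assumes the chunk is already in memory (for example from a single fread) and that lines end with a bare 0A. Because it carries explicit lengths instead of relying on a terminator, embedded 00 bytes survive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the next line inside
   buf[0..len), starting at offset *pos. The line length, not counting
   the '\n', is stored in *line_len and *pos is advanced past the
   newline. A final line without a trailing '\n' is still returned.
   Returns NULL once the whole buffer has been consumed. */
const char *next_line(const char *buf, size_t len,
                      size_t *pos, size_t *line_len)
{
    const char *start;
    const char *nl;

    if (*pos >= len)
        return NULL;

    start = buf + *pos;
    /* memchr takes an explicit length, so 00 bytes do not stop the scan */
    nl = memchr(start, '\n', len - *pos);
    if (nl != NULL) {
        *line_len = (size_t)(nl - start);
        *pos += *line_len + 1;
    } else {
        *line_len = len - *pos;
        *pos = len;
    }
    return start;
}
```

Call it in a loop with pos starting at 0 until it returns NULL. A chunk boundary that falls in the middle of a line still has to be handled by the caller, for instance by carrying the unterminated tail over to the next chunk.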
The "fix", as was mentioned some days ago, is to add a note to the manual that
the line reading functions don't work if the line to be read may contain
non-text characters such as NUL, CR, or Ctrl+Z. In other words:
A man said to the doctor, "It hurts when I move my arm like this."
Said the doctor, "Then don't do that."
--
tom <telliamed@whoopdedo.org>