

On Friday, February 21, 2014 12:29:37 AM Philipp Janda wrote:
> I may be wrong, but isn't there some 16-bit encoding where every other
> byte is zero for ASCII characters (UCS-2, UTF-16, or something)?
> 

That may be the case, but trying to read such an encoding will get you in 
trouble because the CR+LF is represented as 16-bit characters too. So a line 
would be terminated with the bytes (in little-endian order) 0D 00 0A 00. fgets 
will stop after the 0A, splitting that 16-bit character in half; the next read 
then starts with the leftover 00 and the rest of your text is garbled.
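For illustration (the file name is made up, and exactly what comes back 
depends on how the line reader is implemented, which is the point of this 
thread), you can write those bytes from Lua and look at what read"*l" returns:

    -- "AB\r\nCD\r\n" encoded as UTF-16LE: 41 00 42 00 0D 00 0A 00 43 00 ...
    local f = assert(io.open("utf16-demo.txt", "wb"))
    f:write("A\0B\0\r\0\n\0C\0D\0\r\0\n\0")
    f:close()

    f = assert(io.open("utf16-demo.txt", "rb"))
    local line = f:read("*l")
    -- With an fgets-based reader, the 0A in the middle of the 16-bit newline
    -- ends the "line" and the embedded 00 bytes confuse the length, so the
    -- next read starts on the stray 00 before 'C'.
    print(#line, line:byte(1, -1))  -- dump what actually came back
    f:close()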

In no text encoding (that I know of) where an end-of-line is just 0A, and thus 
can be read by fgets, does a valid string contain 00. Anything else must be 
treated as non-text even if it is an encoding of text; otherwise you'd break 
the encoding as shown above.

There isn't a reliable way, using fgets, to recover a complete "line" that may 
contain arbitrary bytes. Any trick with padding will always run into the corner 
case of a file that ends with the padding character and no EOL. Replacements 
such as getline are not part of C89. So to make the io library read those lines 
(not only io.lines but also file:read"*l"), it would have to forgo fgets and 
read the file a byte at a time, and reading individual characters from a stream 
is notoriously slow. So if you don't know that your file contains only text in 
a simple encoding, you should treat it as arbitrary data: read large chunks 
into Lua and split them there, as sketched below.
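
A minimal sketch of that chunk-and-split approach (the function name and chunk 
size are made up for the example; it tolerates embedded zeros and a missing 
final newline, which are exactly the cases fgets trips over):

    local function binary_lines(path, chunksize)
      chunksize = chunksize or 64 * 1024
      local f = assert(io.open(path, "rb"))
      local buffer, done = "", false
      return function()
        while true do
          local nl = buffer:find("\n", 1, true)  -- plain find, no patterns
          if nl then
            local line = buffer:sub(1, nl - 1)
            buffer = buffer:sub(nl + 1)
            return line
          end
          if done then
            if buffer ~= "" then
              local last = buffer  -- final "line" with no trailing EOL
              buffer = ""
              return last
            end
            return nil
          end
          local chunk = f:read(chunksize)
          if chunk then
            buffer = buffer .. chunk
          else
            done = true
            f:close()
          end
        end
      end
    end

    -- for line in binary_lines("whatever.bin") do ... end

Lines come back with any trailing CR still attached; strip it yourself if you 
care about that.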

The "fix", as was mentioned some days ago, is to add a note to the manual that 
the line reading functions don't work if the line to be read may contain non-
text characters such as NULL, CR, or Ctrl+Z. In other words:

    A man said to the doctor, "It hurts when I move my arm like this."
    Said the doctor, "Then don't do that."

-- 
tom <telliamed@whoopdedo.org>