On 21.02.2014 09:06, Tom N Harris wrote:
On Friday, February 21, 2014 12:29:37 AM Philipp Janda wrote:
I may be wrong, but isn't there some 16-bit encoding where every other
byte is zero for ASCII characters (UCS-2, UTF-16, or something)?

That may be the case, but trying to read such an encoding will get you in
trouble because the CR+NL is also represented in 16-bit characters. So a line
would be terminated with the bytes (in little-endian mode) 0D 00 0A 00. fgets
will stop at the 0A which is a malformed character, then the next read will
start with 00 and the rest of your text is garbled.

In no text encoding (that I know of) where an end-of-line is just 0A, and thus
can be read by fgets, does a valid string contain 00. Anything else must be
treated as not-text even if it is an encoding of text. Otherwise you'd break
the encoding as shown above.

I agree. It was only meant as an example of a "text file" containing NUL bytes where the concept of lines may still be relevant. (Although, if you are prepared to handle some trailing/leading NUL bytes each line, splitting at '\n' bytes should still "work" ...)
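
To make the trailing/leading NUL point concrete, here is a minimal C sketch. It does not use `fgets` itself; it only mimics how a byte-oriented reader splits at the first 0A byte. The names `byte_line_length` and `utf16le_sample` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the first "line" in buf as a byte-oriented reader sees it:
 * everything up to and including the first 0x0A byte.
 * Returns len if buf contains no 0x0A at all. */
size_t byte_line_length(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    const unsigned char *nl = memchr(buf, 0x0A, len);
    return nl ? (size_t)(nl - buf) + 1 : len;
}

/* "A\r\nB" encoded as UTF-16LE: each ASCII character is followed by 00. */
static const unsigned char utf16le_sample[] = {
    0x41, 0x00,   /* 'A'  */
    0x0D, 0x00,   /* CR   */
    0x0A, 0x00,   /* NL   */
    0x42, 0x00    /* 'B'  */
};
```

Called on `utf16le_sample`, the function cuts the first chunk right after the 0A byte, so the second chunk begins with the 00 that belongs to the newline character. Splitting at '\n' therefore "works" only if the caller is prepared to strip or reattach that stray byte on every line.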

The "fix", as was mentioned some days ago, is to add a note to the manual that
the line reading functions don't work if the line to be read may contain non-
text characters such as NUL, CR, or Ctrl+Z. In other words:

     A man said to the doctor, "It hurts when I move my arm like this."
     Said the doctor, "Then don't do that."

I'm not sure whether that joke is supposed to prove your point or mine, but whatever ... What if somebody else moves your arm? You often don't have control over the files your programs open (and the people who do have control might not read the Lua reference manual). Say, for example, you process a text file (letters and whitespace only) but somehow a single NUL character is in it (via cosmic rays, or the unfortunate combination of keyboard shortcuts and big fingers). If `file:lines()` returns all data including the NUL, I can throw a parse error (which will probably happen automatically if I use pattern matching or LPeg on the lines). With the current approach, I can only detect that case if the missing data makes my line malformed or if I scan the file using some other method.

Unless we can all agree that `file:lines` is for text files in toy programs only, where detecting invalid input is not that important.
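
A rough C sketch of the single-NUL scenario, assuming a line reader built on `fgets` followed by `strlen`, which is how I understand the problem discussed in this thread. All three function names are invented for the example, and the temporary file stands in for a file you don't control.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read one line with fgets and return the length strlen() reports.
 * Anything after an embedded NUL is in buf but invisible to strlen.
 * Returns (size_t)-1 on end of file or error. */
size_t fgets_visible_length(FILE *f, char *buf, size_t cap)
{
    assert(f != NULL && buf != NULL && cap > 0);
    if (fgets(buf, (int)cap, f) == NULL)
        return (size_t)-1;
    return strlen(buf);
}

/* Read one line byte by byte and return its true length
 * (including the '\n', if any). Embedded NULs are kept. */
size_t getc_line_length(FILE *f, char *buf, size_t cap)
{
    assert(f != NULL && buf != NULL);
    size_t n = 0;
    int c;
    while (n < cap && (c = getc(f)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;
    }
    return n;
}

/* Helper for the example: a temporary file holding exactly len bytes. */
FILE *temp_file_with(const void *data, size_t len)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return NULL;
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return NULL;
    }
    rewind(f);
    return f;
}
```

With a file containing the six bytes `a b NUL c d \n`, the `fgets`/`strlen` variant reports a line of length 2 and gives no hint that anything is missing. The byte-by-byte variant returns 6, and a `memchr(buf, '\0', n)` on the result finds the stray NUL, so the caller can reject the input.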

Btw., there was a related security hole[1] with certificate requests where the Common Name (a data+length string in the spec) contained a NUL byte and was compared via C functions (which stop at the first NUL).
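
The general shape of that mistake can be sketched in C as follows. This is not the code of any real TLS library; `struct counted_str`, both function names, and the host names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A data+length string, as in the certificate spec. For this sketch,
 * data is assumed to be followed by a terminating NUL so that strcmp
 * is at least memory-safe. */
struct counted_str {
    const char *data;
    size_t len;
};

/* Flawed: strcmp stops at the first NUL in cn.data and ignores cn.len,
 * so everything after an embedded NUL is never looked at. */
bool matches_c_string_compare(struct counted_str cn, const char *host)
{
    assert(cn.data != NULL && host != NULL);
    return strcmp(cn.data, host) == 0;
}

/* Length-aware: the stored length must match and all bytes are compared. */
bool matches_length_aware(struct counted_str cn, const char *host)
{
    assert(cn.data != NULL && host != NULL);
    size_t host_len = strlen(host);
    return cn.len == host_len && memcmp(cn.data, host, host_len) == 0;
}
```

For a Common Name holding `www.example.com`, a NUL byte, and then `.attacker.example`, the first function reports a match against `www.example.com` while the second one does not.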


  [1]:  (a very nice talk, btw!)