- Subject: Re: Unexpected result using file:read("l")
- From: Francisco Olarte <folarte@...>
- Date: Fri, 25 Mar 2022 08:50:55 +0100
Hi Oliver:
On Thu, 24 Mar 2022 at 19:55, Oliver Kroth <oliver.kroth@nec-i.de> wrote:
> opening a text file in non-binary mode (no 'b' in mode) on Linux won't
> help you with a file that was written with CR+LF as line endings.
Of course it will not. In Linux this is not a text file, so using
text-file functions, like C fgets or Lua's file:read("l"), will not work
quite right.
Compensating is easy (I normally do all my text processing in Perl and
I have muscle memory for chomp; s/\r$// or, for zapping all of them,
tr/\r//d).
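
For reference, a minimal Lua sketch of the same two idioms. The function
names strip_cr and zap_cr are only illustrative, and the first one assumes
the line has already lost its LF, as happens with file:read("l"):

```lua
-- Rough equivalent of Perl's chomp; s/\r$//: drop one trailing CR
-- from a line whose LF has already been removed by the reader.
local function strip_cr(line)
  -- The extra parentheses discard gsub's second result (the match count).
  return (line:gsub("\r$", ""))
end

-- Rough equivalent of Perl's tr/\r//d: delete every CR in the string.
local function zap_cr(s)
  return (s:gsub("\r", ""))
end
```

Note that Lua patterns use % as their escape character, so the "\r" here
is the CR byte produced by the string literal, not a pattern escape.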
The thing is, if you want text files you need to use text-file-aware
tools to transfer them among systems with different line-ending
conventions. You can use ascii mode in some, or use recode, or any
other tool, there are plenty of them.
But a CR-LF file is not a text file in Unix, just as a ^Z-padded one is
not (they were padded this way in CP/M, which kept the size of files
in sectors). I've even used OSs where files were "typed",
record-oriented like DB tables, and text files were just a subclass
where records contained a single text column. They were nice because
every language understood record-oriented files, but hell to
interoperate with others. The notion of what a text file is changes, and
that needs to be taken into account. I can work in a DB, define my text
files as "single column varchar tables", and transfer them to Unix or
Windows and back easily.
> I use to snip off a terminating \r:
> line = file:read('*l')
> if not line then break end
> line = line:gsub( '\r$', '' )
I have one provider who codes and recodes and cuts and pastes
mercilessly, so his files contain:
- Latin-1 chars (Windows-1252 really).
- UTF-8 sequences.
- "Bicoded" UTF-8 (convert 1 Latin-1 char to 2 UTF-8 bytes, then treat
each byte as Latin-1 and re-encode it in UTF-8).
- "Tricoded" UTF-8, two passes of the above.
- Optional BOM.
- Optional "bicoded" BOM (so far no tricoded BOM).
- Single \r, single \n, \r\n, \r\r\n and \n\r as line delimiters.
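
To make the "bicoded" case concrete, here is a small sketch of how it
arises and how one pass of it can be undone. It assumes Lua 5.3 or later
(for the utf8 library) and pure Latin-1 input; real Windows-1252 differs
in the 0x80-0x9F range and would need a mapping table. The function names
are made up for illustration:

```lua
-- Treat every byte of s as a Latin-1 character and encode it as UTF-8.
-- Applying this to text that is already UTF-8 produces "bicoded" text.
local function latin1_to_utf8(s)
  return (s:gsub("[\128-\255]", function(c)
    return utf8.char(c:byte())
  end))
end

-- Undo one encoding pass: decode UTF-8 and emit each code point as a
-- single byte. Returns nil if s is not valid UTF-8 or contains a code
-- point above 255, i.e. if it cannot be the result of such a pass.
local function utf8_to_latin1(s)
  if not utf8.len(s) then return nil end
  local out = {}
  for _, cp in utf8.codes(s) do
    if cp > 255 then return nil end
    out[#out + 1] = string.char(cp)
  end
  return table.concat(out)
end

local e_latin1 = "\233"                     -- e-acute in Latin-1
local e_utf8   = latin1_to_utf8(e_latin1)   -- bytes 195 169
local e_bi     = latin1_to_utf8(e_utf8)     -- bytes 195 131 194 169
```

Here utf8_to_latin1(e_bi) gives back e_utf8. Deciding whether a given
run of bytes is bicoded or genuinely meant to be those characters is
the heuristic part, and this sketch does not attempt it.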
All of these variants (except the BOM, since there can be only one) show
up in a single file (he seems to open files in different editors, key
something in, and save without checking the previous encoding). And all
of it can be more or less detected, compensated for, and translated to
Unix UTF-8. \r is the easy part, as at least he does not have embedded
\r in lines. I prefilter the files, and if you have to deal with lots of
text it normally pays to do it that way, so you know your files are text
files and you can use all the text-file-oriented routines in your
language of choice.
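
A prefilter for the line delimiters alone could look roughly like the
following Lua sketch. It is not the actual filter, just one way to do
it: it assumes the whole file fits in memory, that no line contains an
embedded \r, and that at each delimiter the longest of the five forms
listed above is the one intended. The name normalize_newlines is
hypothetical:

```lua
-- Rewrite every line delimiter (\r\r\n, \r\n, \n\r, \r or \n) as a
-- single \n, scanning left to right and taking the longest match.
local function normalize_newlines(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local j = s:find("[\r\n]", i)
    if not j then
      out[#out + 1] = s:sub(i)      -- trailing text without delimiter
      break
    end
    out[#out + 1] = s:sub(i, j - 1) -- text of the line
    out[#out + 1] = "\n"
    local two = s:sub(j, j + 1)
    if s:sub(j, j + 2) == "\r\r\n" then
      i = j + 3
    elseif two == "\r\n" or two == "\n\r" then
      i = j + 2
    else
      i = j + 1
    end
  end
  return table.concat(out)
end
```

The file should be opened in binary mode ("rb") before passing its
contents through this, so the C library does no translation of its own.
Some sequences are inherently ambiguous (for example \n\r could also be
an \n-terminated line followed by an empty \r-terminated one), so the
longest-match rule is a guess that happens to suit files like these.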
Francisco Olarte.