I have one provider who codes and recodes and cuts and pastes
mercilessly, so his files contain:
- Latin-1 chars (Win-1252, really).
- UTF-8 sequences.
- "bicoded" utf8 ( convert 1 latin1 to 2 utf8 bytes, then treat each
byte as latin 1 and reencode in utf8 ).
- "tricoded" utf8, two pass of the above.
- Optional BOM.
- Optional "bicoded" bom ( so far no tricoded bom )
- Single \r, single \n, \r\n, \r\r\n and \n\r as line delimiters.
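A minimal sketch of undoing the extra encoding layers, assuming Python 3 and that a line has already been decoded to a string (the function name is just illustrative; swap latin-1 for cp1252 if the editor that did the damage really read the bytes as Windows-1252):

    def fix_overcoded(text: str) -> str:
        # Peel off "bicoding" layers: re-interpret the characters as
        # Latin-1 bytes and decode those bytes as UTF-8, until the text
        # no longer round-trips (i.e. we have reached the real content).
        while any(ord(c) > 0x7f for c in text):
            try:
                undone = text.encode('latin-1').decode('utf-8')
            except (UnicodeEncodeError, UnicodeDecodeError):
                break
            if undone == text:
                break
            text = undone
        # A bicoded BOM collapses to a plain U+FEFF in the same passes,
        # so it can simply be stripped afterwards.
        return text.lstrip('\ufeff')

Run on bicoded input it undoes one layer, on tricoded input two; genuinely Latin-1 or ASCII text falls out untouched. It is a heuristic, not a guarantee: text that was actually meant to look like mojibake would get "fixed" too.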
All of this (except the BOM, since there can be only one) in a single
file. (He seems to open the files in different editors, key something
in, and save, disregarding any previous encoding check.) And all of it
can be more or less detected, compensated for, and translated to Unix
UTF-8. The \r handling is the easy part, since at least he does not
have embedded \r inside lines. I prefilter them, and if you have to
deal with lots of text it normally pays to do it that way, so you know
your files are text files and you can use all the text-file-oriented
routines in your language of choice.
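The prefilter itself can be a single substitution on the raw bytes, before any decoding. A sketch under the same assumptions (the regex lists the longer sequences first so \r\r\n is not split into \r plus \r\n; safe only because no \r is embedded inside lines):

    import re

    # Every delimiter variant the provider uses, collapsed to plain \n.
    _LINE_ENDINGS = re.compile(rb'\r\r\n|\r\n|\n\r|\r')

    def prefilter_newlines(raw: bytes) -> bytes:
        return _LINE_ENDINGS.sub(b'\n', raw)

After that pass the file really is a plain Unix text file, and the ordinary line-oriented routines can take over.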