- Subject: Re: Lost in Unicode
- From: RLake@...
- Date: Mon, 20 Oct 2003 11:15:58 -0500
On Monday 20 October 2003 15:41, Reuben Thomas wrote:
> There is a way, because ISO-8859-1 files are invalid unicode.
And Enrico asked:
> Even if a 2-character sequence (in the high range) happens
> to be the same as a valid Unicode character?
It's possible, although unlikely. For example, the legal ISO-8859-1
sequence Â¡ (0xC2 0xA1) is UTF-8 for the ISO-8859-1 character ¡ (0xA1);
in fact, you could think of Â as a sort of superquote for ISO-8859-1
characters in the range 0xA0 through 0xBF.
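
To make the ambiguity concrete, here is a minimal Python sketch (Python is used
purely for illustration, and the helper name `is_valid_utf8` is hypothetical,
not something from this thread). It decodes the same two bytes both ways, and
shows that ordinary ISO-8859-1 text containing a lone high byte is rejected as
UTF-8:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The ambiguous pair: legal in both encodings, but with different meanings.
ambiguous = bytes([0xC2, 0xA1])
as_latin1 = ambiguous.decode("iso-8859-1")  # two characters: U+00C2, U+00A1
as_utf8 = ambiguous.decode("utf-8")         # one character: U+00A1

# Typical ISO-8859-1 text: a single high byte followed by ASCII or nothing.
typical_latin1 = "caf\u00e9".encode("iso-8859-1")  # the bytes 63 61 66 E9

print(is_valid_utf8(ambiguous))       # the pair passes the UTF-8 check
print(is_valid_utf8(typical_latin1))  # the lone 0xE9 does not
```

A validity check of this kind is only a heuristic: it cannot tell which reading
of the ambiguous pair the author intended.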
> By the way, I gather that Roberto's "toISO" function would not work
> correctly if a "combining character" is encountered (e.g. "e" followed
> by "combining dieresis") instead of a single UTF-8 character ("e with
> dieresis"). Are they commonly used in editors?
They shouldn't be, in the case that a precomposed character exists.
However, it would probably depend on the language of the editor -- for
example, there is some duplication between ISO-8859-1 and ISO-8859-2,
and I wouldn't want to speculate on how an editor written or configured
for an "ISO-8859-2" language might work.
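
One way to guard against combining sequences is to normalize to NFC before
converting. The following is only a sketch in Python under that assumption;
`to_iso` is a hypothetical stand-in and is not Roberto's "toISO" function,
whose implementation is not shown in this message:

```python
import unicodedata

# "e" followed by U+0308 COMBINING DIAERESIS: two code points.
decomposed = "e\u0308"

# NFC composes the pair into the single precomposed code point U+00EB.
composed = unicodedata.normalize("NFC", decomposed)

def to_iso(text: str) -> bytes:
    """Hypothetical conversion: compose with NFC, then encode as ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")

print(len(decomposed), len(composed))
print(to_iso(decomposed))
```

Without the normalization step, encoding `decomposed` directly as ISO-8859-1
raises `UnicodeEncodeError`, because U+0308 has no ISO-8859-1 equivalent.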
Roberto's function will also fail, possibly more seriously, on characters
outside of the ISO-8859-1 range. The code page typically used by
non-Unicode OSes uses the high-control positions (0x80 to 0x9F) for
additional graphics characters whose Unicode code points are outside of
the two-byte UTF-8 range. In particular, typographic single and double
quotes will not translate properly, nor will typographic em dashes, and
those are characters typically inserted by editors (or at least by MS
Word).
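
As a rough illustration, the Python sketch below assumes that the code page in
question is Windows-1252 (Python's "cp1252" codec); the message above does not
name it. The helper `utf8_length` is hypothetical. It reports where a few of
those high-range bytes land in Unicode and how many UTF-8 bytes they need:

```python
# Assumption: the code page meant above is Windows-1252 (codec "cp1252").
samples = {
    0x91: "left single quote",
    0x92: "right single quote",
    0x93: "left double quote",
    0x94: "right double quote",
    0x97: "em dash",
}

def utf8_length(byte_value: int) -> int:
    """UTF-8 byte count for the character cp1252 assigns to byte_value."""
    char = bytes([byte_value]).decode("cp1252")
    return len(char.encode("utf-8"))

for byte_value, name in samples.items():
    char = bytes([byte_value]).decode("cp1252")
    print(f"0x{byte_value:02X} {name}: U+{ord(char):04X}, "
          f"{utf8_length(byte_value)} UTF-8 bytes")
```

Each of these characters needs three UTF-8 bytes, so a converter that only
handles two-byte sequences cannot map them back to a single byte.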
Rici.