Re: Lost in Unicode

lua-l archive

Subject: Re: Lost in Unicode
From: RLake@...
Date: Mon, 20 Oct 2003 11:15:58 -0500

 On Monday 20 October 2003 15:41, Reuben Thomas wrote:
> There is a way, because ISO-8859-1 files are invalid unicode. 

And Enrico asked:

> Even if a 2-character sequence (in the high range) happens
> to be the same as a valid Unicode character?

It's possible, although unlikely. For example, the legal ISO-8859-1
sequence Â¡ (0xC2 0xA1) is UTF-8 for the ISO-8859-1 character ¡ (0xA1);
in fact, you could think of Â as a sort of superquote for ISO-8859-1
characters in the range 0xA0 through 0xBF.
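
To make the ambiguity concrete, here is a minimal Python sketch (not
code from this thread). It reads the same two bytes both ways and shows
the kind of strict-decoding check that tells the encodings apart most
of the time. The name looks_like_utf8 is hypothetical, and the check is
only a heuristic, for exactly the reason given above.

```python
# The same two bytes, read under two different encodings.
data = bytes([0xC2, 0xA1])

as_latin1 = data.decode("latin-1")  # two characters: Â followed by ¡
as_utf8 = data.decode("utf-8")      # one character: ¡


def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: True if raw is well-formed UTF-8.

    A True result does not prove the data is UTF-8; an ISO-8859-1 file
    containing only pairs such as 0xC2 0xA1 would also pass.
    """
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC2 acts as a "superquote" for every byte in 0xA0..0xBF:
superquote_holds = all(
    bytes([0xC2, b]).decode("utf-8") == chr(b) for b in range(0xA0, 0xC0)
)
```

Typical ISO-8859-1 text, such as the bytes of "été" (0xE9 0x74 0xE9),
fails the check, because 0xE9 would have to be followed by UTF-8
continuation bytes.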

> By the way, I gather that Roberto's "toISO" function would not work
> correctly if a "combining character" is encountered (e.g. "e" followed
> by "combining dieresis") instead of a single UTF-8 character ("e with
> dieresis"). Are they commonly used in editors?

They shouldn't be when a precomposed character exists. However, it
would probably depend on the language the editor targets -- for example,
there is some duplication between ISO-8859-1 and ISO-8859-2, and I
wouldn't want to speculate on how an editor written or configured for
an "ISO-8859-2" language might work.

Roberto's function will also fail, possibly more seriously, on
characters outside of the ISO-8859-1 range. In particular, the code page
typically used by non-Unicode OSes uses the high control characters (in
the range 0x80 to 0x9F) for additional graphics characters whose Unicode
code points are outside of the two-byte UTF-8 range. Typographic single
and double quotes will not translate properly, nor will typographic em
dashes, and those are characters typically inserted by editors (or at
least by MS Word).
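
As a rough illustration, the Python sketch below assumes the code page
in question is Windows-1252 (the message does not name it) and looks up
where a few of those 0x80-0x9F bytes land in Unicode. The helper name
describe is hypothetical.

```python
def describe(byte: int) -> tuple[int, int]:
    """Return (Unicode code point, UTF-8 length) for one cp1252 byte."""
    ch = bytes([byte]).decode("cp1252")
    return ord(ch), len(ch.encode("utf-8"))


# Curly single quotes, curly double quotes, and the em dash.
for byte in (0x91, 0x92, 0x93, 0x94, 0x97):
    code_point, utf8_len = describe(byte)
    print(f"0x{byte:02X} -> U+{code_point:04X}, {utf8_len} UTF-8 bytes")


def fits_latin1(ch: str) -> bool:
    """True if the character can be encoded as ISO-8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

Each of these characters takes three bytes in UTF-8 and has no
ISO-8859-1 equivalent, so a converter that only handles two-byte
sequences cannot translate them.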

Rici.

Follow-Ups:
- Re: Lost in Unicode, Enrico Colombini
