- Subject: Re: code page
- From: David Given <dg@...>
- Date: Tue, 12 May 2009 23:13:28 +0100
Marco Antonio Abreu wrote:
> When a field
> value has one accented char, it truncates the last one ('Flávia' comes
> out like 'Fl??vi' - ?? are special chars); if the text has two accented
> chars it has the last two chars cut off, and so on...
This is a classic symptom of UTF-8 misparsing.
What happens is: somebody is encoding the string as UTF-8 as follows:
46 6c c3 a1 76 69 61
Note that the 'á' is encoded as two bytes (c3 a1). However, someone then
parses this as if it were ISO-8859-1 (a.k.a. Latin-1), which comes out as:
F l Ã ¡ v i a
Those two bytes are now interpreted as two distinct code points ('Ã' and '¡').
However, now we have one code point too many, so the last one (the 'a')
gets truncated.
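
To make the byte-level story concrete, here is a small Python sketch that reproduces the symptom. The truncation step is an assumption on my part: I'm guessing that some stage keeps only as many units as the original string had characters. The variable names are purely illustrative.

```python
# Illustrative only: reproduces the symptom from the quoted report.
name = "Fl\u00e1via"                      # 'Flávia', 6 code points

# Stage 1: somebody encodes the string as UTF-8 (the 'á' becomes c3 a1).
utf8_bytes = name.encode("utf-8")
print(utf8_bytes.hex(" "))                # 46 6c c3 a1 76 69 61

# Stage 2: somebody else decodes those bytes as ISO-8859-1, where every
# byte maps to exactly one code point, so c3 a1 turns into 'Ã' and '¡'.
misread = utf8_bytes.decode("latin-1")
print(len(name), len(misread))            # 6 7

# Stage 3 (assumed): the result is cut to the original character count,
# which drops the trailing 'a'.
truncated = misread[:len(name)]
print(truncated)
```

Each extra accented character adds one more surplus code point, which would explain why two accents cost two trailing characters.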
You should probably check each stage of your pipeline to make sure that
it's receiving and accepting the right encoding --- it sounds like
something's getting it wrong.
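
One rough way to check a stage is to dump the raw bytes it hands over and see whether they decode as strict UTF-8. The sketch below assumes you can get at those bytes; `looks_like_utf8` and `undo_latin1_misread` are hypothetical helper names, not part of any library, and the second one only reverses this one specific mistake.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def undo_latin1_misread(text: str) -> str:
    """Reverse a 'UTF-8 bytes read as Latin-1' mistake, if that is what happened.

    Returns the text unchanged when the round trip does not work.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(looks_like_utf8(b"Fl\xc3\xa1via"))   # UTF-8 bytes for 'Flávia'
print(looks_like_utf8(b"Fl\xe1via"))       # Latin-1 bytes for 'Flávia'
print(undo_latin1_misread("Fl\u00c3\u00a1via"))
```

Plain ASCII passes both checks, so this only tells you something when the data actually contains accented characters. It's better to fix the stage that is mislabelling the encoding than to repair the text afterwards.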
http://www.cowlark.com
"People who think they know everything really annoy those of us who
know we don't." --- Bjarne Stroustrup