Re: code page

lua-l archive

Subject: Re: code page
From: David Given <dg@...>
Date: Tue, 12 May 2009 23:13:28 +0100

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marco Antonio Abreu wrote:
> When a field
> value has one accented char, it truncate the last one ('Flávia' comes
> like 'Fl??vi' - ?? are especial chars), if the text has two accented
> chars it has the last two chars cutted and so on...

This is a classic symptom of UTF-8 misparsing.

What happens is: somebody is encoding the string as UTF-8 as follows:

46 6c c3 a1 76 69 61

Note that the 'á' is encoded as two bytes (c3 a1). However, someone is
then parsing those bytes as if they were ISO-8859-1 (a.k.a. Latin-1),
which comes out as:

FlÃ¡via

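Those two bytes are now interpreted as two distinct code points (Ã and ¡).
That makes the string seven characters long instead of six, so if
something is still working with the original length (which is my guess),
we have one code point too many and the last one (the 'a') is discarded.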
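
The misparse is easy to reproduce outside your application. Here is a
minimal sketch, in Python rather than Lua purely for illustration. The
helper truncate_to_char_count is hypothetical: it only models my guess
that some stage remembers the original character count and cuts the
misdecoded string to that length.

```python
# The name from the original report, with 'á' written as an escape so the
# example does not depend on the encoding of this source file.
original = "Fl\u00e1via"

# Step 1: someone encodes the string as UTF-8; 'á' becomes c3 a1.
utf8_bytes = original.encode("utf-8")
hex_dump = " ".join(f"{b:02x}" for b in utf8_bytes)

# Step 2: someone else decodes those same bytes as ISO-8859-1 (Latin-1).
# Every byte becomes one code point, so c3 a1 turns into 'Ã' and '¡'.
misparsed = utf8_bytes.decode("latin-1")


def truncate_to_char_count(text: str, count: int) -> str:
    """Hypothetical stage that keeps only `count` characters (an assumption)."""
    return text[:count]


# Step 3: the original had 6 characters, the misparsed string has 7,
# so cutting to the original length drops the final 'a'.
truncated = truncate_to_char_count(misparsed, len(original))

print(hex_dump)
print(len(original), len(misparsed))
print(truncated)
```

Each additional accented character adds one more surplus code point,
which would explain why two accented characters cost you the last two
characters, and so on.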

You should probably check each stage of your pipeline to make sure that
it's receiving the encoding it expects and producing the encoding the
next stage expects. It sounds like one of them is getting it wrong.
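
One way to check a stage is to dump the raw bytes it hands over and see
whether they are valid UTF-8. The following is only a sketch in Python;
describe_bytes is a hypothetical helper, not part of any library, and
how you capture the bytes at each stage depends on your setup.

```python
def describe_bytes(data: bytes) -> str:
    """Return a hex dump of `data` plus a rough guess at its encoding.

    Hypothetical diagnostic helper. Note that pure ASCII is also valid
    UTF-8, so a "valid UTF-8" verdict is only informative when the data
    contains non-ASCII bytes.
    """
    hex_dump = " ".join(f"{b:02x}" for b in data)
    try:
        data.decode("utf-8")
        verdict = "valid UTF-8"
    except UnicodeDecodeError:
        verdict = "not valid UTF-8, possibly Latin-1 or another 8-bit code page"
    return f"{hex_dump} [{verdict}]"


# 'Flávia' as UTF-8: the 'á' is the two bytes c3 a1.
print(describe_bytes(b"Fl\xc3\xa1via"))

# 'Flávia' as Latin-1: the 'á' is the single byte e1.
print(describe_bytes(b"Fl\xe1via"))
```

If a stage that is supposed to emit Latin-1 shows c3 a1 where you expect
e1 (or the other way round), that is the place to look.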

- --
┌─── ｄｇ＠ｃｏｗｌａｒｋ．ｃｏｍ ───── http://www.cowlark.com ─────
│
│ "People who think they know everything really annoy those of us who
│ know we don't." --- Bjarne Stroustrup
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFKCfSFf9E0noFvlzgRAit9AKCqXOUrbpWR5qweRLmQhfRXmnhlVgCfSmMn
r1shX4gBRP6YIeZ4HwIupnk=
=0UxC
-----END PGP SIGNATURE-----

