- Subject: Re: Chinese characters in a string
- From: Scott Morgan <blumf@...>
- Date: Tue, 04 Dec 2012 14:31:33 +0000
On 04/12/12 13:10, alessandro codenotti wrote:
> Hello, I moved to China some 3 months ago, and now that I'm starting to
> speak the language I'm also starting to write programs that have to
> operate on strings containing Chinese characters, and I noticed that the
> string functions behave in a strange way:
>
> a="我叫李乐"
> print(string.sub(a,1,4)) -->我叫
> print(string.sub(a,1,5)) -->我叫?
> print(string.len(a)) -->8
>
> It seems that every Chinese character is counted twice. That was not a
> problem, since all the string functions behave like this and their
> results are therefore compatible, but I was interested in the reason behind
> these results. I guess it is somehow related to the code used to
> represent them, but I'd like to know a precise explanation!
> Thanks in advance for the help, and sorry for the grammar mistakes I
> probably made, but English is not my mother tongue!
A quick check on your email headers shows which charset you're using:
> Content-Type: text/plain; charset="gb2312"
And checking up on GB2312 shows how it encodes characters:
http://en.wikipedia.org/wiki/GB_2312
> EUC-CN is often used as the character encoding (i.e. for external
> storage) in programs that deal with GB2312, thus maintaining
> compatibility with ASCII. Two bytes are used to represent every
> character not found in ASCII. The value of the first byte is from
> 0xA1-0xF7 (161-247), while the value of the second byte is from
> 0xA1-0xFE (161-254).
Using that, you should be able to parse your Chinese text. However, that
rule won't hold for all encodings you may encounter (Big5, for example).
Ultimately you need to be aware of how computers handle various
character sets and, most importantly, know which charset the text you're
handling is encoded in.
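As a rough illustration of the byte ranges quoted above, here is a minimal sketch of splitting an EUC-CN string into characters. The function name euc_cn_chars is made up for this example, and it assumes the input really is EUC-CN encoded GB2312; it does no validation of the second byte. The sample bytes are meant to be the string from your mail, written as decimal escapes so the encoding of the source file doesn't matter.

```lua
-- Split an EUC-CN (GB2312) byte string into a table of characters.
-- Hypothetical helper: assumes valid EUC-CN input, does not check the
-- second byte of each pair.
local function euc_cn_chars(s)
  local chars = {}
  local i = 1
  while i <= #s do
    local b = string.byte(s, i)
    local n = 1
    -- A lead byte in 0xA1-0xF7 starts a two-byte character;
    -- anything else is treated as a single byte (ASCII).
    if b >= 0xA1 and b <= 0xF7 and i < #s then
      n = 2
    end
    chars[#chars + 1] = string.sub(s, i, i + n - 1)
    i = i + n
  end
  return chars
end

-- Intended to be "我叫李乐" in GB2312 (assumed byte values).
local a = "\206\210\189\208\192\238\192\214"
local chars = euc_cn_chars(a)

print(#a)       --> 8  (bytes, what string.len reports)
print(#chars)   --> 4  (characters)
-- The first two characters are the first four bytes:
print(table.concat(chars, "", 1, 2) == string.sub(a, 1, 4))  --> true
```

This is why string.sub(a,1,5) gave you a "?" at the end: it cut the third character in half.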
Simple rules every programmer should know:
* Glyph == the image of the letter/symbol being displayed
* Character == a code point in the character-set
* Byte == the basic unit of data (8 bits) that most code, including Lua's
string lib, works on
* A Glyph is one or more Characters
* A Character is one or more Bytes
* There can be more than one Byte sequence representing the same Glyph
(so a simple byte comparison of two strings may return a false negative)
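To make the last rule concrete, here is a small sketch using UTF-8 rather than GB2312 (an assumption on my part, chosen only because it gives a short example). Both strings below are normally displayed as the same glyph "é", but one is a single precomposed code point and the other is a plain "e" followed by a combining accent, so Lua's byte-wise comparison sees them as different.

```lua
-- Two UTF-8 byte sequences that are normally rendered as the same glyph.
local precomposed = "\195\169"     -- U+00E9, one code point, two bytes
local decomposed  = "e\204\129"    -- U+0065 + U+0301, two code points, three bytes

print(#precomposed)                --> 2
print(#decomposed)                 --> 3
print(precomposed == decomposed)   --> false
```

Getting these to compare equal needs some form of normalisation first, which Lua's standard string library does not provide.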
The details can be a lot more involved, but that's the essence of
dealing with the variety of modern charsets.
Scott