On Tue, Dec 4, 2012 at 8:10 AM, alessandro codenotti <code95@live.it> wrote:
> Hello, I moved to China some 3 months ago, and now that I'm starting to speak
> the language I'm also starting to write programs that operate on
> strings containing Chinese characters. I noticed that the string functions
> behave in a strange way:
>
> a="我叫李乐"
> print(string.sub(a,1,4)) -->我叫
> print(string.sub(a,1,5)) -->我叫?
> print(string.len(a)) -->8
>
> It seems that every Chinese character is counted twice. That has not been a
> problem, since all the string functions behave like this and their results
> are therefore compatible, but I am interested in the reason behind these
> results. I guess it is somehow related to the encoding used to represent
> them, but I'd like to know a precise explanation!
> Thanks in advance for the help, and sorry for the grammar mistakes I probably
> made, but English is not my mother tongue!
>
> p.s.
> I hope Chinese characters work fine on the mailing list, or this post will
> look quite messed up...

Welcome to the "wonderful" world of Unicode. Read about UTF-8 and
understand that Lua string functions operate on bytes, not on
characters, and characters can be more than one byte. (Then read about
combining characters, the numerous different ways to encode the same
glyph, and the different encodings in use by different systems, and
try to cling to your sanity...)
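
To make the byte-versus-character distinction concrete, here is a minimal
sketch in plain Lua, assuming the string is UTF-8. The helper names utf8_len
and utf8_prefix are made up for this example and are not part of the standard
library. The test string is spelled out as byte escapes so the result does not
depend on how the source file is saved. Note that in UTF-8 these four
characters take 12 bytes; the 8 in the quoted output suggests the original
string was in a two-byte encoding such as GBK, which is an inference from the
numbers, not something stated in the post.

```lua
-- "我叫李乐" written as explicit UTF-8 bytes (3 bytes per character).
local s = "\230\136\145\229\143\171\230\157\142\228\185\144"

-- A byte starts a new character unless it is a UTF-8 continuation
-- byte, i.e. in the range 0x80..0xBF.
local function is_lead(b)
  return b < 0x80 or b >= 0xC0
end

-- Hypothetical helper: count characters (code points), not bytes.
local function utf8_len(str)
  local n = 0
  for i = 1, #str do
    if is_lead(string.byte(str, i)) then n = n + 1 end
  end
  return n
end

-- Hypothetical helper: return the first n characters without
-- cutting a multi-byte sequence in half.
local function utf8_prefix(str, n)
  local count = 0
  for i = 1, #str do
    if is_lead(string.byte(str, i)) then
      count = count + 1
      if count > n then return string.sub(str, 1, i - 1) end
    end
  end
  return str
end

print(#s)                  -- 12 (bytes)
print(utf8_len(s))         -- 4  (characters)
print(#utf8_prefix(s, 2))  -- 6  (two whole characters)

-- The same glyph can have more than one encoding: precomposed
-- U+00E9 versus "e" followed by combining U+0301.
local composed   = "\195\169"
local decomposed = "e\204\129"
print(composed == decomposed)                       -- false
print(utf8_len(composed), utf8_len(decomposed))     -- 1   2
```

The sketch does no validation, so malformed input gives meaningless counts,
and it counts code points, which is still not the same thing as what a reader
would call a character once combining marks are involved.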

-- 
Sent from my Game Boy.