On 4 December 2012 14:10, alessandro codenotti <code95@live.it> wrote:
Hello, I moved to China some 3 months ago, and now that I'm starting to speak the language I'm also starting to write programs that have to operate on strings containing Chinese characters. I noticed that the string functions behave in a strange way:

a="我叫李乐"
print(string.sub(a,1,4)) -->我叫
print(string.sub(a,1,5)) -->我叫?
print(string.len(a)) -->8

It seems that every Chinese character is counted twice. That was not a problem, since all the string functions behave like this and their results are therefore compatible, but I was interested in the reason behind these results. I guess it is somehow related to the code used to represent them, but I'd like to know a precise explanation!

Lua strings are essentially (immutable) arrays of bytes, not sequences of Unicode characters as in some other languages (such as Java). Lua only sees a sequence of bytes. What lets us interpret those bytes as characters is an *encoding*, such as ASCII or UTF-8. Which encoding your string is written in depends on your system environment and on the editor in which you created the file. Please read [1] for more info. Since you are using Chinese, several encodings are possible, such as Unicode variants (UTF-8, UCS-2), Big5, GB18030, etc. [2]

[1] http://www.joelonsoftware.com/articles/Unicode.html 
[2] http://en.wikipedia.org/wiki/CJK

When you work with the string in Lua, it may actually be a lot longer than the 4 characters you have written. I use UTF-8 by default, so I will speak about this encoding specifically. This is how the bytes forming my string look in memory:

> for i=1,#a do print(i, a:sub(i,i):byte()) end
1 230
2 136
3 145
4 229
5 143
6 171
7 230
8 157
9 142
10 228
11 185
12 144

Each character here takes 3 bytes, which together form one UTF-8 encoded character. My system uses UTF-8, so my string is 12 bytes long; yours is 8 bytes long, so you are using a different encoding (you have to find out which one, but it is some kind of 16-bit encoding, with 2 bytes per character). The way Lua treats the string is the same either way.
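To make the byte/character distinction concrete, here is a minimal sketch of counting characters in a UTF-8 string in plain Lua. The helper name utf8_len is hypothetical (it is not part of the standard string library), and the sketch assumes the input is valid UTF-8. The string is written with decimal byte escapes, copied from the dump above, so the example does not depend on the encoding of the file it is saved in.

```lua
-- The 12 bytes from the dump above: four characters, 3 bytes each in UTF-8.
local a = "\230\136\145\229\143\171\230\157\142\228\185\144"

-- Hypothetical helper: count UTF-8 characters by counting the bytes that
-- start a character. Continuation bytes lie in the range 0x80-0xBF and
-- are skipped. Assumes the input is valid UTF-8.
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 0x80 or b >= 0xC0 then
      n = n + 1
    end
  end
  return n
end

print(#a)           --> 12 (bytes)
print(utf8_len(a))  --> 4  (characters)
```

The same idea does not carry over to a 2-byte encoding; there the rule for finding character boundaries is different, which is why you need to know the encoding first.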

When you access bytes 1-4 in your encoding, you are looking at exactly the first two characters. When you add another byte, it does not form a complete character on its own, and the system usually displays such an incomplete sequence as a question mark.
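For UTF-8 specifically, a substring function that cuts only on character boundaries could be sketched as follows. The name utf8_sub is hypothetical, the indices are character positions rather than byte positions, negative indices are not handled, and the input is assumed to be valid UTF-8. It is an illustration of the idea, not a replacement for a proper library.

```lua
-- The same 12-byte UTF-8 string as in the dump above.
local a = "\230\136\145\229\143\171\230\157\142\228\185\144"

-- Hypothetical helper: like string.sub, but i and j count characters.
-- Assumes valid UTF-8 and i, j >= 1 (j defaults to the last character).
local function utf8_sub(s, i, j)
  -- Record the byte offset at which each character starts.
  local starts = {}
  for pos = 1, #s do
    local b = s:byte(pos)
    if b < 0x80 or b >= 0xC0 then
      starts[#starts + 1] = pos
    end
  end
  local n = #starts
  j = j or n
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  -- The last byte of character j is one before the start of character j+1.
  local last = (starts[j + 1] or (#s + 1)) - 1
  return s:sub(starts[i], last)
end

print(#utf8_sub(a, 1, 2))  --> 6 (two whole characters, 3 bytes each)
print(#a:sub(1, 4))        --> 4 (one whole character plus one stray byte)
```

Building the table of start offsets on every call is wasteful for long strings; it is done here only to keep the sketch short.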

To sum up: Lua operates on bytes, not characters, so you have to know which encoding you are using. You can use the lua-iconv [3] module to convert between different encodings (for example to and from UTF-8), and slnunicode [4] provides functions that work with UTF-8 encoded strings on a per-character basis, implementing functions such as sub, gsub, match...

[3] http://ittner.github.com/lua-iconv/
[4] https://github.com/LuaDist/slnunicode
 
p.s.
I hope the Chinese characters work fine on the mailing list, or this post will look quite messed up...

They display correctly for me (compared to the PNG you sent).