lua-l archive


> but i'm still confusing, UTF-8 format is not wide character version, and it's unicode, 
> and microsoft's wide character isn't unicode? 
> UTF-8 is multi-byte and microsoft's wide character is 2-byte array..

Unicode is a character set which maps integer numbers to characters. In theory 
there is no limit to the size of Unicode, since it does not define any 
representation. However, it is extremely unlikely that any characters will be 
mapped to numbers greater than 2^31 (2,147,483,648), and there are proposals to 
limit this to 2^21 (2,097,152).
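
As a rough illustration of the "integer to character" mapping, here is a small 
Python sketch. The four sample characters are my own choice, not anything from 
the thread; the point is only that each character has a number, and that this 
number exists independently of any byte encoding.

```python
# Unicode assigns each character an integer (a "code point").
# The samples are arbitrary: Latin A, e-acute, the euro sign, and a
# musical G clef, which lies outside the first 2^16 positions.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

# ord() maps character -> integer
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # chr() is the inverse mapping: integer -> character
    assert chr(cp) == ch
    print(f"U+{cp:04X} = {cp:>6}   fits in 16 bits: {cp < 2**16}")
```

Only the last sample needs more than 16 bits, which is the case the following 
paragraphs are concerned with.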

Initially Unicode was limited to 2^16 positions (65,536), but this was found 
to be inadequate. The first 2^16 positions of Unicode are known as the Basic 
Multilingual Plane (BMP), which is intended to be enough to represent all 
living languages. However, as other messages have suggested, it does not 
contain historical characters. This space is not yet full, so further 
characters may be added in the future.

Unicode doesn't specify a way for the integers representing characters to be 
encoded, so there are a number of options. Windows was designed when Unicode 
characters were only 16 bits long, so it encodes each character as two bytes 
and can therefore only represent the BMP. This is Microsoft's wide character 
representation (sometimes called UCS-2).

UTF-8 is a variable-length encoding which can represent the whole Unicode 
codespace (1-3 bytes for the BMP, 1-4 bytes for 21-bit Unicode, 1-6 bytes for 
31-bit Unicode). It has several features that are good for backwards 
compatibility. For details see: http://www.cl.cam.ac.uk/~mgk25/unicode.html
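
To make the byte counts above concrete, here is a minimal Python sketch of the 
original 1-6 byte UTF-8 scheme. The function name utf8_encode is hypothetical, 
and the code does no validation beyond a range check (for example it does not 
reject surrogate values). Note that the current UTF-8 definition (RFC 3629) 
stops at 4 bytes; the 5 and 6 byte forms are shown only because they match the 
31-bit range described above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the original 1-6 byte UTF-8 scheme."""
    if cp < 0 or cp >= 2**31:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is stored unchanged in a single byte

    # (sequence length, exclusive upper limit, lead-byte marker bits)
    table = [
        (2, 0x800, 0xC0),       # 110xxxxx
        (3, 0x10000, 0xE0),     # 1110xxxx  (covers the rest of the BMP)
        (4, 0x200000, 0xF0),    # 11110xxx  (covers 21-bit values)
        (5, 0x4000000, 0xF8),   # 111110xx
        (6, 0x80000000, 0xFC),  # 1111110x  (covers 31-bit values)
    ]
    for nbytes, limit, lead in table:
        if cp < limit:
            break

    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(lead | cp)  # lead byte carries the remaining high bits
    return bytes(reversed(out))


for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

For values up to 4 bytes the result should agree with Python's built-in 
str.encode("utf-8"), which is a convenient way to check the sketch.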

UTF-16 is another encoding which can represent the whole Unicode codespace as 
2 or 4 bytes per character.
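
The "2 or 4 bytes" can be sketched as follows. The function name utf16_units 
is hypothetical; it assumes the standard surrogate-pair scheme, in which 
characters outside the BMP are split across two 16-bit units, and it does not 
check that a BMP value is itself a valid character. That scheme reaches up to 
U+10FFFF, which the sketch treats as its upper limit.

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit code units that represent one code point."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not representable in UTF-16")
    if cp < 0x10000:
        return [cp]  # BMP character: one 16-bit unit (2 bytes)
    v = cp - 0x10000  # 20 bits remain after removing the BMP offset
    high = 0xD800 | (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return [high, low]           # two 16-bit units (4 bytes)


def utf16_be_bytes(cp: int) -> bytes:
    """Serialise the code units in big-endian byte order."""
    return b"".join(u.to_bytes(2, "big") for u in utf16_units(cp))


for cp in (0x41, 0x20AC, 0x1D11E):
    print(f"U+{cp:04X} -> {utf16_be_bytes(cp).hex(' ')}")
```

A program that treats every 16-bit unit as a whole character, as a pure UCS-2 
reader would, sees the last example as two separate values rather than one 
character.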

There are many other encodings but most applications use these two.

For more information read:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
and
http://www.unicode.org/

Hope this helps,
Steven Murdoch.