- Subject: Re: lua for unicode
- From: lua+Steven.Murdoch@...
- Date: Mon, 02 Dec 2002 12:12:46 +0000
> but i'm still confusing, UTF-8 format is not wide character version, and it's unicode,
> and microsoft's wide character isn't unicode?
> UTF-8 is multi-byte and microsoft's wide character is 2-byte array..
Unicode is a character set which maps integer numbers to characters. In theory
there is no limit to the size of Unicode since it does not define any
representation. However, it is extremely unlikely that any characters will be
mapped to numbers greater than 2^31 (2,147,483,648), and there are proposals to
limit this to 2^21 (2,097,152).
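To make the "integer to character" mapping concrete, here is a small sketch in Python (my choice of language for illustration only, not something discussed in this thread). It uses the built-ins ord() and chr(); the variable names are just for the example.

```python
# A code point is simply the integer Unicode assigns to a character.
code_point = ord("A")      # character -> integer (0x41)
euro = chr(0x20AC)         # integer -> character (EURO SIGN)

# Anything below 2^16 lies inside the Basic Multilingual Plane.
euro_in_bmp = ord(euro) < 2**16

print(hex(code_point), euro, euro_in_bmp)
```

Note that no bytes are involved yet: how the integer is stored is a separate question, which is where the encodings come in.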
Initially Unicode was limited to 2^16 positions (65,536), but this was found
to be inadequate. The first 2^16 positions of Unicode are known as the Basic
Multilingual Plane (BMP), which is intended to be enough to represent all
living languages; however, as other messages have suggested, it does not
contain historical characters. This space is not yet full, so further
characters may be added in the future.
Unicode doesn't specify a way for the integers representing characters to be
encoded, so there are a number of options. Windows was designed when Unicode
characters were only 16 bits long, so it encodes each character as two bytes
and can therefore only represent the BMP. This is what Microsoft's wide
character representation is (sometimes called UCS-2).
UTF-8 is a variable length encoding which can represent the whole Unicode
codespace (1-3 bytes for the BMP, 1-4 bytes for 21 bit Unicode, 1-6 bytes for
31 bit Unicode). It has several features that are good for backwards
compatibility. For details see: http://www.cl.cam.ac.uk/~mgk25/unicode.html
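The byte counts above follow from how many payload bits each UTF-8 sequence length can carry. As a rough sketch (in Python, chosen only for illustration; utf8_length and utf8_encode are hypothetical helper names, and this implements the original scheme of up to 6 bytes rather than any particular library's behaviour):

```python
# Exclusive upper bounds on the code points that fit in 1..6 bytes:
# 7, 11, 16, 21, 26 and 31 payload bits respectively.
LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]

def utf8_length(cp: int) -> int:
    """Return how many bytes the code point needs (1 to 6)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    for n, limit in enumerate(LIMITS, start=1):
        if cp < limit:
            return n
    raise ValueError("code point does not fit in 31 bits")

def utf8_encode(cp: int) -> bytes:
    """Encode one code point; no surrogate or range checks beyond 31 bits."""
    n = utf8_length(cp)
    if n == 1:
        return bytes([cp])          # plain ASCII is stored unchanged
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))   # continuation byte: 10xxxxxx
        cp >>= 6
    lead_marker = (0xFF << (8 - n)) & 0xFF   # n leading 1 bits, then a 0
    out.append(lead_marker | cp)
    return bytes(reversed(out))

print(utf8_length(0x41), utf8_length(0x20AC), utf8_length(0x10000))
print(utf8_encode(0x20AC).hex())
```

The ASCII case is the backwards-compatibility feature: any byte below 0x80 means the same thing in ASCII and in UTF-8, and continuation bytes never fall in that range.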
UTF-16 is another encoding, which uses 2 bytes for each character in the BMP
and 4 bytes (a pair of 16-bit "surrogate" values) for characters outside it.
It can reach code points up to 0x10FFFF.
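For the 2-or-4-byte split in UTF-16, a minimal sketch (again in Python purely for illustration; utf16_code_units is a hypothetical name) of how a code point becomes one or two 16-bit units:

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the 16-bit units UTF-16 uses for one code point."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters themselves")
    if cp < 0x10000:
        return [cp]                 # BMP: one unit, same as UCS-2
    if cp > 0x10FFFF:
        raise ValueError("beyond the range UTF-16 can reach")
    v = cp - 0x10000                # 20 bits remain to be split
    high = 0xD800 | (v >> 10)       # top 10 bits
    low = 0xDC00 | (v & 0x3FF)      # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units(0x20AC)])
print([hex(u) for u in utf16_code_units(0x1D11E)])
```

Code that assumes one 16-bit unit per character (the UCS-2 view) will see a character outside the BMP as two separate values.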
There are many other encodings but most applications use these two.
For more information read:
Hope this helps,