

becoming a little bit OT, but ...

On Thu, Sep 14, 2006 at 08:03:42AM -0500, Javier Guerra wrote:
> UTF-8 means that Unicode characters 0-127 are written in a single byte, 
> exactly matching ASCII.  higher codes start with 128-191 (6 bits) and a 
> second byte. even higher codes are done using higher values for the first 
> byte and more bytes after that.
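Bytes in 128-191 (leading bits 10...) are continuation bytes and are
never the first byte of a sequence. In the first byte of a multi-byte
sequence, the number of leading 1 bits gives the length of the whole
sequence in bytes (110... = 2, 1110... = 3, 11110... = 4).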
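
As a rough illustration of that rule, here is a minimal Python sketch.
The function name utf8_sequence_length is made up for this example, and
the check is deliberately simplified: it only inspects the leading bits
and does not reject overlong or out-of-range lead bytes.

```python
def utf8_sequence_length(first_byte):
    """Return the length in bytes of a UTF-8 sequence, given its first byte.

    Hypothetical helper for illustration; it only looks at the leading bits.
    """
    if first_byte < 0x80:   # 0xxxxxxx: plain ASCII, one byte
        return 1
    if first_byte < 0xC0:   # 10xxxxxx: continuation byte, never a first byte
        raise ValueError("continuation byte cannot start a sequence")
    if first_byte < 0xE0:   # 110xxxxx: two-byte sequence
        return 2
    if first_byte < 0xF0:   # 1110xxxx: three-byte sequence
        return 3
    if first_byte < 0xF8:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


# Compare against Python's own encoder for a few sample characters.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
```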

> UTF-16 uses two bytes as the minimum character width.  it uses the same 
> strategy of setting the first few bits of the first character to add more 
> bytes to a character.
UTF-16 does not add bytes one at a time, and the first few bits do not
give the length. Instead it uses surrogate pairs: two 16-bit code units
from reserved ranges (0xD800-0xDBFF followed by 0xDC00-0xDFFF) that
together encode one code point between 64K (0x10000) and 0x10FFFF,
i.e. just over 1M.
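
The surrogate arithmetic can be sketched in a few lines of Python. The
function names below are invented for this example; the constants are
the standard UTF-16 surrogate ranges (high 0xD800-0xDBFF, low
0xDC00-0xDFFF).

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is outside the 16-bit range, so it needs a pair.
pair = to_surrogate_pair(0x1F600)
assert pair == (0xD83D, 0xDE00)
assert from_surrogate_pair(*pair) == 0x1F600

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert struct.unpack(">HH", "\U0001F600".encode("utf-16-be")) == pair
```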

If you do not care about those rarely used high code points,
you may just ignore this feature and treat every character as 16 bits long.
That's called UCS-2.

> the downside of both UTF-8 and UTF-16 is that not all characters use the same 
> number of bytes, therefore search, replace, and other processing functions 
> don't work as easily as on limited 8-bit chars.
Many "implementations of UTF-16" actually implement UCS-2, where every
character has the same fixed width, and thus can support such
operations very efficiently.
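
To show why fixed-width 16-bit units make this cheap, here is a small
Python sketch. The helper ucs2_char_at is hypothetical and assumes the
data is little-endian and contains no surrogate pairs (the UCS-2
assumption); under that assumption the n-th character is simply the
two bytes at offset 2*n.

```python
def ucs2_char_at(data, index):
    """Return the character at position index in UCS-2 (little-endian) data.

    Assumes every character is exactly one 16-bit unit, so no scanning
    from the start of the string is needed.
    """
    unit = int.from_bytes(data[2 * index:2 * index + 2], "little")
    return chr(unit)


text = "h\u00e9llo w\u00f6rld"          # "héllo wörld", all below 64K
data = text.encode("utf-16-le")

assert len(data) // 2 == len(text)      # one 16-bit unit per character
assert ucs2_char_at(data, 7) == "\u00f6"

# With a code point above 64K the assumption breaks: three characters
# occupy four 16-bit units, so positions after it are shifted by one.
wide = "a\U0001F600b".encode("utf-16-le")
assert len(wide) // 2 == 4
```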