On Thursday 14 September 2006 7:50 am, Theodor-Iulian Ciobanu wrote:
> Thanks for all the info. As was obvious, I don't have much knowledge
> about Unicode, except that I have some logs here that need parsing, which
> I thought were UTF-16, 16 being the number of bits per character. Plus I
> was in a hurry yesterday, so I admit to not doing much searching at first,
> before posting.

A brief simplification of Unicode (might be wrong, but I hope to get close):

Unicode gives each character a number (a code point) from 0 to 0x10FFFF,
which needs 21 bits. Storing a full 32 bits for each character is wasteful
and incompatible with ASCII text (7 bits).

UTF-8 means that Unicode code points 0-127 are written in a single byte,
exactly matching ASCII. Higher code points start with a lead byte of 194-244
followed by one to three continuation bytes of 128-191, each carrying 6 bits
of the code point. The higher the code point, the higher the lead byte and
the more continuation bytes follow, up to four bytes in total.
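
To make the byte layout concrete, here is a rough sketch of a UTF-8 encoder
for a single code point. It is in Python only so it can be checked against a
built-in codec; the function name utf8_encode is made up for this example,
and the same bit arithmetic carries over to any language.

```python
def utf8_encode(cp):
    """Encode one Unicode code point (an int) as UTF-8 bytes.

    Hypothetical helper for illustration, not part of any library.
    """
    # Surrogates (0xD800-0xDFFF) and values past 0x10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:
        # 1 byte: 0xxxxxxx, identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check a few code points against Python's own UTF-8 codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Every continuation byte has the form 10xxxxxx (128-191), so a reader can
always tell a lead byte from the middle of a character.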

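UTF-16 uses two bytes as the minimum character width. It uses a similar
strategy for anything that doesn't fit: code points above 0xFFFF are written
as two 16-bit units (a surrogate pair), and the top bits of each unit mark it
as the first or second half of the pair.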
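
As a sketch of how UTF-16 grows past two bytes: code points above 0xFFFF are
split into a pair of 16-bit units (a surrogate pair), with the top bits of
each unit marking which half it is. The helper name utf16_units is invented
for this example; Python is used only so the result can be compared with a
built-in codec.

```python
def utf16_units(cp):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper for illustration, not part of any library.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x10000:
        # Fits in a single 16-bit unit.
        return [cp]
    # Subtract 0x10000, leaving a 20-bit value, and split it 10 + 10.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)     # first unit:  110110xxxxxxxxxx
    low = 0xDC00 | (v & 0x3FF)    # second unit: 110111xxxxxxxxxx
    return [high, low]


# U+1F600 does not fit in 16 bits, so it becomes a surrogate pair.
assert utf16_units(0x1F600) == [0xD83D, 0xDE00]
# U+20AC (the euro sign) fits in one unit.
assert utf16_units(0x20AC) == [0x20AC]
```

So "16 bits per character" only holds for code points up to 0xFFFF; anything
above that takes four bytes in UTF-16.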

The advantage of UTF-8 is that it can be read on any ASCII device and stays
(mostly) readable, with a few bytes of garbage in place of each 'special'
char. The downside of both UTF-8 and UTF-16 is that not all characters use
the same number of bytes, so search, replace, and other processing functions
don't work as easily as on fixed-width 8-bit chars.
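
A small illustration of why variable width gets in the way, again in Python
and with a made-up sample string: the byte count and the character count
disagree, and cutting at an arbitrary byte offset can split a character in
half. The helper name utf8_char_count is an assumption of this sketch.

```python
text = "na\u00efve caf\u00e9"          # 10 characters, two of them non-ASCII
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")       # little-endian, no byte-order mark

assert len(text) == 10
assert len(utf8) == 12                 # the two accented letters take 2 bytes each
assert len(utf16) == 20                # every character here takes 2 bytes


def utf8_char_count(data):
    """Count characters in valid UTF-8 by skipping continuation bytes.

    Continuation bytes look like 10xxxxxx, so (b & 0xC0) == 0x80.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


assert utf8_char_count(utf8) == 10

# Cutting after 3 bytes lands in the middle of the 2-byte character.
broken = utf8[:3]
try:
    broken.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
assert not split_ok
```

Byte-oriented functions keep working for plain ASCII parts of the text, but
anything that counts or cuts by position has to know about the encoding.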

Lua doesn't care what you put in a string, so you can use any character
encoding you like, as long as the input and output are correct. The only
thing to keep in mind is that the standard string library works on bytes,
not characters, so don't rely on it for anything that counts or splits
characters. But I think there are replacements for it.
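
For logs that really are UTF-16, one option is to convert them to UTF-8 once
and then treat the result as plain bytes. The sketch below uses Python bytes
as a stand-in for a byte-oriented string such as Lua's; the function name
utf16_log_to_utf8 and the sample log line are invented for the example.
Splitting on an ASCII delimiter is safe in UTF-8 because bytes below 128
never appear inside a multi-byte character.

```python
def utf16_log_to_utf8(raw):
    """Decode UTF-16 bytes (using the byte-order mark if present)
    and re-encode the text as UTF-8.

    Hypothetical helper for illustration.
    """
    return raw.decode("utf-16").encode("utf-8")


# Made-up sample line; the "utf-16" codec writes a byte-order mark first.
raw = "time=12:00;user=Jos\u00e9\n".encode("utf-16")

utf8 = utf16_log_to_utf8(raw)

# Byte-level processing: split on ASCII ';' without decoding anything.
fields = utf8.rstrip(b"\n").split(b";")
assert fields == [b"time=12:00", b"user=Jos\xc3\xa9"]
```

The same split applied directly to the UTF-16 bytes would not work as
intended, since every ASCII character there is padded with a zero byte.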

