- Subject: Re: newbie - Lua and unicode
- From: Javier Guerra <javier@...>
- Date: Thu, 14 Sep 2006 08:03:42 -0500
On Thursday 14 September 2006 7:50 am, Theodor-Iulian Ciobanu wrote:
> Thanks for all the info. As was obvious, I don't have much knowledge
> about Unicode, except I have some logs here that need parsing that I
> thought were UTF-16, 16 being the number of bits/character. Plus I was in a
> hurry yesterday, so I admit not doing too much searching at first, before
> posting.
a brief simplification of Unicode: (might be wrong, but i hope to get near..)
Unicode defines a code space of a bit over a million code points (U+0000 to
U+10FFFF, which fits in 21 bits); but storing a fixed 32 bits for each
character is wasteful and incompatible with ASCII text (7 bits)
UTF-8 means that Unicode characters 0-127 are written in a single byte,
exactly matching ASCII. higher codes start with a lead byte of the form
110xxxxx (5 bits) followed by a continuation byte of the form 10xxxxxx
(128-191, 6 bits). even higher codes use higher values for the first byte
and more continuation bytes after it, up to four bytes in total.
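
to make the UTF-8 byte layout concrete, here is a minimal sketch in plain Lua.
it assumes the input is a valid code point (0 to 0x10FFFF) and uses only
arithmetic, so it should not depend on any bit library. the name utf8_encode
is made up for this example; it is not part of Lua's standard library.

```lua
-- Sketch: encode one Unicode code point (an integer) as a UTF-8 byte string.
-- utf8_encode is a hypothetical helper, not a standard Lua function.
-- Assumes 0 <= cp <= 0x10FFFF; no validation is done.
local function utf8_encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    -- 0xxxxxxx: plain ASCII, one byte
    return char(cp)
  elseif cp < 0x800 then
    -- 110xxxxx 10xxxxxx
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 1110xxxx 10xxxxxx 10xxxxxx
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

print(#utf8_encode(0x41))    --> 1  ("A")
print(#utf8_encode(0x20AC))  --> 3  (euro sign)
```
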
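UTF-16 uses two bytes as the minimum character width. characters above
U+FFFF don't fit in 16 bits, so they are written as a pair of 16-bit units
(a surrogate pair): the first in the range D800-DBFF, the second in the
range DC00-DFFF, each carrying 10 bits of the character.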
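
for UTF-16, code points above U+FFFF are written as two 16-bit units (a
surrogate pair). a small sketch of that split, again in plain Lua and assuming
a valid code point as input; utf16_units is a made-up name for this example.

```lua
-- Sketch: split a code point into its UTF-16 code units.
-- utf16_units is a hypothetical helper, not a standard Lua function.
-- Returns one number for code points up to 0xFFFF, two numbers otherwise.
local function utf16_units(cp)
  if cp < 0x10000 then
    return cp                                    -- fits in one 16-bit unit
  end
  local v = cp - 0x10000                         -- 20 bits left to split
  local high = 0xD800 + math.floor(v / 0x400)    -- top 10 bits
  local low  = 0xDC00 + v % 0x400                -- bottom 10 bits
  return high, low
end

print(string.format("%04X %04X", utf16_units(0x1D11E)))  --> D834 DD1E
```
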
the advantage of UTF-8 is that it can be simply read on any ASCII device, and
it's (mostly) readable, with a few characters of garbage in place of each
'special' char.
the downside of both UTF-8 and UTF-16 is that not all characters use the same
number of bytes, therefore search, replace, and other processing functions
don't work as easily as on limited 8-bit chars.
Lua doesn't care what you put in a string, so you can simply use any
character encoding you like, as long as the input and output are correct. the
only thing you have to keep in mind is that the standard string library works
on bytes, not characters, so don't rely on it for per-character processing.
but i think there are replacements for it.
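
a quick illustration of why byte-oriented string functions are a problem for
multi-byte text. this is only a sketch: it assumes the string holds valid
UTF-8, and utf8_len is a made-up helper name, not a standard function.

```lua
-- "café": the "é" (U+00E9) is written as its two UTF-8 bytes,
-- using decimal escapes so the source file itself stays plain ASCII.
local s = "caf\195\169"

print(#s)           --> 5  (bytes, not characters)
print(s:sub(1, 4))  -- cuts the "é" in half, leaving a stray lead byte

-- Sketch of a character count for valid UTF-8: count every byte that is
-- NOT a continuation byte (continuation bytes are 128-191).
-- utf8_len is a hypothetical helper.
local function utf8_len(str)
  local n = 0
  for i = 1, #str do
    local b = str:byte(i)
    if b < 128 or b > 191 then
      n = n + 1
    end
  end
  return n
end

print(utf8_len(s))  --> 4
```
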
--
Javier