- Subject: Re: Unicode escape sequences?
- From: "Stuart P. Bentley" <stuart@...>
- Date: Fri, 02 Dec 2011 21:09:33 -0800
As Petite mentioned, you can include any byte above 0x7F unescaped in a
Lua string, since Lua strings are plain sequences of bytes (this covers
every byte of a multi-byte UTF-8 character).
If you want to escape Unicode characters, how you do it depends on which
byte encoding you want the character to be in. If you're using UTF-8
(generally the most sensible choice), you can include U+2500 as
\226\148\128 (or \xe2\x94\x80 in 5.2).
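As a check on those byte values, here is a small sketch that derives the
three UTF-8 bytes of U+2500 by hand. It only covers the three-byte range
(U+0800 through U+FFFF), and the variable names are just for illustration.

```lua
-- U+2500 falls in the three-byte UTF-8 range: 1110xxxx 10xxxxxx 10xxxxxx
local cp = 0x2500
local b1 = 224 + math.floor(cp / 4096)      -- lead byte: top 4 bits
local b2 = 128 + math.floor(cp / 64) % 64   -- continuation: middle 6 bits
local b3 = 128 + cp % 64                    -- continuation: low 6 bits

local s = string.char(b1, b2, b3)
assert(s == "\226\148\128")

print(b1, b2, b3)                                   -- 226, 148, 128
print(string.format("%02x %02x %02x", b1, b2, b3))  -- e2 94 80
```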
If you're using UTF-16 (in which case you'll likely have to make other
changes to the system, at which point you can just as well add a \u
escape yourself), you can just split the 16-bit code unit into its two
bytes and escape each one in hex (in 5.2, anyway): \x25\x00 for big-endian
byte order (without hex escapes, it would be \37\0).
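The byte split can be sketched as below. This assumes big-endian byte
order (matching the example above) and only handles characters that fit in
a single 16-bit code unit; anything above U+FFFF would need a surrogate
pair, which this sketch does not attempt. The name utf16be is made up for
the example.

```lua
-- Encode one code point as a single big-endian UTF-16 code unit.
-- Hypothetical helper; rejects surrogates and anything above U+FFFF.
local function utf16be(cp)
  assert(cp >= 0 and cp <= 0xFFFF, "needs a surrogate pair")
  assert(cp < 0xD800 or cp > 0xDFFF, "surrogate code point")
  local hi = math.floor(cp / 256)  -- high byte goes first (big-endian)
  local lo = cp % 256
  return string.char(hi, lo)
end

assert(utf16be(0x2500) == "\37\0")
```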
If you want this to be handled more easily, you can embed something like
HTML-style Unicode character entities (for example &u2500;) in your
strings, and then write a function that converts them into your encoding
of choice:
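-- Convert a hexadecimal code point string (e.g. "2500") to UTF-8 bytes.
local function utf8(num)
  num = tonumber(num,16)
  local char = string.char
  local floor = math.floor
  if num > 0x10FFFF then error "utf-8 sequence out of range" end
  -- ASCII needs a single byte and no continuation bytes
  if num < 2^7 then return char(num) end
  -- a lead byte followed by n continuation bytes has 6-n payload bits
  local sparebytes = 1
  while num >= 2^(6 - sparebytes + sparebytes * 6) do
    sparebytes = sparebytes + 1
  end
  local bytes = {}
  -- continuation bytes (10xxxxxx), filled from the lowest 6 bits upward
  for i=1, sparebytes do
    local byte = floor((num / 2^((i-1)*6)) % 2^6)
    bytes[sparebytes+2-i] = char(byte + 2^7)
  end
  -- lead byte: sparebytes+1 high bits set, then the remaining payload
  local byte = floor(num / 2^(sparebytes*6))
  bytes[1] = char(byte + 2^8 - 2^(7 - sparebytes))
  return table.concat(bytes)
end

-- Replace every &uXXXX; entity (4 to 6 hex digits) in input.
local function expand(input)
  return (string.gsub(input,"&u(%x%x%x%x%x?%x?);",utf8))
end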
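To see what that gsub pattern actually picks up, here is a small sketch
with a stand-in callback that only labels each match instead of encoding
it. The sample text and the <U+...> output format are made up for the
example; the pattern is the one used above.

```lua
-- The pattern requires "&u", then 4 to 6 hex digits, then ";".
local text = "corner: &u2500; face: &u1F600; untouched: &u25;"
local result, count = string.gsub(text, "&u(%x%x%x%x%x?%x?);", function(hex)
  -- hex is the captured digit string, without the "&u" and ";"
  return "<U+" .. string.upper(hex) .. ">"
end)

print(result)  -- corner: <U+2500> face: <U+1F600> untouched: &u25;
print(count)   -- 2
```

Entities with fewer than four hex digits are left alone, so U+0025 would
have to be written as &u0025; rather than &u25;.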
For more information on the complexities of multi-byte character encoding
(which Lua chooses not to address), see
http://lua-users.org/wiki/LuaUnicode.
On Fri, 02 Dec 2011 14:08:22 -0800, Bernd Eggink <monoped@sudrala.de>
wrote:
Hi all,
it seems that Lua 5.2.0 (rc4) doesn't support unicode escape sequences,
such as \u2500. Is there any chance that this could be implemented in
the final version? It would make handling of exotic characters much
easier.
Greetings,
Bernd