As Petite mentioned, you can include any byte above 0x7F unescaped in a Lua string, since Lua strings are plain byte sequences (this covers every byte of a multi-byte UTF-8 character).

If you want to escape Unicode characters, the escape to use depends on the byte encoding you want the character in. With UTF-8 (generally the most sensible choice), you can write U+2500 as \226\148\128 (or \xe2\x94\x80 in 5.2).
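As a quick check of the byte values above, the following sketch compares the decimal-escaped literal with the same three bytes built via string.char. It sticks to decimal escapes so that it should also run on Lua 5.1; the variable names are only illustrative.

```lua
-- U+2500 encoded as UTF-8 is the byte sequence E2 94 80.
local escaped = "\226\148\128"
local built = string.char(0xE2, 0x94, 0x80)

print(escaped == built)  -- true
print(#escaped)          -- 3, because Lua counts bytes, not characters
```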

If you're using UTF-16 (in which case you'll likely have to make other changes to the system, at which point you might as well add a \u escape yourself), you can split the 16-bit code unit into two byte escapes, shown here in big-endian order: \x25\x00 in 5.2, or \37\0 with decimal escapes.
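The same split can be done at run time. This is only a sketch: utf16be is a hypothetical helper name, it assumes big-endian byte order, and it handles a single 16-bit code unit only (no surrogate pairs for code points above U+FFFF).

```lua
-- Hypothetical helper: encode a code point below U+10000 as one UTF-16
-- code unit in big-endian byte order. Surrogate pairs are not handled,
-- and the surrogate range D800-DFFF is not rejected.
local function utf16be(codepoint)
  assert(codepoint >= 0 and codepoint < 0x10000, "BMP code points only")
  local high = math.floor(codepoint / 256)  -- upper 8 bits
  local low = codepoint % 256               -- lower 8 bits
  return string.char(high, low)
end

print(utf16be(0x2500) == "\37\0")  -- true
```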

If you want this to be handled more easily, you can embed something like HTML-style Unicode character entities (&u2500;, for example) and write a function that converts them into your encoding of choice:

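  -- Convert a string of hex digits (a code point) to its UTF-8 bytes.
  -- Values above 10FFFF produce the obsolete 5- and 6-byte forms.
  local function utf8(num)
    num = tonumber(num,16)
    local char = string.char
    local floor = math.floor
    local highbits = 7    -- payload bits available in the lead byte
    local sparebytes = 0  -- number of continuation bytes
    while num >= 2^(highbits + sparebytes * 6) do
      -- 0xxxxxxx -> 110xxxxx costs two payload bits, each later step one
      highbits = highbits - (sparebytes == 0 and 2 or 1)
      if highbits < 1 then error "utf-8 sequence out of range" end
      sparebytes = sparebytes + 1
    end
    if sparebytes == 0 then
      return char(num)
    else
      local bytes = {}
      for i=1, sparebytes do
        -- six payload bits per continuation byte, tagged 10xxxxxx
        local byte = floor((num / 2^((i-1)*6)) % 2^6)
        bytes[sparebytes+2-i] = char(byte + 2^7)
      end
      -- lead byte: sparebytes+1 one-bits, a zero bit, then the top payload bits
      local byte = floor(num / 2^(sparebytes*6))
      bytes[1] = char(byte + 2^8 - 2^(highbits+1))
      return table.concat(bytes)
    end
  end

  -- input is the string containing the &uXXXX; entities
  return (string.gsub(input,"&u(%x%x%x%x%x?%x?);",utf8))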
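To illustrate how the gsub call drives the conversion, here is a self-contained sketch in which the encoder is passed in as a parameter. expand_entities is a hypothetical name, and the use of utf8.char assumes Lua 5.3 or later, where the standard utf8 library is available; on older versions that branch is simply skipped.

```lua
-- Hypothetical helper: replace each "&uXXXX;" entity (4 to 6 hex digits)
-- with whatever string the supplied encoder returns for that code point.
local function expand_entities(input, encode)
  return (string.gsub(input, "&u(%x%x%x%x%x?%x?);", function(hex)
    return encode(tonumber(hex, 16))
  end))
end

-- Assumption: on Lua 5.3+ the standard utf8.char can serve as the encoder.
if utf8 and utf8.char then
  print(expand_entities("+&u2500;&u2500;+", utf8.char))
end
```

The outer parentheses around string.gsub drop its second return value (the substitution count), so the helper returns only the converted string.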

For more information on the complexities of multi-byte character encoding (which Lua chooses not to address), see http://lua-users.org/wiki/LuaUnicode.

On Fri, 02 Dec 2011 14:08:22 -0800, Bernd Eggink <monoped@sudrala.de> wrote:

Hi all,

it seems that Lua 5.2.0 (rc4) doesn't support unicode escape sequences, such as \u2500. Is there any chance that this could be implemented in the final version? It would make handling of exotic characters much easier.

Greetings,
Bernd