Re: Unicode escape sequences?

lua-l archive

Subject: Re: Unicode escape sequences?
From: "Stuart P. Bentley" <stuart@...>
Date: Fri, 02 Dec 2011 21:09:33 -0800

As Petite mentioned, Lua strings are plain byte strings, so you can include any byte above 0x7F unescaped (which covers the bytes of every multi-byte character in UTF-8).

If you want to escape Unicode characters, how depends on what byte encoding you want the character to be in. If you're using UTF-8 (generally the most sensible choice), you can include U+2500 as \226\148\128 (or \xe2\x94\x80 in 5.2).
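A quick way to check that an escaped string holds the bytes you expect is to dump them in hex. This is only a minimal sketch: it uses decimal escapes so it should not depend on the 5.2 \x syntax, and the variable names are just for illustration.

```lua
-- U+2500 as UTF-8, written with decimal escapes.
local box = "\226\148\128"

-- Collect each byte of the string as two hex digits.
local hex = {}
for i = 1, #box do
  hex[#hex + 1] = string.format("%02x", string.byte(box, i))
end
local hexdump = table.concat(hex, " ")

print(#box, hexdump)  -- 3 bytes: e2 94 80
```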

If you're using UTF-16 (in which case you'll likely have to make other changes to the system, at which point you can just as well add a \u character encoding yourself), you can just split the character into two hex encodings (in 5.2, anyway): \x25\x00 in big-endian byte order (without hex encodings, it would be \37\0).
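The same split can be done at run time. The following is a rough sketch, not a full UTF-16 encoder: utf16be is a hypothetical helper name, it assumes big-endian byte order, and it only handles code points below 0x10000 (anything higher would need a surrogate pair).

```lua
-- Hypothetical helper: encode a code point below 0x10000 as two
-- big-endian bytes. Code points that need surrogate pairs are rejected.
local function utf16be(cp)
  assert(cp >= 0 and cp < 0x10000, "only code points below 0x10000 are handled here")
  local high = math.floor(cp / 256)  -- upper eight bits
  local low = cp % 256               -- lower eight bits
  return string.char(high, low)
end

local encoded = utf16be(0x2500)
print(string.byte(encoded, 1, -1))  -- 37 and 0, i.e. the bytes of "\37\0"
```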

If you want this to be handled more easily, you can include something like HTML Unicode character entities, and then write a function that will do the conversion for you into your encoding of choice:


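  -- Convert a hexadecimal code point string (e.g. "2500") to its UTF-8 bytes.
  local function utf8(num)
    num = tonumber(num,16)
    local char = string.char
    local floor = math.floor
    local highbits = 7    -- payload bits in the lead byte (7 for plain ASCII)
    local sparebytes = 0  -- number of continuation bytes
    while num >= 2^(highbits + sparebytes * 6) do
      sparebytes = sparebytes + 1
      -- an n-byte lead byte carries 7-n payload bits: 110xxxxx, 1110xxxx, ...
      highbits = 6 - sparebytes
      if highbits < 1 then error "utf-8 sequence out of range" end
    end
    if sparebytes == 0 then
      return char(num)
    else
      local bytes = {}
      for i=1, sparebytes do
        -- continuation bytes: 10xxxxxx, lowest six bits go last
        local byte = floor((num / 2^((i-1)*6)) % 2^6)
        bytes[sparebytes+2-i] = char(byte + 2^7)
      end
      -- lead byte: (sparebytes+1) one bits, a zero bit, then the top payload bits
      local byte = floor(num / 2^(sparebytes*6))
      bytes[1] = char(byte + 2^8 - 2^(highbits+1))
      return table.concat(bytes)
    end
  end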

  return (string.gsub(input,"&u(%x%x%x%x%x?%x?);",utf8))
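As a usage sketch, the substitution can be wrapped in a function and tried on a sample string. To keep the example self-contained it assumes Lua 5.3 or later, where the standard library provides utf8.char (here utf8 is that library table, not the local function defined above); expand_entities is a hypothetical name chosen for illustration.

```lua
-- Assumes Lua 5.3 or later: utf8.char is the standard library encoder there.
-- expand_entities is a hypothetical wrapper name, not part of the original post.
local function expand_entities(input)
  -- The extra parentheses drop gsub's second result (the match count).
  return (string.gsub(input, "&u(%x%x%x%x%x?%x?);", function(hex)
    return utf8.char(tonumber(hex, 16))
  end))
end

local line = expand_entities("&u250c;&u2500;&u2510; box top")
print(line)
```

Entities with fewer than four hex digits, such as &u41;, do not match the pattern and are left untouched.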

For more information on the complexities of multi-byte character encoding (which Lua chooses not to address), see http://lua-users.org/wiki/LuaUnicode.

On Fri, 02 Dec 2011 14:08:22 -0800, Bernd Eggink <monoped@sudrala.de> wrote:

Hi all,
it seems that Lua 5.2.0 (rc4) doesn't support unicode escape sequences, such as \u2500. Is there any chance that this could be implemented in the final version? It would make handling of exotic characters much easier.
Greetings,
Bernd
