
On 28-Oct-05, at 6:56 PM, David Given wrote:

> On Friday 28 October 2005 22:13, Rici Lake wrote:
>> The full pattern: [^\128-\191][\128-\191]
>>
>> "Not a continuation byte" followed by 0 or more "continuation bytes"
>
> Should there be a * on the end of that pattern? Because what you wrote matches
> 'not a continuation byte' followed by 'exactly one continuation byte'.

Quite right; that was a cut-and-paste error. The pattern in the original message was correct.
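
To make the difference concrete, here is a minimal sketch comparing the two patterns on one short string. The helper name `collect` and the sample string are only illustrative, and the sketch uses `string.gmatch`, the Lua 5.1 name for `gfind`.

```lua
-- Compare the pattern with and without the trailing "*".
-- "ma\195\177ana" is "mañana" written with explicit UTF-8 bytes.
local s = "ma\195\177ana"

-- Hypothetical helper: gather every match of a pattern into a table.
local function collect(pattern)
  local out = {}
  for seq in s:gmatch(pattern) do
    out[#out + 1] = seq
  end
  return out
end

local with_star    = collect("[^\128-\191][\128-\191]*")
local without_star = collect("[^\128-\191][\128-\191]")

print(#with_star)     -- 6: one match per character
print(#without_star)  -- 1: only the two-byte sequence for "ñ" matches
```

Without the `*`, every single-byte character is silently dropped, because the pattern insists on exactly one continuation byte after each lead byte.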

Here's some sample code, which simply turns every character in all of the command line arguments into a U+hex code:

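-- Non validating (and potentially faster) implementation
function string.eachutf8(str)
  -- Each match is one lead (non-continuation) byte plus any continuation bytes.
  -- string.gfind was renamed string.gmatch in Lua 5.1; use gmatch on later versions.
  return str:gfind("[^\128-\191][\128-\191]*")
end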

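-- Map each valid lead byte to the payload bits it contributes.
local prefix = {}
for i = 0, 127 do prefix[i] = i end          -- ASCII: the byte is the code point
for i = 194, 223 do prefix[i] = i - 192 end  -- lead byte of a 2-byte sequence
for i = 224, 239 do prefix[i] = i - 224 end  -- lead byte of a 3-byte sequence
for i = 240, 244 do prefix[i] = i - 240 end  -- lead byte of a 4-byte sequence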

function string.toucode(seq)
  local accum = prefix[seq:byte(1)]
  -- Each continuation byte contributes its low six bits.
  for i = 2, #seq do
    accum = accum * 64 + seq:byte(i) % 64
  end
  return accum
end

for utf8seq in table.concat(arg, " "):eachutf8() do
  io.write(("U+%X "):format(utf8seq:toucode()))
end

Note that only three lines of this code are actually the library function :) The following test makes peculiar use of `cat` so that I can type the test line without going through ncurses, which has no Unicode support on the OS I use (although the terminal does):

rlake@freeb:~/xml/lualib$ lua51 quickutf8.lua `cat`
Mañana.  ЖЄЫЩ  ฌญมかぢに∀x.x∌ℜ
U+4D U+61 U+F1 U+61 U+6E U+61 U+2E U+20 U+416 U+404 U+42B U+429 U+20 U+E0C U+E0D U+E21 U+304B U+3062 U+306B U+2200 U+78 U+2E U+78 U+220C U+211C

I don't speak any of the languages except the first one, so I hope I haven't committed any major faux pas by twiddling at the keyboard.
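
For anyone who wants to check the arithmetic, the U+F1 in the output above can be reproduced by hand. This is only a sketch of the same steps toucode performs, written out for the two-byte sequence for "ñ"; the variable names are mine and purely illustrative.

```lua
-- Hand-decode "ñ" (bytes 195, 177) using the same arithmetic as toucode.
local lead, cont = ("\195\177"):byte(1, 2)

local payload = lead - 192   -- 195 - 192 = 3: strip the "110" marker bits of the lead byte
local low = cont % 64        -- 177 % 64 = 49: low six bits of the continuation byte
local codepoint = payload * 64 + low

print(("U+%X"):format(codepoint))  -- prints U+F1
```

Longer sequences work the same way: each extra continuation byte shifts the accumulator left by six bits (the multiplication by 64) and adds its own low six bits.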