David Given <dg@cowlark.com> wrote:

> It'll convert from any Unicode encoding to any other Unicode encoding.
> In your case, tell it to convert to UCS-32 and then you can read each
> Unicode code point as a number type.

You can also wrap the standard I/O functions in a Unicode compatibility
layer for more natural usage. Something like this module (written
minutes ago and not extensively tested):

-------------------
module("uniopen", package.seeall)

local iconv = require "iconv"

local mt = { __index = _M }

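-- Open fname for reading; text is converted from fromcharset to tocharset
-- (default "UTF-8"). Returns a wrapper object, or nil plus an error message.
function open(fname, mode, fromcharset, tocharset)
  mode = mode or "r"
  assert(mode == "r" or mode == "rb", "Only read modes are supported so far")
  tocharset = tocharset or "UTF-8"
  -- lua-iconv follows iconv_open(3): the target charset comes first.
  local cd = assert(iconv.new(tocharset, fromcharset), "Bad charset")
  local fp, err = io.open(fname, mode)
  if not fp then
    return nil, err
  end
  return setmetatable({ fp = fp, cd = cd }, mt)
end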

function read(fp, mod)
  assert(fp and fp.fp and fp.cd, "Bad file descriptor")
  local ret = fp.fp:read(mod)
  if ret then
    return fp.cd:iconv(ret)  -- returns: string, error code
  else
    return nil
  end
end

function close(fp)
  assert(fp and fp.fp, "Bad file descriptor")
  return fp.fp:close()  -- pass the result of the underlying close through
end
-------------------

As noted above, splitting Unicode text into characters is a fairly complex
subject. You cannot safely use fp:read(some_number_of_bytes): a fixed-size
read may stop in the middle of a multibyte sequence, and the conversion
then yields iconv.ERROR_INCOMPLETE. So there is no easy way to limit your
input to a safe length. I have not used slnunicode yet, but I think it has
some functions for these operations.
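
If the input happens to be UTF-8, one possible workaround is to trim each
fixed-size chunk back to the last complete character before converting it,
and to carry the unfinished bytes over to the next read. The sketch below
is untested against real files and assumes UTF-8 input only (UTF-16 and
others need different boundary rules); split_utf8 is a name made up for
this example, not part of lua-iconv or slnunicode:

```lua
-- Hypothetical helper (assumes UTF-8 input): split a byte string into the
-- longest prefix that ends on a character boundary and the unfinished
-- tail, which may be empty.
local function split_utf8(buf)
  local len = #buf
  -- A UTF-8 sequence is at most 4 bytes long, so look back at most 4 bytes.
  for i = len, math.max(1, len - 3), -1 do
    local b = buf:byte(i)
    if b < 0x80 then
      break                       -- ASCII byte: no sequence is pending
    elseif b >= 0xC0 then
      -- Lead byte: its value tells how long the whole sequence must be.
      local need = b >= 0xF0 and 4 or b >= 0xE0 and 3 or 2
      if len - i + 1 < need then
        return buf:sub(1, i - 1), buf:sub(i)
      end
      break                       -- the last sequence is complete
    end
    -- Otherwise b is a continuation byte (0x80..0xBF): keep looking back.
  end
  return buf, ""
end

-- The euro sign U+20AC is three bytes in UTF-8 (E2 82 AC); cut it short.
local head, tail = split_utf8("ab\226\130")
-- head is "ab"; tail holds the two leftover bytes.
```

The head can then be converted, and the tail prepended to the next chunk
read from the file. Malformed input is passed through unchanged, so the
converter still gets the chance to report it.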


-- 
Alexandre Erwin Ittner - aittner@netuno.com.br
OpenPGP pubkey 0x0041A1FB @ http://pgp.mit.edu