Re: Lost in Unicode

lua-l archive

Subject: Re: Lost in Unicode
From: Roberto Ierusalimschy <roberto@...>
Date: Mon, 20 Oct 2003 10:20:59 -0200

> I'd like to write an application that operates on text, including strings
> containing accented letters such as "è" (I hope it shows correctly, it's an 
> accented "e").

If the system uses ISO-8859-1, there is no problem at all. If it uses
utf-8 and the program source is also written in utf-8, the comparison
still works correctly. (Both `s' and "caffè" will have the same
internal representation, with 6 bytes: 4 for "caff" and 2 for "è".)
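
To make the byte counts concrete, here is a small sketch. It spells the
strings with decimal byte escapes, which is an assumption made only so
that the result does not depend on how this message or a source file is
encoded; the variable names are made up for the illustration.

```lua
-- "caffè" written with explicit byte escapes (illustrative names)
local utf8_caffe = "caff\195\168"  -- è as the utf-8 pair 0xC3 0xA8
local iso_caffe  = "caff\232"      -- è as the single ISO-8859-1 byte 0xE8

print(string.len(utf8_caffe))            -- 6
print(string.len(iso_caffe))             -- 5
-- Lua compares strings byte by byte, so equal encodings compare equal
print(utf8_caffe == "caff\195\168")      -- true
-- ...and the same text in two different encodings does not
print(utf8_caffe == iso_caffe)           -- false
```

The last line is the case where the two representations get mixed, which
is why translating to one fixed representation on input matters.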

If the system may use two different representations, the simplest
solution is to translate to a fixed representation as soon as you read
something. If you can assume that all relevant utf-8 text can be mapped
to ISO-8859-1, it is better to use ISO-8859-1 internally. It is easy
to write a function to translate utf-8 to ISO-8859-1:

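function toISO (s)
  -- lead bytes 196 and up encode code points above U+00FF,
  -- which ISO-8859-1 cannot represent
  if string.find(s, "[\196-\255]") then error("non-ISO char") end
  -- a lead byte 192-195 plus one continuation byte covers up to U+00FF
  s = string.gsub(s, "([\192-\195])([\128-\191])", function (c1, c2)
        c1 = string.byte(c1) - 192   -- top 2 bits of the code point
        c2 = string.byte(c2) - 128   -- low 6 bits of the code point
        return string.char(c1 * 64 + c2)
      end)
  return s
end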
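
If the program also has to write text back out on a utf-8 system, the
opposite translation is just as short. This is only a sketch: the name
fromISO is hypothetical, and it assumes Lua 5.1 or later for the `%'
operator.

```lua
-- Hypothetical inverse translation: ISO-8859-1 bytes to utf-8.
local function fromISO (s)
  -- each byte 128-255 becomes a two-byte sequence; ASCII is untouched
  return (string.gsub(s, "[\128-\255]", function (c)
    local b = string.byte(c)
    -- lead byte carries the top 2 bits, continuation byte the low 6
    return string.char(192 + math.floor(b / 64), 128 + b % 64)
  end))
end

print(fromISO("caff\232") == "caff\195\168")  -- true
```

The extra parentheses around string.gsub drop its second result (the
substitution count), so the function returns only the string.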

-- Roberto

Follow-Ups:
- Re: Lost in Unicode, Enrico Colombini

References:
- Lost in Unicode, Enrico Colombini
