- Subject: Re: Lost in Unicode
- From: Roberto Ierusalimschy <roberto@...>
- Date: Mon, 20 Oct 2003 10:20:59 -0200
> I'd like to write an application that operates on text, including strings
> containing accented letters such as "è" (I hope it shows correctly, it's an
> accented "e").
If the system uses ISO-8859-1 there is no problem at all. If it uses
utf-8 and the program source is also written in utf-8, the comparison
still works correctly. (Both `s' and "caffè" will have the same
internal representation, with 6 bytes: four for "caff" and two for "è".)
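As a small illustration of the byte counts involved (a sketch only; the
variable names are made up, and the strings are spelled with decimal
escapes so the result does not depend on the encoding of this message):

```lua
-- "caffè" in utf-8: "è" is the two bytes 195, 168
local utf8_caffe = "caff\195\168"
-- "caffè" in ISO-8859-1: "è" is the single byte 232
local iso_caffe = "caff\232"

print(string.len(utf8_caffe))    --> 6
print(string.len(iso_caffe))     --> 5
-- The same word in two encodings does not compare equal byte for byte.
print(utf8_caffe == iso_caffe)   --> false
```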
If the system may use two different representations, the simplest
solution is to translate to a fixed representation as soon as you read
something. If you can assume that all relevant utf-8 text can be mapped
to ISO-8859-1, it is better to use ISO-8859-1 internally. It is easy
to write a function to translate utf-8 to ISO-8859-1:
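function toISO (s)
  -- Lead bytes 196-255 encode code points above 255, which ISO-8859-1 lacks.
  if string.find(s, "[\196-\255]") then error("non-ISO char") end
  -- Fold each two-byte sequence (lead 192-195, continuation 128-191) into one byte.
  s = string.gsub(s, "([\192-\195])([\128-\191])", function (c1, c2)
    c1 = string.byte(c1) - 192
    c2 = string.byte(c2) - 128
    return string.char(c1 * 64 + c2)
  end)
  return s
end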
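If the text must go back out as utf-8, the opposite translation is just
as short. The following is only a sketch, assuming the internal strings
are ISO-8859-1; the name `fromISO' is hypothetical and not part of any
library:

```lua
-- Hypothetical inverse of toISO: ISO-8859-1 bytes -> utf-8.
function fromISO (s)
  -- Each byte 128-255 becomes a two-byte utf-8 sequence;
  -- ASCII bytes (0-127) pass through unchanged.
  -- The outer parentheses drop gsub's second result (the match count).
  return (string.gsub(s, "[\128-\255]", function (c)
    local b = string.byte(c)
    local hi = math.floor(b / 64)   -- top two bits of the code point
    local lo = b - hi * 64          -- low six bits
    return string.char(192 + hi, 128 + lo)
  end))
end

print(fromISO("caff\232") == "caff\195\168")   --> true
```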
-- Roberto