On Thu, 2006-09-14 at 15:19 +0200, Klaus Ripke wrote:
> If you do not care about those rarely used high code points,
> you may just ignore this feature and consider every character 16bit long.
> That's called UCS-2.
> 

Here's where the big gotcha comes with Unicode. A code point does not
equal a "character". In Unicode you can compose "characters" (aka
graphemes) from multiple code points. An a+umlaut, even though it's a
single Latin-1 character in the older ISO standards, can be represented
by one or two 16-bit code point values: the precomposed U+00E4, or
U+0061 followed by the combining diaeresis U+0308.
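
To make the a+umlaut case concrete, here is a minimal sketch in Python
(chosen only for illustration; nothing in it is Lua-specific). The names
`composed`, `decomposed` and `utf16_units` are hypothetical helpers for
this example, not part of any library.

```python
# Two encodings of the same visible grapheme, a+umlaut.
composed = "\u00e4"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS


def utf16_units(s):
    """Return how many 16-bit code units s occupies when encoded as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# Number of code points in each form.
print(len(composed), len(decomposed))
# Number of 16-bit units in each form; both are inside the BMP,
# so "UCS-2" sees exactly the same sequences.
print(utf16_units(composed), utf16_units(decomposed))
# A plain code point comparison treats the two forms as different strings.
print(composed == decomposed)
```

Both strings render as the same grapheme, yet neither their lengths nor
their contents match, and no surrogate pairs are involved.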

"UCS-2" does not solve any of these issues, even if you constrain your
environment as you have done above (where you assume high codepoints
will never be encountered). You would need to implement a Unicode
normalization function, which puts you squarely back in the camp of
having to parse Unicode intelligently and with a specialized API.
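
As a rough illustration of what "a normalization function" means in
practice, the sketch below uses Python's standard `unicodedata` module,
which ships the Unicode character tables. A from-scratch implementation
would need those same tables, which is the specialized API I mean. The
variable names are hypothetical.

```python
import unicodedata

composed = "\u00e4"      # precomposed a+umlaut
decomposed = "a\u0308"   # "a" plus combining diaeresis

# NFC composes sequences into precomposed code points where one exists.
nfc = unicodedata.normalize("NFC", decomposed)
# NFD does the opposite and splits precomposed code points apart.
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)    # both forms agree once normalized to NFC
print(nfd == decomposed)  # and likewise once normalized to NFD
```

Either form works as a canonical representation, as long as every
string is put through the same one before it is compared or stored.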

There is no _easy_ way to do Unicode. You cannot simply say, "I don't
care about high codepoint 'characters' so I'll parse UTF8/UTF16/UTF32
naively." That just doesn't fly; there have been and there will be many,
many instances where this behavior causes compatibility problems and
security vulnerabilities (where a broken "streq()" function can be
subverted by attackers who have bothered to read the Unicode standard,
and not just specs on codepoint representation schemes).
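
A sketch of the kind of broken "streq()" I have in mind, again in Python
and under stated assumptions: `naive_streq` and `normalized_streq` are
hypothetical functions written for this example, and the user names are
made up. NFC is only one ingredient; it does nothing about case folding
or visually confusable characters.

```python
import unicodedata


def naive_streq(a, b):
    """Compare the encoded bytes only, as a naive streq() would."""
    return a.encode("utf-8") == b.encode("utf-8")


def normalized_streq(a, b):
    """Compare after bringing both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


existing = "Zo\u00eb"     # already registered, precomposed e+diaeresis
lookalike = "Zoe\u0308"   # renders the same, uses a combining diaeresis

# The naive check lets the lookalike through as a "different" name.
print(naive_streq(existing, lookalike))
# The normalized check recognizes the two as the same name.
print(normalized_streq(existing, lookalike))
```

An application that checks uniqueness with the first function but
displays names to humans will happily accept two accounts that look
identical on screen.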

-- 
William Ahern <wahern@barracudanetworks.com>

