On Thu, 2006-09-14 at 13:30 -0700, William Ahern wrote:
> Here's where the big gotcha comes with Unicode. A code point does not
> equal a "character". In unicode you can compose "characters" (aka
> graphemes), using multiple codepoint entities. An a+umlaut, even though
> it's a latin1 character in the older ISO standards, can be represented
> by one or three 16-bit codepoint values.
> 

Actually, there are three ways to encode what shows up on screen as
a+umlaut, and whether they count as equivalent depends on the
application and usage. If I were scanning logs visually and grepping for
a+umlaut, I'd probably want my search key to match all of these:

1) U+00E4
2) U+0061 U+0308
3) U+0061 U+034F U+0308

These examples are valid in both UCS-2 and UTF-16.
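
To make the matching concrete, here is a rough Python sketch of one way
a search key could fold the three forms together. It is an illustration
under stated assumptions, not something from this thread: it assumes the
application is happy to treat U+034F (COMBINING GRAPHEME JOINER) as
ignorable and to compare under NFC, and the names CGJ, search_key and
utf16_units are made up for the example. Plain NFC alone is not enough,
because the U+034F in form 3 blocks composition.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER (hypothetical constant name)

def search_key(text: str) -> str:
    """Fold text for loose matching: drop CGJ, then compose with NFC.

    Dropping CGJ is an application-level choice, not part of Unicode
    canonical equivalence.
    """
    return unicodedata.normalize("NFC", text.replace(CGJ, ""))

def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2

forms = [
    "\u00e4",         # 1) precomposed a+umlaut
    "a\u0308",        # 2) a + COMBINING DIAERESIS
    "a\u034f\u0308",  # 3) a + CGJ + COMBINING DIAERESIS
]

# All three fold to the same key, so one search term matches them all.
keys = {search_key(f) for f in forms}

# NFC by itself leaves form 3 alone: the CGJ sits between the base
# letter and the diaeresis and prevents them from composing.
nfc_only = unicodedata.normalize("NFC", forms[2])

# Every code point here is in the BMP, so each takes one 16-bit unit.
units = [utf16_units(f) for f in forms]
```

Whether stripping U+034F is acceptable is exactly the
application-dependent part: a grep over logs probably wants it, while
code that sorts or collates may rely on the joiner being preserved.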

-- 
William Ahern <wahern@barracudanetworks.com>

