- Subject: Re: question about Unicode
- From: David Given <dg@...>
- Date: Tue, 05 Dec 2006 16:33:48 +0000
Jerome Vuarand wrote:
> Also keep in mind that many Unicode characters are meant to be combined
> with others (`+E gives È for example), and as such you will have multiple
> unicode codepoints for a single grapheme (and a single character cell).
> Character offset in unicode strings don't reflect grapheme offset in the
> string graphical representation, even with fixed width fonts.
That's why I said 'grapheme clusters'...
In fact, when dealing with UTF-8 strings, all text should be normalised (to a
composed form such as NFC) so you *don't* get the issue you mention above.
Multiple-codepoint graphemes should be collapsed down into a single precomposed
character wherever possible (I believe this is possible for all Romance
languages, but I could be wrong).
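As a small illustration of that normalisation step, here is a sketch using Python's standard unicodedata module (Python is just a convenient choice here; nothing in the thread specifies it). It composes the `+E example from the quoted message and also shows a sequence that has no precomposed form, so normalisation alone cannot remove every multi-codepoint grapheme.

```python
import unicodedata

# 'E' followed by U+0300 COMBINING GRAVE ACCENT: two code points, one grapheme.
decomposed = "E\u0300"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00c8")            # True: U+00C8 is the precomposed form

# 'q' + U+0303 COMBINING TILDE has no precomposed code point,
# so NFC leaves it as two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0303")
print(len(no_precomposed))             # 2
```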
However, I'm slowly coming to the conclusion that I'm going to have to write
some custom code for dealing with all this, simply because what I'm really
interested in is physical character width (the number of terminal cells a
character occupies), which means I'm going to have to call wcwidth() a lot.
Sigh.
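To make "physical character width" concrete, here is a rough sketch in Python. Python's standard library has no wcwidth(), so the hypothetical display_width() below only approximates it from Unicode properties: zero cells for combining marks and format characters, two cells for wide or fullwidth East Asian characters, one cell otherwise. A real wcwidth() is locale-dependent and may disagree in edge cases.

```python
import unicodedata

def display_width(text):
    """Approximate the number of terminal cells text occupies.

    This is an assumption-laden stand-in for calling wcwidth() on each
    character, not a reimplementation of it.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters take no cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two cells
        else:
            width += 1
    return width

print(display_width("abc"))       # 3
print(display_width("E\u0300"))   # 1: base letter plus combining accent
print(display_width("日本"))       # 4: two wide characters
```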
<musing type="out loud">
So, I need:
- a function to wrap a paragraph of text.
- a function to draw a line of text, positioning the cursor in the right place.
- a function to step forwards or backwards through a string a certain number
of grapheme clusters.
I think that's all I need. I should be able to do the rest with just those
three, and conventional string munging tools. Hmm...
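The three functions listed above could be sketched roughly as follows. This is in Python rather than anything the thread specifies, and every name here (char_width, clusters, step, cursor_column, wrap) is hypothetical. Two simplifications are assumed: a grapheme cluster is treated as a base character plus any following zero-width characters, which is cruder than the full Unicode segmentation rules, and char_width() only approximates wcwidth().

```python
import bisect
import unicodedata

def char_width(ch):
    """Rough stand-in for wcwidth(): 0, 1 or 2 terminal cells."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def clusters(text):
    """Split text into approximate grapheme clusters."""
    out = []
    for ch in text:
        if out and char_width(ch) == 0:
            out[-1] += ch  # attach zero-width characters to the previous base
        else:
            out.append(ch)
    return out

def step(text, index, count):
    """Move a code point index by count clusters (negative = backwards),
    clamped to the ends of the string."""
    boundaries = [0]
    for cluster in clusters(text):
        boundaries.append(boundaries[-1] + len(cluster))
    pos = bisect.bisect_right(boundaries, index) - 1
    pos = max(0, min(len(boundaries) - 1, pos + count))
    return boundaries[pos]

def cursor_column(text, index):
    """Terminal column of the cursor when it sits at code point index."""
    return sum(char_width(ch) for ch in text[:index])

def wrap(paragraph, columns):
    """Greedy word wrap measured in terminal cells, not code points.
    A word wider than columns is left on a line of its own."""
    lines, current, used = [], [], 0
    for word in paragraph.split():
        width = sum(char_width(ch) for ch in word)
        needed = width if not current else used + 1 + width
        if current and needed > columns:
            lines.append(" ".join(current))
            current, used = [word], width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(" ".join(current))
    return lines

print(step("E\u0300a", 0, 1))   # 2: skips the letter and its accent together
print(wrap("aa bb cc", 5))      # ['aa bb', 'cc']
```

Drawing a line then reduces to writing the text and moving the cursor to cursor_column(text, index), with step() keeping that index on a cluster boundary.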
╭─┈David Given┈──McQ─╮ "There are two major products that come out of
│┈┈email@example.com┈┈┈┈│ Berkeley: LSD and Unix. We don't believe this to be
│┈(firstname.lastname@example.org)┈│ a coincidence." --- Jeremy S. Anderson