- Subject: Re: question about Unicode
- From: Klaus Ripke <paul-lua@...>
- Date: Tue, 5 Dec 2006 16:41:40 +0100
On Tue, Dec 05, 2006 at 10:23:24AM -0500, Jerome Vuarand wrote:
> David Given wrote:
> > I want to write a text editor, and so there'll be lots of
> > nasty fetch-the-character-from-column-Z issues. Assuming each
> > grapheme cluster renders into a single character cell ---
> > which I know is not strictly valid, as some clusters will
> > occupy multiple cells --- then dealing with character offsets
> > instead of byte offsets will make life much easier.
> Also keep in mind that many Unicode characters are meant to be combined with others (`+E gives È, for example), so a single grapheme (and a single character cell) can consist of multiple Unicode code points. Character offsets in Unicode strings don't reflect grapheme offsets in the string's graphical representation, even with fixed-width fonts.
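To make the quoted point concrete, here is a minimal sketch in Python (not Lua, and not slnunicode) showing that one visible character can be several code points and even more bytes. The helper approx_grapheme_count is a hypothetical name, and counting non-combining code points is only a rough stand-in for real grapheme cluster segmentation.

```python
import unicodedata

# "E" followed by U+0300 COMBINING GRAVE ACCENT renders as a single È.
decomposed = "E\u0300"
# NFC normalization folds the pair into the precomposed U+00C8.
composed = unicodedata.normalize("NFC", decomposed)

def approx_grapheme_count(s):
    """Rough approximation (an assumption, not UAX #29): count only the
    code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

code_points = len(decomposed)                 # code points in the string
utf8_bytes = len(decomposed.encode("utf-8"))  # bytes in its UTF-8 encoding
cells = approx_grapheme_count(decomposed)     # approximate character cells
```

So the same text has three different "lengths" depending on what you count, which is why an editor has to pick its unit deliberately.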
In slnunicode, the pattern matching functions return byte offsets
because these are faster to compute, faster to use as indices for sub
(you must use the provided ascii.sub to cut based on bytes),
and more reliable, especially compared with grapheme-based counting.
You can use utf8.len or grapheme.len to count code points or graphemes, respectively.
That's what the library would have to do internally anyway to return character offsets.
Hmm, I see it might be handy to provide a byte-based offset and length
as optional parameters for the multibyte-based versions of len().
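
A rough sketch of what such a len() could look like, written in Python rather than Lua for illustration. The function utf8_len and its offset and length parameters are hypothetical and are not slnunicode's API. It assumes the input is valid UTF-8 and that the byte range starts on a code point boundary.

```python
from typing import Optional

def utf8_len(data: bytes, offset: int = 0, length: Optional[int] = None) -> int:
    """Count code points in data[offset:offset+length]; both are in bytes.

    Hypothetical helper: assumes valid UTF-8 and a range that starts on
    a code point boundary.
    """
    end = len(data) if length is None else offset + length
    # Each code point has exactly one byte that is not a continuation
    # byte (continuation bytes look like 10xxxxxx), so count those.
    return sum(1 for b in data[offset:end] if b & 0xC0 != 0x80)

text = "na\u00efve caf\u00e9".encode("utf-8")  # "naïve café" as UTF-8 bytes

# A byte-based search, like the pattern matching functions, yields a byte offset.
byte_pos = text.find(b"caf")
# Turn that byte offset into a code point column by counting the prefix.
column = utf8_len(text, 0, byte_pos)
```

With this shape, a caller keeps using byte offsets for searching and cutting, and only pays for the count when a character column is actually needed.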