On 5-Dec-06, at 10:10 AM, David Given wrote:
> Glenn Maynard wrote:
> [...]
>> Out of curiosity, what use is that? In particular, if a function returns a character offset, and you want to use it to address the string, you have to convert it to a byte offset--which is an expensive operation.
>>
>> I've used UTF-8 for years, and I can't remember the last time I wanted a character offset. (Even if you use wide strings, you still don't get those directly, due to combining characters.)
>
> I want to write a text editor, and so there'll be lots of nasty fetch-the-character-from-column-Z issues. Assuming each grapheme cluster renders into a single character cell --- which I know is not strictly valid, as some clusters will occupy multiple cells --- then dealing with character offsets instead of byte offsets will make life much easier.

If you ignore combining characters, your text editor will be quite problematic.
The way ncurses handles this is to define character cells (for the displayed area) which contain attribute information plus up to some number of Unicode codes: a base character and some number of combining characters. (I can't remember off-hand if the default is five codes or five combining characters, but it's settable with a compile-time flag in any event.) This makes grapheme cells quite large, of course: something like 24 bytes each. But it does allow rapid access to a character cell.
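
To make the cell idea concrete, here is a rough sketch in Python. It is only a model of the approach, not the ncurses data structure: the names Cell, MAX_CODES and line_to_cells are made up for illustration, the limit of five codes per cell is an assumption, and "combining character" is approximated as any code point in a Unicode mark category.

```python
import unicodedata
from dataclasses import dataclass

MAX_CODES = 5  # assumed per-cell limit (base character included); hypothetical


@dataclass
class Cell:
    attr: int = 0       # display attributes (bold, colour, ...)
    codes: tuple = ()   # base character followed by its combining marks


def line_to_cells(text, attr=0):
    """Pack a string into cells: one base character plus trailing marks each."""
    cells = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and cells:
            # Attach the mark to the previous cell; silently drop it if full.
            if len(cells[-1].codes) < MAX_CODES:
                cells[-1].codes += (ch,)
        else:
            # A base character (or a stray leading mark) starts a new cell.
            cells.append(Cell(attr, (ch,)))
    return cells
```

With this layout, the cell in column N is just cells[N], at the cost of storing every cell at its maximum size in a real fixed-size implementation.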
Another alternative would be to make a grapheme-index to byte-offset vector for each display line.
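
A minimal sketch of such a vector, in Python, might look like the following. The function name is hypothetical, and grapheme boundaries are approximated as "a base character plus any following code points in a mark category"; proper segmentation has more rules than that.

```python
import unicodedata


def grapheme_byte_offsets(line):
    """Return the UTF-8 byte offset at which each approximate grapheme starts.

    A final sentinel entry holds the total byte length of the line, so the
    bytes of grapheme i are encoded[offsets[i]:offsets[i + 1]].
    """
    offsets = []
    pos = 0
    for ch in line:
        is_mark = unicodedata.category(ch).startswith("M")
        # Marks extend the previous grapheme; anything else starts a new one.
        if not is_mark or not offsets:
            offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    offsets.append(pos)  # sentinel: end of line
    return offsets
```

The vector costs one integer per grapheme rather than a full cell, but it has to be rebuilt (or patched) whenever the line is edited.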
In the Unix98 wide char model, a code-position can occupy any integer number of column positions, although it is typically 0, 1 or 2, the last being used for double-width East Asian characters. There is a function wcswidth(3) which can be used to return the width in columns of a wide char string. (Also see wcwidth(3).) ncurses uses this interface to figure out grapheme boundaries, which is not 100% reliable but sometimes works. Unfortunately, there is no guarantee that the value returned by wcswidth() is the same as the number of columns used to render the grapheme, particularly if the display is on another machine.
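
For illustration, the 0/1/2 column model can be approximated in Python from the Unicode character database. This is not wcswidth() itself and will not agree with every C library or terminal; the function name is made up, and treating the Mn, Me and Cf categories as zero-width and the East Asian Wide/Fullwidth classes as double-width is an assumption of this sketch. Control characters are not handled specially.

```python
import unicodedata


def approx_columns(text):
    """Estimate how many terminal columns a string occupies."""
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # non-spacing marks and format characters: zero columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # double-width East Asian characters
        else:
            columns += 1
    return columns
```

As with the real interface, the result is only an estimate of what the terminal at the other end will actually draw.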
This also ignores the issues of right-to-left and top-to-bottom renderings, and a few Unicode corner cases.