Re: question about Unicode

lua-l archive

Subject: Re: question about Unicode
From: Rici Lake <lua@...>
Date: Tue, 5 Dec 2006 11:37:20 -0500


On 5-Dec-06, at 10:10 AM, David Given wrote:

Glenn Maynard wrote:
[...]
Out of curiosity, what use is that?  In particular, if a function
returns a character offset, and you want to use it to address the string, you have to convert it to a byte offset--which is an expensive operation.
I've used UTF-8 for years, and I can't remember the last time I wanted
a character offset.  (Even if you use wide strings, you still don't
get those directly, due to combining characters.)
I want to write a text editor, and so there'll be lots of nasty
fetch-the-character-from-column-Z issues. Assuming each grapheme cluster renders into a single character cell --- which I know is not strictly valid, as some clusters will occupy multiple cells --- then dealing with character
offsets instead of byte offsets will make life much easier.
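
To illustrate the conversion cost mentioned in the quoted text, here is a minimal Python sketch. The function name char_to_byte_offset is hypothetical, and "character" here means a single code point, not a grapheme cluster. Without an auxiliary index, finding the byte offset of the N-th code point in a UTF-8 string means scanning from the start:

```python
def char_to_byte_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset of the char_index-th code point in UTF-8 data.

    Hypothetical helper for illustration; assumes data is valid UTF-8.
    """
    seen = 0
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return pos
            seen += 1
    if seen == char_index:
        # Offset one past the last code point (end of string).
        return len(data)
    raise IndexError("character index out of range")
```

Each lookup is linear in the length of the string, which is why repeated character-offset addressing gets expensive on long lines.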

If you ignore combining characters, your text editor will be quite problematic.

The way ncurses handles this is to define character cells (for the displayed area) which contain attribute information plus up to some number of Unicode codes: a base character and some number of combining characters. (I can't remember off-hand if the default is five codes or five combining characters, but it's settable with a compile-time flag in any event.) This makes grapheme cells quite large, of course: something like 24 bytes each. But it does allow rapid access to a character cell.
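
The cell layout described above might look roughly like the following Python sketch. This is not the ncurses data structure. The names Cell, fill_cells and MAX_COMBINING are made up for illustration, the limit of 5 is a placeholder, and "combining character" is approximated as any code point with a non-zero canonical combining class.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List

MAX_COMBINING = 5  # placeholder limit; an assumption, not the ncurses value


@dataclass
class Cell:
    attr: int = 0                      # display attributes (colour, bold, ...)
    base: str = " "                    # base character
    combining: List[str] = field(default_factory=list)  # attached marks

    def text(self) -> str:
        return self.base + "".join(self.combining)


def fill_cells(line: str, attr: int = 0) -> List[Cell]:
    """Split a line into cells: one base character plus its combining marks."""
    cells: List[Cell] = []
    for ch in line:
        if unicodedata.combining(ch) != 0 and cells:
            # Attach the mark to the previous cell; marks beyond the
            # limit are silently dropped in this sketch.
            if len(cells[-1].combining) < MAX_COMBINING:
                cells[-1].combining.append(ch)
        else:
            cells.append(Cell(attr=attr, base=ch))
    return cells
```

Indexing cells[col] is then constant-time, at the price of a fairly fat record per cell.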

Another alternative would be to make a grapheme-index to byte-offset vector for each display line.
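
A minimal sketch of such a vector in Python follows. The function name grapheme_byte_offsets is hypothetical, and grapheme boundaries are approximated as "a base character followed by code points with a non-zero combining class", which is cruder than full Unicode grapheme cluster segmentation.

```python
import unicodedata
from typing import List


def grapheme_byte_offsets(line: str) -> List[int]:
    """Return the UTF-8 byte offset at which each approximate grapheme starts."""
    offsets: List[int] = []
    byte_pos = 0
    for ch in line:
        # A combining mark stays with the previous grapheme; anything else
        # (or a stray mark at the start of the line) opens a new one.
        if unicodedata.combining(ch) == 0 or not offsets:
            offsets.append(byte_pos)
        byte_pos += len(ch.encode("utf-8"))
    return offsets
```

With offsets in hand, the bytes for column k of the encoded line run from offsets[k] up to offsets[k + 1] (or to the end of the line for the last column), so the lookup is a constant-time index. The vector only has to be rebuilt for lines that change.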

In the unix98 wide char model, a code-position can occupy any integer number of column positions, although it is typically 0, 1 or 2, the last being used for double-width East Asian characters. There is a function wcswidth(3) which can be used to return the width in columns of a wide char string. (Also see wcwidth(3).) ncurses uses this interface to figure out grapheme boundaries, which is not 100% reliable but sometimes works. Unfortunately, there is no guarantee that the value returned by wcswidth() is the same as the number of columns used to render the grapheme, particularly if the display is on another machine.
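
The 0/1/2 column model can be approximated in Python with the standard unicodedata module. This is a rough stand-in, not a binding to wcwidth(3) or wcswidth(3). The names approx_wcwidth and approx_wcswidth are invented here, control characters are not treated specially, and zero-width code points with combining class 0 are counted as one column.

```python
import unicodedata


def approx_wcwidth(ch: str) -> int:
    """Approximate column width of one code point: 0, 1 or 2."""
    if unicodedata.combining(ch) != 0:
        return 0  # combining mark: drawn on top of the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide or fullwidth East Asian character
    return 1


def approx_wcswidth(s: str) -> int:
    """Approximate column width of a string as the sum of its code points."""
    return sum(approx_wcwidth(ch) for ch in s)
```

As with the C functions, the result is only an estimate of what a given terminal will actually draw.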

This also ignores the issues of right-to-left and top-to-bottom renderings, and a few Unicode corner cases.
