Klaus Ripke <paul-lua@malete.org> writes:

> On Thu, Apr 26, 2007 at 01:47:02PM +0200, David Kastrup wrote:
>> The only documentation I have been able to find is in "unitest", and
>> it is very, very sketchy.
> or let's say terse ;)
> "UTF-8 operates on UTF-8 sequences as of RFC 3629".
> Even "format ... uses character counts for precision in %s".
> The grapheme module counts grapheme clusters.
>
>> --	NOTE: find positions are in bytes for all ctypes!
>> --	use ascii.sub to cut found ranges!
> right, utf8.find _returning_ byte positions has a special note,
> exactly because utf8.sub does NOT work with byte counts.

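That does not really sound like a good idea: find hands back byte
positions, yet sub expects character counts, so the result of one
cannot be passed straight to the other.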
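
To make the mismatch concrete, here is a minimal sketch in plain Lua.
It does not use the unicode module at all; char_index is a
hypothetical helper of mine, and the sample string is spelled with
decimal byte escapes.

```lua
-- "grün" in UTF-8: the "ü" occupies two bytes (195, 188).
local s = "gr\195\188n"

-- Hypothetical helper: convert a byte position into a character index
-- by counting the bytes that are not UTF-8 continuation bytes (128..191).
local function char_index(str, bytepos)
  local n = 0
  for i = 1, bytepos do
    local b = str:byte(i)
    if b < 128 or b > 191 then n = n + 1 end
  end
  return n
end

-- Plain (non-pattern) search; the standard string.find reports bytes.
local first, last = string.find(s, "n", 1, true)
print(first, last)                 -- 5  5   (byte positions)
print(char_index(s, first))        -- 4      (character index)
print(string.sub(s, first, last))  -- n      (byte-based sub)
```

Byte position 5 is only meaningful to a byte-based sub; a
character-based sub would need 4 for the same letter.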

>> It does not exactly sound like character-based indexing to me.
> sorry if this is confusing.

"Confusing" is one thing, but this combination does not appear to make
much sense at all.

> Would be great if somebody would write some serious documentation.
> However, a quick look at the test cases reveals not only what the
> module is supposed to do, but what it actually does.

It may also be considered somewhat counterintuitive that the call

unicode.utf8.byte(unicode.utf8.char(5000))

returns 5000, something which naive people like myself would not
exactly choose to call a "byte".
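
For comparison, the actual bytes behind code point 5000 can be spelled
out in plain Lua.  This is only a sketch that does not touch the
unicode module; encode3 is a hypothetical helper and assumes a code
point in the three-byte range (0x800 to 0xFFFF, surrogates excluded).

```lua
-- Hypothetical helper: encode a code point from the three-byte range
-- as UTF-8 (layout 1110xxxx 10xxxxxx 10xxxxxx).
local function encode3(cp)
  return string.char(
    224 + math.floor(cp / 4096),
    128 + math.floor(cp / 64) % 64,
    128 + cp % 64)
end

local s = encode3(5000)
local b1, b2, b3 = s:byte(1, -1)
print(#s)          -- 3
print(b1, b2, b3)  -- 225  142  136

-- Decoding the three bytes yields the code point again.
local decoded = (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
print(decoded)     -- 5000
```

So the string holds three bytes, none of which is 5000; the value
5000 is what comes out after decoding, that is, a code point.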

I also see no way to treat illegal bytes or sequences any differently
from legal characters that have the respective code point.
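
A caller who needs that distinction would presumably have to check
the string separately.  The following plain Lua sketch is one way to
do so; first_invalid is a hypothetical helper, not part of the unicode
module, and it is deliberately incomplete: it checks lead bytes and
continuation bytes only, so overlong three- and four-byte forms and
encoded surrogates still slip through.

```lua
-- Hypothetical helper: return the byte position where the first
-- malformed UTF-8 sequence starts, or nil if none is found.
local function first_invalid(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len = 2
    elseif b >= 0xE0 and b <= 0xEF then len = 3
    elseif b >= 0xF0 and b <= 0xF4 then len = 4
    else return i end  -- stray continuation byte, or a byte unused in UTF-8
    for k = 1, len - 1 do
      local c = s:byte(i + k)  -- nil when the string ends too early
      if not c or c < 0x80 or c > 0xBF then return i end
    end
    i = i + len
  end
  return nil
end

print(first_invalid("gr\195\188n"))  -- nil (well-formed)
print(first_invalid("a\255b"))       -- 2   (255 never occurs in UTF-8)
print(first_invalid("\195"))         -- 1   (truncated two-byte sequence)
```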

One thing worth noting in the documentation is that the unicode
library does not provide a string type of its own.  Instead, it
provides functions that _interpret_ a standard Lua string in a certain
manner.
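
The point can be sketched in plain Lua without the unicode module:
the value below is an ordinary string, and "bytes" versus "characters"
is purely a matter of which function looks at it.  The character count
here uses a gsub pattern of my own choosing as a stand-in, not the
library's implementation.

```lua
local s = "gr\195\188n"  -- "grün" as UTF-8 bytes

print(type(s))  -- string (no separate Unicode string type)
print(#s)       -- 5 (byte view)

-- Character view: count the bytes that are not continuation bytes.
local _, nchars = s:gsub("[^\128-\191]", "")
print(nchars)   -- 4
```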

-- 
David Kastrup