This is shaping up to be a really exciting release! I've been playing
with the utf8 library:

$ lua-5.3.0-work2/src/lua
Lua 5.3.0 (work2)  Copyright (C) 1994-2014, PUC-Rio
> utf8.len('')
stdin:1: bad argument #1 to 'len' (initial position out of string)
stack traceback:
	[C]: in function 'len'
	stdin:1: in main chunk
	[C]: in ?

That's certainly a surprise. Is it intentional that this raises an
error, instead of returning 0 for the empty string like string.len?

> utf8.offset('a', 0, 3)
stdin:1: bad argument #3 to 'offset' (position out of range)
stack traceback:
	[C]: in function 'offset'
	stdin:1: in main chunk
	[C]: in ?
> utf8.offset('a', 0, 2)

I would have expected start positions of 2 and 3 to both raise errors,
but only 3 does.

There is a typo in the manual for utf8.len: 'sufix' should be 'suffix'.

As an exercise, I tried to write a function using the new utf8 library
that would scan a string and replace any invalid UTF-8 sequences with a
substitution character.  I failed to come up with any solution that did
not either create many small strings (one for each codepoint in the
target string) or involve scanning the string byte-by-byte (which could
as easily be done with Lua 5.2). How much easier it would be if
utf8.len, upon encountering invalid UTF-8, returned nil plus the
byte offset of the offending sequence!
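
To make that wish concrete, here is a minimal sketch of the replacement
function under that assumption. It is not something that works with work2
as posted: it assumes utf8.len(s, i) returns nil plus the byte position of
the first invalid sequence, and also that it accepts the empty remainder of
a string without raising an error. The name sanitize and the default
substitution character (U+FFFD) are my own choices for illustration.

```lua
-- Sketch only. ASSUMPTION: utf8.len(s, i) returns nil plus the byte
-- position of the first invalid sequence instead of raising an error.
-- 'sanitize' is a hypothetical name, not part of the utf8 library.
local function sanitize(s, subst)
  subst = subst or "\xEF\xBF\xBD"   -- U+FFFD REPLACEMENT CHARACTER
  local parts = {}                  -- valid runs and substitutions
  local pos = 1
  while pos <= #s do
    local n, bad = utf8.len(s, pos)
    if n then
      -- The rest of the string is valid: copy it in one piece and stop.
      parts[#parts + 1] = s:sub(pos)
      break
    end
    -- Copy the valid run before the offending byte, if there is one.
    if bad > pos then
      parts[#parts + 1] = s:sub(pos, bad - 1)
    end
    -- Replace the offending byte and resume scanning just after it.
    parts[#parts + 1] = subst
    pos = bad + 1
  end
  return table.concat(parts)
end

print(sanitize("a\xFFb", "?"))
```

This creates one substring per valid run rather than one per codepoint,
and the byte-level scanning stays inside utf8.len. Each invalid byte gets
its own substitution character, so a truncated multi-byte sequence may
yield more than one.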