
On Fri, Mar 21, 2014 at 11:35 PM, Coroutines wrote:
> On Fri, Mar 21, 2014 at 1:44 PM, Luiz Henrique de Figueiredo wrote:
>> Lua 5.3.0 (work2) is now available for testing at
>> MD5     52bd13d0b40f637bc388a133b9bb8771  -
>> SHA1    e52ea0acf4b2d7bf042f48bd01dddc149d517184  -
>> This is a work version. An updated reference manual is included but
>> all details may change in the final version. See
>> The main change in Lua 5.3.0 is the introduction of integers.
>> For other changes, see
>> The complete diffs from work1 are at
>> Enjoy. All feedback welcome. Thanks.
>> --lhf
> After having a look at the utf8 library in 5.3, I like what I see -- I
> only wish for the addition of a utf8.sub() and a utf8.strip()
> 1) utf8.sub() would be a utf8-aware version of string.sub; 'nuff said
> 2) utf8.strip() would remove the greater-than-single-byte codepoints
> from a string -- it is possible to do this with the
> generator you guys provided, but in the interest of speed I think this
> should be in C
> I really like utf8.offset(), that's something I should have added to
> my own project:
> Please do add utf8.sub() to be identical to string.sub()'s behavior
> (apart from being utf8-aware).  I imagine a lot of people would wind
> up rolling their own with varying subtle differences (like some
> accepting negative indices and some not).  If you *really* want to
> impress, make generator/iterator that can cycle through a utf8 string
> backward -- I would love that!  ((I also want a string.find() that
> iterates backward))
> Anyway, hope I was helpful :-)
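
For what it's worth, a backward iterator can probably be sketched in plain Lua on top of utf8.offset().  The following is only an illustration: the name rcodes is made up, it assumes the utf8.offset(s, n, i) and utf8.codepoint() behaviour described in the released Lua 5.3 manual (work2 may differ), and it assumes the input is well-formed.

```lua
-- Hypothetical helper (not part of the utf8 library): iterate over the
-- characters of s from last to first, yielding (byte position, codepoint).
-- Assumes s is well-formed UTF-8.
local function rcodes(s)
  local pos = #s + 1                  -- one past the last byte
  return function()
    if pos <= 1 then return nil end   -- already consumed the first character
    pos = utf8.offset(s, -1, pos)     -- start of the character before pos
    return pos, utf8.codepoint(s, pos)
  end
end

-- Usage: walk "hél" backward.
for p, c in rcodes("h\xC3\xA9l") do
  print(p, c)
end
```

Each step costs one utf8.offset() call, so the whole walk stays linear in the length of the string.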

I think I'll take the utf8.strip() proposal back -- it looks like this
utf8 library doesn't check that strings are valid utf8.  It makes sure
the byte sequence fits the form of a utf8 codepoint, but it doesn't
take into account things like unused codepoints or overlong encodings.
 utf8.strip() might lead people to believe it will remove all utf8,
including sequences that fit the byte format but are not valid
codepoints.  Bad idea, so I'll forget that.
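
To make the concern concrete, here is a small sketch that uses the byte-shape pattern quoted in the PS of this message.  The helper name matches_as_one_char is made up for the illustration; the point is only that a pattern of this shape accepts sequences that are not valid UTF-8.

```lua
-- Byte-shape pattern as quoted in this message: one lead byte followed by
-- any number of continuation bytes.
local charpatt = "[\0-\x7F\xC2-\xF4][\x80-\xBF]*"

-- Hypothetical helper: true when the pattern consumes all of s as a
-- single "character".
local function matches_as_one_char(s)
  return s:match("^" .. charpatt .. "$") ~= nil
end

print(matches_as_one_char("\xC3\xA9"))      --> true   (U+00E9, well-formed)
print(matches_as_one_char("\xE0\x80\xAF"))  --> true   (overlong encoding of "/")
print(matches_as_one_char("\xED\xA0\x80"))  --> true   (surrogate U+D800)
print(matches_as_one_char("\xC3\xA9\x80"))  --> true   (one continuation byte too many)
print(matches_as_one_char("\x80"))          --> false  (lone continuation byte)
```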

I still want a utf8.sub().  Also, I think it should be noted in the
manual that utf8.charpatt matches byte sequences that have the shape
of a utf8 character, but does not check for what I mentioned above
(unused codepoints, overlong encodings, unexpected continuation bytes,
etc.).  It is a practical
utf8 library but not a foolproof solution (no disrespect meant).  I am
very happy to see this added :-)
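
In the meantime, a utf8.sub() can be approximated in Lua with utf8.len() and utf8.offset().  This is a rough sketch, not a proposal for the actual implementation: the name utf8_sub is made up, and it assumes the utf8.len(s) and utf8.offset(s, n) behaviour described in the released Lua 5.3 manual, which may not match work2.  Index handling is meant to mirror string.sub(), counting characters instead of bytes.

```lua
-- Hypothetical utf8-aware counterpart of string.sub: i and j count
-- characters, and negative values count from the end of the string.
local function utf8_sub(s, i, j)
  j = j or -1
  local len = utf8.len(s)
  if not len then error("invalid UTF-8 string", 2) end

  -- Normalise the indices the way string.sub does.
  if i < 0 then i = math.max(len + i + 1, 1) elseif i == 0 then i = 1 end
  if j < 0 then j = len + j + 1 elseif j > len then j = len end
  if i > j then return "" end

  local first = utf8.offset(s, i)      -- byte where character i starts
  local after = utf8.offset(s, j + 1)  -- byte just past character j
  return s:sub(first, after - 1)
end

local s = "h\xC3\xA9llo"               -- "héllo": 5 characters, 6 bytes
print(utf8_sub(s, 1, 2))               --> hé
print(utf8_sub(s, -3))                 --> llo
print(utf8_sub(s, 4, 100))             --> lo
```

Since j is clamped to the length, j + 1 is at most one past the last character, which is the one position past the end that utf8.offset() still accepts.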

PS: I think charpatt could be improved:
"[\0-\x7F\xC2-\xF4][\x80-\xBF]*" ->