Hi Patrick!

I did quite some experimenting with tiny UTF-8 handling here:

Mainly concerning myself with getting the length.

For what it's worth, I ended up with this to count string length:

         /* UTF-8 count */
         case LUA_TSTRING: {
           unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
           unsigned char *q = p + tsvalue(rb)->len;
           size_t count = 0;
           /* count every byte that is not a continuation byte (10xxxxxx) */
           while (p < q) if ((*p++ & 0xC0) ^ 0x80) count++;
           setnvalue(ra, cast_num(count));
           break;
         }

The rationale is spread out across this mailing list. Basically, corrupt UTF-8 should be allowed to have undefined results, so the loop does no validation.
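
In case anyone wants to try the idea outside the VM, here is a minimal standalone sketch of the same counting trick. The function name utf8_count is made up for this example and is not part of Lua. It assumes well-formed UTF-8: a byte is counted unless its top two bits are 10, which marks a continuation byte, so each ASCII byte and each lead byte of a multi-byte sequence adds one. Malformed input is not rejected and just yields some count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Lua API.
   Counts UTF-8 code points in the first len bytes of s by counting
   every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
   Assumes well-formed UTF-8; no validation is done. */
static size_t utf8_count(const char *s, size_t len)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if ((c & 0xC0) != 0x80)   /* ASCII byte or lead byte */
            count++;
    }
    return count;
}
```

The explicit length argument matters because Lua strings may contain embedded zero bytes, so strlen would not be a safe substitute here.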


Patrick Rapin wrote:
Essentially as an exercise, I tried to write the smallest possible
UTF-8 encoder in Lua [1].
Compared to a naive implementation like in [2], it is around 2.6 times shorter.
Still, I am wondering if the code could be further shortened (not
counting space removal).

[2]  (and that implementation doesn't
handle 4-byte codes)