Re: Small UTF-8 encoder

Subject: Re: Small UTF-8 encoder
From: "H. Diedrich" <hd2010@ ... >
Date: Tue, 19 Jun 2012 18:46:39 +0200

Hi Patrick!

I did quite a bit of experimenting with tiny UTF-8 handling here:

http://eonblast.com/trucount/
http://www.eonblast.com/trucount/lua-count-patch-0.1.tgz

I was mainly concerned with getting the length.

For what it's worth, I ended up with this to count the string length:

         /* UTF-8 count */
         case LUA_TSTRING: {
           unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
           unsigned char *q = p + tsvalue(rb)->len;
           size_t count = 0;
           /* count every byte that is not a continuation byte (10xxxxxx) */
           while(p < q) if((*p++ & 0xC0) ^ 0x80) count++;
           setnvalue(ra, cast_num(count));
           break;
         }

The rationale is spread out across this mailing list. Basically, corrupt UTF-8 should be allowed to have undefined results.
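
For anyone who wants to try the counting trick outside the Lua VM, here is a minimal standalone sketch of the same idea. The function name utf8_count and its signature are my own choice for illustration and are not part of the patch. It counts every byte that is not a continuation byte, so well-formed UTF-8 gives the number of code points, and malformed input just gives some number.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions, not from the
   patch): count the code points in a UTF-8 buffer by counting every byte
   that is NOT a continuation byte of the form 10xxxxxx.
   The input is not validated; corrupt UTF-8 yields an unspecified count. */
static size_t utf8_count(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *q = p + len;
    size_t count = 0;

    while (p < q) {
        /* ASCII bytes (0xxxxxxx) and lead bytes (11xxxxxx) pass this test */
        if ((*p++ & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Called on the 6 bytes of "h\xC3\xA9llo" it should return 5, because the single continuation byte 0xA9 is skipped.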

Best,
Henning

Patrick Rapin wrote:

Essentially as an exercise, I tried to write the smallest possible
UTF-8 encoder in Lua [1].
Compared to a naive implementation like in [2], it is around 2.6 times shorter.
Still, I am wondering if the code could be further shortened (not
counting space removal).

[1] https://gist.github.com/b0ae016da7b8f0b221ff
[2] http://lwn.net/Articles/493167/ (and that implementation doesn't
handle 4-byte codes)