What an incredibly clean inheritance the English-Latin alphabet is. 

And in that spirit, here is my current proposal. Thanks to all who contributed. For the 'count' operator, I am reverting to


      unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
      unsigned char *q = p + tsvalue(rb)->len;
      size_t count = 0;
      /* a lead byte is anything that is not a 10xxxxxx continuation byte */
      while (p < q) if ((*p++ & 0xC0) ^ 0x80) count++; /* count all lead bytes */


It turned out to be beyond a reasonable source line count to try to identify the /number of broken codes/. And to what end? That one-liner algorithm above sure looks like Lua to me.

The above works fabulously on legit UTF-8 and is undefined on illegal codes. That /is/ thanks to the grace of UTF-8, which /does/ in fact exist.
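For illustration, here is the same counting idea as a tiny standalone program (the function name |utf8_count| and the test string are mine, just for demonstration):

      #include <stdio.h>
      #include <string.h>

      /* Count UTF-8 lead bytes: on legal UTF-8, every byte that is NOT
         a 10xxxxxx continuation byte starts a new code point. */
      static size_t utf8_count(const unsigned char *p, size_t len)
      {
          const unsigned char *q = p + len;
          size_t count = 0;
          while (p < q)
              if ((*p++ & 0xC0) ^ 0x80) count++;
          return count;
      }

      int main(void)
      {
          const char *s = "h\xC3\xA9llo";  /* "héllo": 6 bytes, 5 code points */
          printf("%zu\n", utf8_count((const unsigned char *)s, strlen(s)));  /* prints 5 */
          return 0;
      }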


Digging into the UTF-8 specs was quite interesting but ultimately as sobering as was to be expected. Who cares to witness: Unicode and UTF-8 - I finally understand, after all these years, that there /is no/ such thing as an unambiguous string length any more. :-D I had already suspected that. Well. One can still try. You've got to love this normalization stuff, but --- it's hard to count. Characters collapsing into one. Multiple binary ways to express the same code. Space discrimination against non-Western scripts (for Cyrillic, the byte count simply doubles compared to the encodings in use before. With no gain in entropy, of course).
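To make the counting ambiguity concrete (the byte values are standard Unicode/UTF-8; the little demo is mine): precomposed U+00E9 and decomposed 'e' + combining U+0301 render as the same "é" but count differently.

      #include <stdio.h>

      int main(void)
      {
          /* The same rendered character "é", in two encodings: */
          const char *forms[] = {
              "\xC3\xA9",    /* NFC: precomposed U+00E9 -> C3 A9, 1 code point */
              "e\xCC\x81"    /* NFD: 'e' + U+0301 -> 65 CC 81, 2 code points   */
          };
          for (int i = 0; i < 2; i++) {
              const unsigned char *p = (const unsigned char *)forms[i];
              size_t count = 0;
              while (*p) if ((*p++ & 0xC0) ^ 0x80) count++;
              printf("form %d: %zu code point(s)\n", i + 1, count);  /* 1, then 2 */
          }
          return 0;
      }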

At one point I had realized that the upper ranges of Eero's and Sean's proposals don't match. Eero throws away everything with a first byte > 0xF4; Sean even allows 5- and 6-byte sequences.

So I dug up something closer to a spec [1] and consulted C, PHP and Python.

The bit logic of UTF-8 is quite amazing, especially the implications [2]; I loathe it much less now. But when I tried to verify the length measurements with a third-party C library function -- wow, what a struggle. Not only could my text editors not agree, neither could the cited programming languages.

So meanwhile, for the code below, I reverted to bit operators but basically returned to the structure initially proposed.

The main improvement of the code below is the handling of the highest values. The hard limit for UTF-8 is 0x10FFFF, as we've found. That is taken very seriously, it seems, to avert exploits, and the code below now accounts for it. It also, as before, rejects the doubly presentable range U+0000 to U+007F, whose two-byte (overlong) presentation occupies a separate, small illegal range of lead bytes, 0xC0 and 0xC1. The code below should be complete in that respect.
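For reference, these are the well-formed byte ranges of RFC 3629, which [1] and [6] paraphrase. The code below enforces the lead-byte ranges and the 0xF4 second-byte cap, but not the E0/ED/F0 second-byte restrictions:

      Code points          Byte 1   Byte 2   Byte 3   Byte 4
      U+0000..U+007F       00..7F
      U+0080..U+07FF       C2..DF   80..BF
      U+0800..U+0FFF       E0       A0..BF   80..BF
      U+1000..U+CFFF       E1..EC   80..BF   80..BF
      U+D000..U+D7FF       ED       80..9F   80..BF
      U+E000..U+FFFF       EE..EF   80..BF   80..BF
      U+10000..U+3FFFF     F0       90..BF   80..BF   80..BF
      U+40000..U+FFFFF     F1..F3   80..BF   80..BF   80..BF
      U+100000..U+10FFFF   F4       80..8F   80..BF   80..BF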


      unsigned char *c = (unsigned char *)getstr(rawtsvalue(rb));
      unsigned char *q = c + tsvalue(rb)->len;
      size_t count = 0;

      /* Lua strings carry a terminating '\0', so the short-circuited
         c[1]..c[3] tests below cannot read past the end of the buffer. */
      while (c < q) {
        if (c[0] <= 0x7F) c++;  /* ASCII */
        else if ((c[0] >= 0xC2) && (c[0] <= 0xDF) && ((c[1] & 0xC0) == 0x80)) c += 2;  /* 2 bytes; C0/C1 would be overlong */
        else if (((c[0] & 0xF0) == 0xE0) && ((c[1] & 0xC0) == 0x80) && ((c[2] & 0xC0) == 0x80)) c += 3;  /* 3 bytes */
        else if (((c[0] & 0xFC) == 0xF0) && ((c[1] & 0xC0) == 0x80) && ((c[2] & 0xC0) == 0x80) && ((c[3] & 0xC0) == 0x80)) c += 4;  /* 4 bytes, F0..F3 only */
        else if ((c[0] == 0xF4) && ((c[1] & 0xF0) == 0x80) && ((c[2] & 0xC0) == 0x80) && ((c[3] & 0xC0) == 0x80)) c += 4;  /* F4: second byte 80..8F caps at 0x10FFFF */
        else { count--; c++; }  /* invalid byte: skip it, don't count it */
        count++;
      }


The naive hope was that |((c[1] & 0xC0) == 0x80)| could be faster than |(c[1] >= 0x80 && c[1] <= 0xBF)| on some platforms.

In any case, written this way, it's easier to verify against the UTF-8 spec bit patterns [6].
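The two tests are at least exactly equivalent for every byte value, which a brute-force check confirms (a throwaway sketch of mine):

      #include <assert.h>

      int main(void)
      {
          /* (b & 0xC0) == 0x80 holds exactly for the continuation-byte
             range 0x80..0xBF, i.e. for the bit pattern 10xxxxxx. */
          for (int b = 0; b < 256; b++)
              assert(((b & 0xC0) == 0x80) == (b >= 0x80 && b <= 0xBF));
          return 0;
      }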

But as it turned out, there are more possibilities to create invalid character codes. For example, there are plenty of ways to encode a 0. And in general, any code that fits in 1 byte can also be (illegally) expressed with 2, any that fits in 1 or 2 with 3, and any that fits in 1, 2 or 3 with 4. That would leave only the option of actually assembling the code point and checking whether the shortest form was used ... and for what, to find the exact count of illegal UTF-8 codes. I gave that up. Which means that any attempt to identify invalid codes becomes useless, and therefore all that's needed is

      while (p < q) if ((*p++ & 0xC0) ^ 0x80) count++; /* count all lead bytes */

which is functionally the same as what [4] proposed.
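To see why I gave up, consider U+0000 (the byte sequences follow from the encoding scheme; the demo itself is mine). A naive decoder reads all four sequences below as 0, and the lead-byte counter counts each of them as one character:

      #include <stdio.h>

      int main(void)
      {
          /* Presentations of U+0000; only the first is legal UTF-8. */
          const unsigned char seqs[4][4] = {
              { 0x00 },                   /* 1 byte: the only legal form */
              { 0xC0, 0x80 },             /* 2-byte overlong             */
              { 0xE0, 0x80, 0x80 },       /* 3-byte overlong             */
              { 0xF0, 0x80, 0x80, 0x80 }  /* 4-byte overlong             */
          };
          const size_t lens[4] = { 1, 2, 3, 4 };
          for (int i = 0; i < 4; i++) {
              size_t count = 0;
              for (size_t j = 0; j < lens[i]; j++)
                  if ((seqs[i][j] & 0xC0) ^ 0x80) count++;
              printf("sequence %d: counted as %zu\n", i + 1, count);  /* always 1 */
          }
          return 0;
      }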

And coming back to my request for verification of the UTF-8 counts - the answer now seems to be 37,074. PHP says so, and so do the code above and the one-liner at the top, for the sample at [5].

I'd still be interested in verification by your text editor or scripting on a different platform.

In case you want to try, you can get the samples with
wget https://gist.github.com/raw/768309/54a8c4f8169948852f83dcc491aa7a43dc54ec54/utf8-test.txt
wget http://www.columbia.edu/kermit/utf8.html
wget http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
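If anyone wants to cross-check from C directly, a throwaway counter program might look like this (my sketch, same lead-byte rule as above):

      #include <stdio.h>

      int main(int argc, char **argv)
      {
          /* Count UTF-8 lead bytes in the file named on the command line. */
          FILE *f;
          int b;
          size_t count = 0;
          if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL) {
              fprintf(stderr, "usage: %s file\n", argv[0]);
              return 1;
          }
          while ((b = fgetc(f)) != EOF)
              if ((b & 0xC0) ^ 0x80) count++;  /* not a continuation byte */
          fclose(f);
          printf("utf-8 length of %s: %zu\n", argv[1], count);
          return 0;
      }

Compiled and run against utf8-test.txt, it should print 37074 if the counts below are right.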

Here are some results (only utf8-test.txt is 100% legal -- and maybe not even that one, for one code).

PHP (Linux)
utf-8 length of utf8-test.txt: 37074
utf-8 length of utf8.html: 50966
utf-8 length of UTF-8-test.txt: 20073

Python (Linux)
File: utf8-test.txt Len: 37074
File: utf8.html Len: 50966
File: UTF-8-test.txt Len: error while decoding.

Python (Mac OS X)
File: utf8-test.txt Len: 37077
File: utf8.html Len: *50998*
File: UTF-8-test.txt Len: error while decoding.

Thanks again for the input,
Henning

[1] Wikipedia on UTF-8: http://en.wikipedia.org/wiki/Utf-8
[2] Implications: http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
[3] Python and UTF-8: http://evanjones.ca/python-utf8.html
[4] http://lua-users.org/wiki/LuaUnicode
[5] https://gist.github.com/raw/768309/54a8c4f8169948852f83dcc491aa7a43dc54ec54/utf8-test.txt
[6] http://en.wikipedia.org/wiki/UTF-8#Description