I am using this text to test UTF-8.
Does somebody know how to get an authoritative count of how many characters that should actually be? Minus possible invalid ones, if there should be any in that text?
I am using this primitive counting mechanism. Inspired by . Proposals to improve it are welcome.
Does size_t make sense?
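/* UTF-8 estimate */
unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
unsigned char *q = p + tsvalue(rb)->len;
size_t count = 0;
while (p < q) {
    /* count only bytes that can start a character:
       ASCII (0-127) or a multi-byte lead byte (194-244) */
    if (*p <= 127 || (*p >= 194 && *p <= 244)) /* this can be reversed */
        count++;
    p++;
}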
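For experimenting outside the Lua sources, the same test can be put into a standalone function. The sketch below is not taken from any library; the names utf8_count_lead and utf8_count_noncont are made up for illustration. The second function is one possible reading of "reversed": count every byte that is not a continuation byte. Note that the two tests disagree on bytes that can never occur in valid UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF).

```c
#include <assert.h>
#include <stddef.h>

/* Count bytes that can start a character: ASCII (0x00-0x7F) or a
   multi-byte lead byte (0xC2-0xF4). Same test as the loop above. */
size_t utf8_count_lead(const unsigned char *s, size_t len)
{
    size_t count = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (s[i] <= 0x7F || (s[i] >= 0xC2 && s[i] <= 0xF4))
            count++;
    return count;
}

/* "Reversed" test (assumed reading): count every byte that is not a
   continuation byte (10xxxxxx). Unlike the function above, this also
   counts the invalid bytes 0xC0, 0xC1 and 0xF5-0xFF. */
size_t utf8_count_noncont(const unsigned char *s, size_t len)
{
    size_t count = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

On well-formed input both functions return the same number. If they differ on the sample text, the text contains bytes that are not legal anywhere in UTF-8.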
The above is off by 2 characters on the sample text. I am looking for the cause of the discrepancy.
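One way to hunt for such a discrepancy is to compare the estimate against a stricter pass that also checks the continuation bytes. The following is only a sketch under my own assumptions: the type utf8_tally and the function utf8_tally_strict are hypothetical names, and the choice to count a failed lead byte as one invalid byte and then resume at the next byte is arbitrary. The byte ranges follow the well-formed sequence rules of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: well-formed characters and rejected bytes. */
typedef struct { size_t chars; size_t invalid; } utf8_tally;

utf8_tally utf8_tally_strict(const unsigned char *s, size_t len)
{
    utf8_tally t = {0, 0};
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t need;                        /* continuation bytes expected */

        if (b <= 0x7F) { t.chars++; i++; continue; }
        else if (b >= 0xC2 && b <= 0xDF) need = 1;
        else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;       /* reject values > U+10FFFF */
        } else { t.invalid++; i++; continue; } /* stray or illegal byte */

        int ok = (len - i > need);          /* enough bytes left? */
        for (size_t k = 1; ok && k <= need; k++) {
            unsigned char c = s[i + k];
            ok = (k == 1) ? (c >= lo && c <= hi) : (c >= 0x80 && c <= 0xBF);
        }
        if (ok) { t.chars++; i += need + 1; }
        else    { t.invalid++; i++; }       /* resume at the next byte */
    }
    return t;
}
```

The lead-byte test in the loop above counts a byte such as 0xED or a truncated 0xC3 as a character even when the bytes after it do not form a valid sequence. If the strict pass reports any invalid bytes for the sample text, that is a likely place to look for the missing 2.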