I am using this text [1] to test UTF-8 character counting. Does anybody know how to get an authoritative count of how many characters it should actually contain? Mine possibly counts invalid ones; should there be any in that text?

I am using this primitive counting mechanism, inspired by [2]. Proposals for improvement are welcome. Does size_t make sense?

    /* UTF-8 estimate */
    unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
    unsigned char *q = p + tsvalue(rb)->len;
    size_t count = 0;
    while (p < q) {
      /* count ASCII bytes and lead bytes; this test can be reversed
         to skip continuation bytes instead */
      if (*p <= 127 || (*p >= 194 && *p <= 244))
        count++;
      p++;
    }

The above is off by 2 characters on the sample text. I am looking for the cause of the discrepancy.

Thanks,
Henning

[1] https://gist.github.com/768309
[2] http://lua-users.org/wiki/LuaUnicode
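
P.S. For comparison, here is a minimal sketch of a stricter counter that also checks the continuation bytes, so malformed bytes are reported separately instead of being counted or silently skipped. The name utf8_count and its invalid out-parameter are my own invention, not anything from Lua. It assumes a plain byte buffer and a length, and it does not reject overlong three- and four-byte forms or surrogates, so it is still only an estimate.

```c
#include <stddef.h>

/* Hypothetical helper (not a Lua API): counts well-formed UTF-8
   sequences in s[0..len) and stores in *invalid the number of bytes
   that are not part of one. Overlong 3/4-byte forms and surrogates
   are NOT rejected. */
static size_t utf8_count(const unsigned char *s, size_t len, size_t *invalid)
{
  size_t i = 0, count = 0, bad = 0;
  while (i < len) {
    unsigned char c = s[i];
    size_t need;  /* continuation bytes expected after c */
    size_t k;     /* bytes of the sequence seen so far */

    if (c <= 0x7F)                   need = 0;  /* ASCII */
    else if (c >= 0xC2 && c <= 0xDF) need = 1;  /* 2-byte lead */
    else if (c >= 0xE0 && c <= 0xEF) need = 2;  /* 3-byte lead */
    else if (c >= 0xF0 && c <= 0xF4) need = 3;  /* 4-byte lead */
    else {
      /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
      bad++;
      i++;
      continue;
    }

    /* every expected continuation byte must look like 10xxxxxx */
    k = 1;
    while (k <= need && i + k < len && (s[i + k] & 0xC0) == 0x80)
      k++;
    if (k <= need) {
      /* truncated sequence: flag the lead byte, rescan from the next byte */
      bad++;
      i++;
      continue;
    }

    count++;
    i += need + 1;
  }
  if (invalid)
    *invalid = bad;
  return count;
}
```

If the text contains truncated sequences, the primitive loop still counts their lead bytes as characters while this sketch does not, so comparing the two counts (and looking at the invalid total) may help narrow down where a difference comes from.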