I am using this text [1] to test UTF-8 character counting.

Does anybody know how to get an authoritative count of how many characters that text should actually contain, minus any invalid ones, should there be any in it?

I am using this primitive counting mechanism, inspired by [2]. Suggestions for improvement are welcome.

Does size_t make sense?

      /* UTF-8 estimate */
      unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
      unsigned char *q = p + tsvalue(rb)->len;
      size_t count = 0;
      while (p < q) {
          /* count ASCII and lead bytes; this can be reversed */
          if (*p <= 127 || (*p >= 194 && *p <= 244))
              count++;
          p++;  /* always advance, also over continuation bytes */
      }

The above is off by 2 characters on the sample text. I am looking for the cause of the discrepancy.
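
One way to narrow down such a discrepancy is to compare two counting rules on the same bytes. The sketch below is standalone C, not tied to the Lua internals used above. The function names utf8_count_leads and utf8_count_noncont are made up for illustration. The first applies the range test from the loop above; the second is the "reversed" test that counts every byte that is not a continuation byte (10xxxxxx). On well-formed UTF-8 the two agree. They differ only on the bytes 0xC0, 0xC1 and 0xF5-0xFF, which never occur in valid UTF-8, so a difference between the two counts would point to invalid bytes in the input. Whether that explains the 2 characters here is only a guess.

```c
#include <stddef.h>

/* Count bytes that can start a well-formed UTF-8 sequence:
   ASCII (0x00-0x7F) or a lead byte (0xC2-0xF4, i.e. 194-244). */
size_t utf8_count_leads(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < len; i++)
        if (s[i] <= 0x7F || (s[i] >= 0xC2 && s[i] <= 0xF4))
            count++;
    return count;
}

/* The "reversed" rule: count every byte that is NOT a
   continuation byte (bit pattern 10xxxxxx). */
size_t utf8_count_noncont(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

Neither function validates the input: a stray continuation byte is counted by neither, and a lead byte followed by too few continuation bytes still counts as one character. As for size_t, it is the type C uses for object sizes and byte counts, so it can hold any count that fits the string length.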

Thanks,
Henning

[1] https://gist.github.com/768309
[2] http://lua-users.org/wiki/LuaUnicode