- Subject: Re: UTF-8 testing
- From: Sean Conner <sean@...>
- Date: Thu, 6 Jan 2011 16:59:45 -0500
It was thus said that the Great Henning Diedrich once stated:
> I am using this text [1] to test UTF-8 character counting.
>
> Does somebody know how to get an authoritative count of how many
> characters there should actually be? Minus any invalid ones, should
> there be any in that text?
>
> I am using this primitive counting mechanism, inspired by [2].
> Suggestions for improvement are welcome.
>
> Does size_t make sense?
>
> /* UTF-8 estimate */
> unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
> unsigned char *q = p + tsvalue(rb)->len;
> size_t count = 0;
> while(p < q) {
>     /* count only lead bytes; this test can be reversed */
>     if(*p <= 127 || (*p >= 194 && *p <= 244))
>         count++;
>     p++;
> }
>
> The above is off by 2 characters on the sample text. I am looking for
> the cause of the discrepancy.
Here's a way of determining the length of a UTF-8 string; it assumes a
valid UTF-8 string to begin with:
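#include <stddef.h>  /* size_t */

/* Number of bytes that follow each possible first byte of a sequence. */
static const char m_trailingbytes[256] =
{
  /* 0x00-0x7F: ASCII, no trailing bytes */
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  /* 0x80-0xBF: continuation bytes, never a valid first byte */
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  /* 0xC0-0xDF: 2-byte sequences */
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  /* 0xE0-0xEF: 3-byte; 0xF0-0xF7: 4-byte; 0xF8-0xFF: obsolete longer forms */
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};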
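/* Returns the number of code points in the size bytes at st, or 0 if the
   last sequence is cut off.  Note that an empty string also returns 0. */
size_t utflen(const char *st,size_t size)
{
  const unsigned char *s = (const unsigned char *)st;
  size_t               len;
  size_t               l;

  for (len = 0 ; size ; )
  {
    len++;
    l = m_trailingbytes[*s] + 1;  /* total bytes in this sequence */
    if (size < l)  /* UTF-8 sequence cut off, makes it invalid */
      return 0;
    s    += l;
    size -= l;
  }
  return len;
}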
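
For comparison, here is a minimal sketch of a table-free way to get the same count. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern counts the code points. Like utflen() it assumes the input is valid UTF-8, but unlike utflen() it cannot notice a sequence that is cut off at the end. The name utf8_count is made up for this illustration and is not from the original post.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): counts code points by
   counting every byte that is NOT a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8; performs no validation. */
size_t utf8_count(const char *st, size_t size)
{
  const unsigned char *s = (const unsigned char *)st;
  size_t               count = 0;
  size_t               i;

  for (i = 0 ; i < size ; i++)
    if ((s[i] & 0xC0) != 0x80)  /* first byte of a sequence, or ASCII */
      count++;
  return count;
}
```

For example, the 6 bytes "h\xC3\xA9llo" hold 5 code points, so utf8_count("h\xC3\xA9llo", 6) returns 5. On valid, complete input this should agree with utflen(), which makes it a cheap cross-check when two counts disagree.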
-spc (Been doing a lot of UTF-8 wrangling recently ... )