Re: UTF-8 testing

lua-l archive


Subject: Re: UTF-8 testing
From: Sean Conner <sean@...>
Date: Thu, 6 Jan 2011 16:59:45 -0500

It was thus said that the Great Henning Diedrich once stated:
> I am using this text [1] to test UTF-8 character counting.
> 
> Does somebody know how to get an authoritative count of how many that
> should actually be? Minus possible invalid ones, should there be any in
> that text?
> 
> I am using this primitive counting mechanism. Inspired by [2]. Proposals 
> to improve are welcome.
> 
> Does size_t make sense?
> 
>       /* UTF-8 estimate */
>       unsigned char *p = (unsigned char *)getstr(rawtsvalue(rb));
>       unsigned char *q = p + tsvalue(rb)->len;
>       size_t count = 0;
>       while(p < q) {
>           /* count lead bytes only; this test can be reversed */
>           if(*p <= 127 || (*p >= 194 && *p <= 244))
>              count++;
>           p++;
>       }
> 
> The above is off by 2 characters on the sample text. I am looking for
> the cause of the discrepancy.
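
  For reference, the quoted loop looks like a variant of the usual trick of
counting every byte that is not a continuation byte (10xxxxxx). A minimal
sketch of that idea follows, assuming the input is already valid UTF-8; the
name count_lead_bytes is made up for illustration and is not from the
original post:

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): counts every byte
   that is NOT a continuation byte (10xxxxxx).  On valid UTF-8 this
   equals the number of encoded characters; on invalid input it is
   only an estimate. */
size_t count_lead_bytes(const char *st, size_t size)
{
  const unsigned char *s     = (const unsigned char *)st;
  size_t               count = 0;
  size_t               i;

  for (i = 0 ; i < size ; i++)
    if ((s[i] & 0xC0) != 0x80)   /* top two bits are not "10" */
      count++;

  return count;
}
```

  Unlike the range test in the quoted code, this also counts bytes such as
0xC0, 0xC1 and 0xF5-0xFF, which never appear in valid UTF-8, so the two
approaches can disagree on malformed input.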

  Here's a way of determining the length of a UTF-8 string; it assumes a
valid UTF-8 string to begin with:

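#include <stddef.h>   /* size_t */

/* Number of trailing bytes that follow each possible lead byte.  Stray
   continuation bytes (0x80-0xBF) map to 0 and so count as one character. */
static const char m_trailingbytes[256] =
{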
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

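/* Returns the number of characters in the first size bytes of st, or 0
   if the last sequence is cut off.  Note that an empty string also
   yields 0.  Trailing bytes are not checked. */
size_t utflen(const char *st,size_t size)
{
  const unsigned char *s = (const unsigned char *)st;
  size_t               len;
  size_t               l;

  for (len = 0 ; size ; )
  {
    len++;
    l = m_trailingbytes[*s] + 1;   /* total bytes in this sequence */

    if (size < l) /* UTF-8 sequence cut off, makes it invalid */
      return 0;

    s    += l;
    size -= l;
  }
  return len;
}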
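
  If the input can't be trusted to be valid, the same loop can check the
trailing bytes as it goes. What follows is only a sketch under a few
assumptions, not a full validator: the name utflen_checked is made up, it
accepts at most 4-byte sequences (the RFC 3629 limit, unlike the table above,
which allows the older 5- and 6-byte forms), and it does not reject overlong
encodings or surrogates. It returns (size_t)-1 on error so that an empty
string (length 0) can be told apart from a bad one:

```c
#include <stddef.h>

/* Hypothetical variant (not from the original post): returns the number
   of characters, or (size_t)-1 if a sequence is cut off, a lead byte is
   invalid, or a trailing byte is not of the form 10xxxxxx.  Overlong
   forms and surrogates are NOT rejected. */
size_t utflen_checked(const char *st, size_t size)
{
  const unsigned char *s   = (const unsigned char *)st;
  size_t               len = 0;
  size_t               n;   /* number of trailing bytes expected */
  size_t               i;

  while (size > 0)
  {
    if      (*s < 0x80)           n = 0;   /* ASCII */
    else if ((*s & 0xE0) == 0xC0) n = 1;   /* 110xxxxx */
    else if ((*s & 0xF0) == 0xE0) n = 2;   /* 1110xxxx */
    else if ((*s & 0xF8) == 0xF0) n = 3;   /* 11110xxx */
    else return (size_t)-1;                /* stray continuation or bad lead */

    if (size < n + 1)
      return (size_t)-1;                   /* sequence cut off */

    for (i = 1 ; i <= n ; i++)
      if ((s[i] & 0xC0) != 0x80)
        return (size_t)-1;                 /* not a continuation byte */

    s    += n + 1;
    size -= n + 1;
    len++;
  }
  return len;
}
```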

  -spc (Been doing a lot of UTF-8 wrangling recently ... )
