lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
> 
> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
> 
> utf8.len() already does this.

utf8.len doesn't match the standard definition of UTF-8. Consider this example of an invalid sequence, taken from https://tools.ietf.org/html/rfc3629#section-4 :

Lua 5.3.1  Copyright (C) 1994-2015 Lua.org, PUC-Rio
> utf8.len("\xED\xA1\x8C")
1

The production it's supposed to match is:

>    UTF8-octets = *( UTF8-char )
>    UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
>    UTF8-1      = %x00-7F
>    UTF8-2      = %xC2-DF UTF8-tail
>    UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
>                  %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
>    UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
>                  %xF4 %x80-8F 2( UTF8-tail )
>    UTF8-tail   = %x80-BF


Jay