- Subject: Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)
- From: Ricardo Ramos Massaro <ricardo.massaro@...>
- Date: Fri, 30 Jun 2017 22:59:20 -0300
On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <nop@nop.com> wrote:
> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
> expect={1}, rfc=true}
>
> MUST from RFC 3629:
> astral char 2
> expected 1
These tests are nice examples of where Lua's utf8.len() diverges from
the RFC, but the last one confuses me.
It looks like that byte sequence encodes two code points: U+FEFF and
U+233B4. Do you mean to say that utf8.len() should not count U+FEFF
because it appears at the start of the string (and so should be
considered a BOM)? That doesn't look like it's mandated by the RFC,
and I don't think it would be desirable behavior for utf8.len().
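
To make the reading above concrete, here is a small sketch (assuming
Lua 5.3 or later, for the utf8 library) that decodes the byte sequence
from the quoted test and prints its code points. The helper names
codepoints and len_skipping_bom are mine, purely for illustration; they
are not part of Lua or of the test suite being discussed.

```lua
-- Byte sequence from the quoted "astral char" test.
local s = "\xEF\xBB\xBF\xF0\xA3\x8E\xB4"

-- Hypothetical helper: list the code points in str as "U+XXXX" strings.
local function codepoints(str)
  local out = {}
  for _, cp in utf8.codes(str) do
    out[#out + 1] = string.format("U+%04X", cp)
  end
  return out
end

-- Hypothetical helper: what a caller could do if it wants a leading
-- U+FEFF treated as a BOM and left out of the count.
local function len_skipping_bom(str)
  if str:sub(1, 3) == "\xEF\xBB\xBF" then
    return utf8.len(str, 4)  -- start counting after the 3-byte BOM
  end
  return utf8.len(str)
end

local cps = codepoints(s)
print(table.concat(cps, " "))  --> U+FEFF U+233B4
print(utf8.len(s))             --> 2
print(len_skipping_bom(s))     --> 1
```

So a caller that wants the "expected 1" result can strip the BOM
itself, while utf8.len() keeps reporting what is actually in the
string.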
- Ricardo