lua-users home
lua-l archive

On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <nop@nop.com> wrote:
> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
>   expect={1}, rfc=true}
>
> MUST from RFC 3629:
> astral char     2
> expected        1

These tests are nice examples of where Lua's utf8.len() diverges from
the RFC, but the last one confuses me.

It looks like that byte sequence encodes two code points: U+FEFF and
U+233B4. Do you mean to say that utf8.len() should not count U+FEFF
because it appears at the start of the string (and so should be
considered a BOM)? That doesn't look like it's mandated by the RFC,
and I don't think it would be desirable behavior for utf8.len().
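
For reference, here is a minimal sketch of how to check this, assuming
Lua 5.3 or later (for the utf8 library). It decodes the byte sequence
from the test and counts its code points. The strip_bom helper is
hypothetical, not part of Lua; it only illustrates that a caller who
wants a leading BOM ignored can remove it before calling utf8.len().

```lua
-- Byte sequence from the "astral char" test case.
local s = "\xEF\xBB\xBF\xF0\xA3\x8E\xB4"

-- Number of code points as counted by the standard library.
local count = utf8.len(s)

-- Collect each decoded code point in U+XXXX notation.
local codepoints = {}
for _, cp in utf8.codes(s) do
  codepoints[#codepoints + 1] = string.format("U+%04X", cp)
end

-- Hypothetical helper (an assumption, not a Lua API): drop a leading
-- UTF-8 encoded U+FEFF if the string starts with one.
local function strip_bom(str)
  if str:sub(1, 3) == "\xEF\xBB\xBF" then
    return str:sub(4)
  end
  return str
end

local count_without_bom = utf8.len(strip_bom(s))

print(count, table.concat(codepoints, " "), count_without_bom)
```

With the BOM left in place the count is 2; after stripping it explicitly
the count is 1, which matches the "expected" value in the test.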

- Ricardo