- Subject: Re: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)
- From: Jay Carlson <nop@...>
- Date: Sat, 1 Jul 2017 00:18:05 -0400
> On Jun 30, 2017, at 9:59 PM, Ricardo Ramos Massaro <ricardo.massaro@gmail.com> wrote:
>
> On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <nop@nop.com> wrote:
>> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
>> expect={1}, rfc=true}
>>
>> MUST from RFC 3629:
>> astral char 2
>> expected 1
>
> These tests are nice examples of where Lua's utf8.len() diverges from
> the RFC, but the last one confuses me.
>
> It looks like that byte sequence encodes two code points: U+FEFF and
> U+233B4.
Oops, that one confused me too! You are right, and I was not paying attention to what I was cutting and pasting. Thanks.
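For reference, here is a quick check of that byte sequence. It is a minimal sketch that assumes Lua 5.3 or later with the standard utf8 library; the variable name s is only for illustration.

```lua
-- The byte sequence from the "astral char" test case quoted above.
local s = "\xEF\xBB\xBF\xF0\xA3\x8E\xB4"

-- utf8.len counts code points, so the leading U+FEFF is included.
print(utf8.len(s))  --> 2

-- Decode every code point in the string and show them in U+ notation.
print(("U+%04X U+%04X"):format(utf8.codepoint(s, 1, -1)))  --> U+FEFF U+233B4
```

So the count of 2 is what one would expect for this input: the first three bytes encode U+FEFF and the remaining four encode U+233B4.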
> Do you mean to say that utf8.len() should not count U+FEFF
> because it appears at the start of the string (and so should be
> considered a BOM)? That doesn't look like it's mandated by the RFC,
> and I don't think would be a desired behavior for utf8.len().
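I agree. If there were to be any special behavior, it would have been to return nil,1, on the theory that a leading BOM is not a real character. (Strictly speaking, U+FEFF is a format character rather than a Unicode noncharacter; the noncharacter is its byte-swapped counterpart, U+FFFE.)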
I think knowledge of the BOM should stay outside the Lua core, on the basis that it is more a Unicode property than a UTF-8 property. Plus the BOM should die.
On the other hand, we could often use a little help with reading input anyway. A Lua library handling Unicode input processing could cover the BOM, plus deal with replacement characters in the face of invalid input. (See “Unicode Security Considerations”, UTR-36 http://www.unicode.org/reports/tr36/ for why doing this right can be important.)
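As a rough illustration of what such a library might offer, here is a minimal sketch, assuming Lua 5.3 or later. The function names strip_bom and replace_invalid are hypothetical and not part of any existing library. The sketch relies on utf8.len's own notion of validity, which may be looser than RFC 3629, and it emits one U+FFFD per rejected byte, which is simpler than the replacement practice the Unicode documents recommend.

```lua
local BOM = "\xEF\xBB\xBF"          -- UTF-8 encoding of U+FEFF
local REPLACEMENT = "\xEF\xBF\xBD"  -- UTF-8 encoding of U+FFFD

-- Hypothetical helper: drop a leading BOM, if any.
-- Returns the remaining string and whether a BOM was removed.
local function strip_bom(s)
  if s:sub(1, 3) == BOM then
    return s:sub(4), true
  end
  return s, false
end

-- Hypothetical helper: replace each byte that utf8.len rejects with U+FFFD.
local function replace_invalid(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    -- On failure, utf8.len returns nil plus the position of the first invalid byte.
    local len, bad = utf8.len(s, i)
    if len then
      out[#out + 1] = s:sub(i)       -- the rest of the string is valid
      break
    end
    out[#out + 1] = s:sub(i, bad - 1) -- valid run before the bad byte
    out[#out + 1] = REPLACEMENT
    i = bad + 1                       -- resume after the bad byte
  end
  return table.concat(out)
end

local text = replace_invalid((strip_bom("\xEF\xBB\xBFa\xFFb")))
print(utf8.len(text))  --> 3
```

Keeping both steps in a library function, outside utf8.len itself, leaves the core function a plain code point counter while still giving callers a way to normalize input before processing it.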
--
Jay Carlson
nop@nop.com