> On Jun 30, 2017, at 9:59 PM, Ricardo Ramos Massaro <ricardo.massaro@gmail.com> wrote:
> 
> On Fri, Jun 30, 2017 at 2:44 PM, Jay Carlson <nop@nop.com> wrote:
>> u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
>>  expect={1}, rfc=true}
>> 
>> MUST from RFC 3629:
>> astral char     2
>> expected        1
> 
> These tests are nice examples of where Lua's utf8.len() diverges from
> the RFC, but the last one confuses me.
> 
> It looks like that byte sequence encodes two code points: U+FEFF and
> U+233B4.

Oops, that one confused me too! You are right; I was not paying attention to what I was cutting and pasting. Thanks.
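
The two-code-point reading is easy to check from Lua itself. This is a minimal sketch, assuming Lua 5.3 or later (for the utf8 library and "\x" escapes); the variable names are only illustrative:

```lua
-- The byte string from the "astral char" test case quoted above.
local s = "\xEF\xBB\xBF\xF0\xA3\x8E\xB4"

-- utf8.len counts code points, so a leading U+FEFF is counted like any other.
local count = utf8.len(s)

-- Collect every decoded code point in U+XXXX notation.
local points = {}
for _, cp in utf8.codes(s) do
  points[#points + 1] = string.format("U+%04X", cp)
end

print(count, table.concat(points, " "))  --> 2   U+FEFF U+233B4
```

So utf8.len() returning 2 here is consistent with the bytes; it was the expect={1} in my test that was wrong.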

> Do you mean to say that utf8.len() should not count U+FEFF
> because it appears at the start of the string (and so should be
> considered a BOM)? That doesn't look like it's mandated by the RFC,
> and I don't think would be a desired behavior for utf8.len().

I agree. If there were to be any special behavior, it would have been to return nil, 1, since the BOM is a Unicode noncharacter.

I think knowledge of the BOM should stay outside the Lua core, on the basis that it is more a Unicode property than a UTF-8 property. Plus the BOM should die.

On the other hand, we could often use a little help with reading input anyway. A Lua library for Unicode input processing could handle the BOM and also substitute replacement characters when it meets invalid input. (See “Unicode Security Considerations”, UTR #36, http://www.unicode.org/reports/tr36/ for why doing this right can be important.)
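
As a rough illustration of what such a library function might look like, here is a sketch, not a proposal for an actual API. The name clean_utf8 is hypothetical, and it assumes Lua 5.3 or later. It strips one leading BOM and substitutes U+FFFD for each byte that utf8.len() rejects, so its notion of "invalid" is whatever utf8.len() implements, including the divergences from RFC 3629 discussed in this thread. Replacing one byte at a time is a simplifying assumption; other substitution policies exist.

```lua
local BOM = "\xEF\xBB\xBF"          -- U+FEFF encoded as UTF-8
local REPLACEMENT = "\xEF\xBF\xBD"  -- U+FFFD REPLACEMENT CHARACTER

-- Hypothetical helper: returns s without a leading BOM and with every
-- byte that utf8.len() rejects replaced by U+FFFD.
local function clean_utf8(s)
  if s:sub(1, 3) == BOM then
    s = s:sub(4)
  end
  local out = {}
  local pos = 1
  while pos <= #s do
    -- On invalid input, utf8.len returns a false value plus the
    -- position of the first invalid byte.
    local n, bad = utf8.len(s, pos)
    if n then
      out[#out + 1] = s:sub(pos)      -- the rest of the string is valid
      break
    end
    out[#out + 1] = s:sub(pos, bad - 1)  -- valid run before the bad byte
    out[#out + 1] = REPLACEMENT
    pos = bad + 1                        -- resume after the bad byte
  end
  return table.concat(out)
end

print(clean_utf8("\xEF\xBB\xBFabc"))  --> abc
```

The point of doing this in a library rather than in utf8.len() is that the caller chooses the policy: a parser that must reject bad input can keep calling utf8.len() directly, while code that only displays text can run it through something like the above first.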

--
Jay Carlson
nop@nop.com
