on invalid UTF-8 byte sequences

lua-l archive


Subject: on invalid UTF-8 byte sequences
From: Stephan Hennig <sh-list@...>
Date: Wed, 31 Jan 2018 19:51:30 +0100

Hi,

the manual says this about utf8.len():

    [...] If it finds any invalid byte sequence, returns a false
    value plus the position of the first invalid byte.

To me, this does not specify which byte counts as the first invalid byte
of an invalid byte sequence.  Is it the first byte that invalidates the
sequence, or the first byte of the whole invalid sequence?

Can an interested Lua user who has not carefully studied the Unicode
specification, which is an external resource, safely infer the output of

  $ lua -e "print(utf8.len('\xc3\xc4'))"

from the Lua manual alone?

The given string is the UTF-8 encoding of the German umlaut character Ä
(bytes 0xc3 0x84), except that the second-most significant bit of the
second byte is flipped.  The first byte introduces a multi-byte sequence,
but is no longer followed by a continuation byte.  As given, both bytes
are invalid.  But which is the first invalid one?
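
For readers who want to try this themselves, here is a minimal sketch,
assuming Lua 5.3 or later (for the utf8 library and the integer bitwise
operators).  The names valid, damaged and is_continuation are made up
for illustration.  It does not state the result for the damaged string;
it only shows how to tell the two readings of the manual apart.

```lua
-- Sketch for Lua 5.3 or later; all names below are illustrative only.
local valid   = "\xc3\x84"  -- UTF-8 encoding of U+00C4 (Ä)
local damaged = "\xc3\xc4"  -- the string from the question above

-- Flipping the second-most significant bit (0x40) of 0x84 gives 0xc4.
assert(0x84 ~ 0x40 == 0xc4)

-- A UTF-8 continuation byte has the bit pattern 10xxxxxx.
local function is_continuation(byte)  -- hypothetical helper
  return (byte & 0xC0) == 0x80
end

print(is_continuation(0x84))  -- second byte of the valid string
print(is_continuation(0xc4))  -- second byte of the damaged string

-- A valid two-byte sequence counts as a single character.
print(utf8.len(valid))

-- For invalid input, utf8.len returns a false value plus a position.
-- If pos is 1, the position refers to the start of the invalid sequence;
-- if pos is 2, it refers to the byte that invalidates the sequence.
local len, pos = utf8.len(damaged)
print(len, pos)
```

Running this on the interpreter at hand answers the question for that
particular Lua version, which is exactly what the manual's wording
leaves open.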

Without giving a spoiler (hopefully), in this case[1], Lua seems to be
in line with the official Unicode/UTF-8 specs, which are clearer on the
handling of invalid UTF-8 material.  But a slight change in the Lua
manual could make it more self-contained.

Best regards,
Stephan Hennig

[1] In <http://lua-users.org/lists/lua-l/2015-12/msg00063.html>, it has
been reported that utf8.len() does not fully comply with the Unicode
specs with regard to flagging invalid UTF-8 input.  (Reproduced with Lua
5.3.4 here.)
