- Subject: Re: utf8.codes ignores spurious continuation bytes
- From: Christian Ludwig <cl@...>
- Date: Sun, 18 Sep 2022 22:30:58 +0200
> Can you give an example of a UTF-8 byte sequence where this is
> critical / happens / possibly creates misunderstandings?
> (but please also give the UTF-8 bytes in hex code).
There is no *valid* UTF-8 byte sequence where this happens. It happens
for invalid UTF-8 byte sequences that contain bytes of the form 10xxxxxx
(binary, i.e. 0x80 to 0xBF) in places where they are not used as UTF-8
continuation bytes, for example:
s = '\x61\xbf\x62'
s = '\x61\x80\x62'
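To make the bit pattern concrete, here is a minimal sketch that classifies
single bytes. The helper name is_continuation is made up for illustration
and is not part of the utf8 library; it assumes Lua 5.3 or later for the
bitwise operators.

```lua
-- is_continuation is a hypothetical helper, not part of the utf8 library.
-- A UTF-8 continuation byte has the bit pattern 10xxxxxx, so masking the
-- two highest bits (0xC0) must give exactly 0x80.
local function is_continuation(byte)
  return (byte & 0xC0) == 0x80
end

print(is_continuation(0xBF))  -- true:  10111111
print(is_continuation(0x80))  -- true:  10000000
print(is_continuation(0xFF))  -- false: 11111111
print(is_continuation(0x61))  -- false: 01100001 (ASCII 'a')
```

In both example strings the second byte matches this pattern, but no lead
byte precedes it, so the byte does not continue anything.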
The manual says about utf8.codes:
"It raises an error if it meets any invalid byte sequence."
However, it does not raise an error for such bytes. (Yes, there are other
invalid byte sequences where you do see an error message, e.g.
s = '\x61\xff\x62'.)
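A minimal sketch for reproducing this, assuming Lua 5.3 or later. The
helper try_codes and the samples table are made up for illustration. What
gets printed for the two strings with a stray continuation byte may depend
on the Lua release you run, so no expected output is given for them.
utf8.len is included for comparison because it reports the position of the
first invalid byte instead of raising an error.

```lua
-- try_codes is a hypothetical helper, not part of the utf8 library.
-- It iterates over s with utf8.codes inside pcall and returns false
-- (plus the error message) if the iteration raises an error.
local function try_codes(s)
  return pcall(function()
    for _ in utf8.codes(s) do end
  end)
end

local samples = {
  { "61 BF 62", "\x61\xbf\x62" },  -- 'a', stray continuation byte, 'b'
  { "61 80 62", "\x61\x80\x62" },  -- 'a', stray continuation byte, 'b'
  { "61 FF 62", "\x61\xff\x62" },  -- 'a', 0xFF (never valid in UTF-8), 'b'
}

for _, sample in ipairs(samples) do
  local label, s = sample[1], sample[2]
  local ok = try_codes(s)
  local len, pos = utf8.len(s)  -- nil plus a position on invalid input
  print(label, "utf8.codes raised error:", not ok, "utf8.len:", len, pos)
end
```

If utf8.codes reports no error for the first two strings while utf8.len
returns nil and position 2 for them, you are seeing the behaviour described
here.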
Is this done on purpose for continuation bytes, in which case the manual
needs to be clarified, or is it a bug in the code, which does not do what
the manual says?