- Subject: Re: utf8.codes ignores spurious continuation bytes
- From: Christian Ludwig <cl@...>
- Date: Mon, 19 Sep 2022 13:43:08 +0200
Hello bil til,
> Thanks, this is interesting info.
> I would have thought that single bytes in the range 0x80...0xBF, as in
> your two sequences above, would be allowed and would then correspond to
> the standard Unicode value (the range 0xA0...0xBF contains really
> "nice chars" like µ ² ±).
This has not been allowed since 2003. RFC 3629, Section 4, says: "An
octet sequence is valid UTF-8 only if it matches the following syntax
...", and that syntax has no place for a lone byte in 0x80...0xBF.
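To make the rule concrete, here is a minimal sketch of a validator that follows the Section 4 syntax. The name is_valid_utf8 is made up for illustration and is not part of Lua's utf8 library. It also shows why µ as the single byte 0xB5 is rejected: that is the Latin-1 encoding, while in UTF-8 µ (U+00B5) is the two bytes 0xC2 0xB5.

```lua
-- Hypothetical helper, not part of Lua's utf8 library.
-- Returns true if s matches the UTF8-octets syntax of RFC 3629, Section 4;
-- otherwise returns false plus the position where the invalid sequence starts.
local function is_valid_utf8(s)
  local i, len = 1, #s
  while i <= len do
    local b = s:byte(i)
    local n                    -- number of tail bytes that must follow
    local lo, hi = 0x80, 0xBF  -- allowed range for the first tail byte
    if b <= 0x7F then n = 0                       -- UTF8-1 (ASCII)
    elseif b >= 0xC2 and b <= 0xDF then n = 1     -- UTF8-2
    elseif b == 0xE0 then n, lo = 2, 0xA0         -- UTF8-3, no overlong forms
    elseif b == 0xED then n, hi = 2, 0x9F         -- UTF8-3, no surrogates
    elseif b >= 0xE1 and b <= 0xEF then n = 2     -- UTF8-3, remaining leads
    elseif b == 0xF0 then n, lo = 3, 0x90         -- UTF8-4, no overlong forms
    elseif b == 0xF4 then n, hi = 3, 0x8F         -- UTF8-4, max U+10FFFF
    elseif b >= 0xF1 and b <= 0xF3 then n = 3     -- UTF8-4, remaining leads
    else
      return false, i  -- stray tail byte (0x80-0xBF) or invalid lead byte
    end
    for k = 1, n do
      local t = s:byte(i + k)
      if not t or t < lo or t > hi then return false, i end
      lo, hi = 0x80, 0xBF  -- later tail bytes may use the full range
    end
    i = i + n + 1
  end
  return true
end

print(is_valid_utf8('\x61\xbf\x62'))      --> false   2
print(is_valid_utf8('\x61\xc2\xb5\x62'))  --> true
```

The special cases for 0xE0, 0xED, 0xF0 and 0xF4 mirror the restricted first-tail ranges in the RFC's grammar.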
> The current English Wikipedia article on UTF-8 states clearly that
> ASCII bytes 0...0x7F must NOT be followed by a continuation byte
> 0x80...0xBF. (I think this is a fairly new revision of the article; at
> least I did not notice this when I read the UTF-8 description there in
> more detail some months ago.)
Even if the article you mentioned is new, the rule is old: see RFC 3629.
And if you need the rule from the Unicode Consortium itself, it dates
from Unicode 4.0 (2003): see Section 3.9, definition D36.
> I would just be a bit worried that many UTF-8 encoders "running around
> on the web" somehow ignore this rule and use such bytes
> 0xA0...0xBF as "single chars" for their Unicode equivalents (like
> µ, ...).
Apart from Lua's utf8.codes, I don't know of any language that ignores
this rule, which dates back to 2003. BTW, Lua's utf8.len gets it right
and returns fail for the given examples.
> > Examples:
> > s = '\x61\xbf\x62'
> > s = '\x61\x80\x62'
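For reference, a small sketch of checking the two examples with utf8.len, assuming Lua 5.3 or later. On failure it returns fail (nil) plus the position of the first invalid byte:

```lua
-- The two example strings quoted above.
local examples = { '\x61\xbf\x62', '\x61\x80\x62' }

for _, s in ipairs(examples) do
  -- utf8.len returns the number of characters, or fail (nil) plus the
  -- position of the first invalid byte.
  local n, pos = utf8.len(s)
  print(n, pos)  --> nil   2
end
```

Calling utf8.len before iterating is a cheap way to reject such strings up front.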