Re: utf8.codes ignores spurious continuation bytes

lua-l archive


Subject: Re: utf8.codes ignores spurious continuation bytes
From: Christian Ludwig <cl@...>
Date: Mon, 19 Sep 2022 13:43:08 +0200

Hello bil til,

> Thanks, this is interesting info.
> 
> I would have thought that single bytes in the range 0x80... 0xBF as in
> your two sequences above would be allowed and then correspond to the
> standard unicode value. (In the range 0xA0...0xBF this contains really
> "nice chars" like µ ² ±.)

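This has not been allowed since 2003. RFC 3629, Section 4 says "An octet
sequence is valid UTF-8 only if it matches the following syntax ...", and
a lone byte in the range 0x80...0xBF matches none of the alternatives in
that syntax.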
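
To make the rule concrete, here is a minimal sketch of a checker that
follows the RFC 3629, Section 4 syntax byte by byte. The name
is_valid_utf8 is made up for this mail and is not part of Lua's utf8
library; it assumes Lua 5.2 or later for the "\x" escapes.

```lua
-- Hypothetical helper (not part of Lua's utf8 library): checks a string
-- against the RFC 3629, Section 4 syntax.
-- Returns true, or false plus the byte position where matching failed.
local function is_valid_utf8(s)
  local i, n = 1, #s

  -- True if a byte exists at position p and lies in [lo, hi].
  local function between(p, lo, hi)
    local b = s:byte(p)
    return b ~= nil and b >= lo and b <= hi
  end

  while i <= n do
    local b = s:byte(i)
    local len  -- stays nil for bytes that cannot start a sequence
    if b <= 0x7F then
      len = 1                                  -- UTF8-1
    elseif b >= 0xC2 and b <= 0xDF then
      len = between(i + 1, 0x80, 0xBF) and 2   -- UTF8-2
    elseif b >= 0xE0 and b <= 0xEF then
      -- UTF8-3: the allowed range of the second byte depends on the lead
      local lo, hi = 0x80, 0xBF
      if b == 0xE0 then lo = 0xA0 elseif b == 0xED then hi = 0x9F end
      len = between(i + 1, lo, hi)
            and between(i + 2, 0x80, 0xBF) and 3
    elseif b >= 0xF0 and b <= 0xF4 then
      -- UTF8-4: same idea, restricted for lead bytes F0 and F4
      local lo, hi = 0x80, 0xBF
      if b == 0xF0 then lo = 0x90 elseif b == 0xF4 then hi = 0x8F end
      len = between(i + 1, lo, hi)
            and between(i + 2, 0x80, 0xBF)
            and between(i + 3, 0x80, 0xBF) and 4
    end
    -- Lone continuation bytes (0x80..0xBF), 0xC0, 0xC1 and 0xF5..0xFF
    -- end up here with len == nil; bad tails end up with len == false.
    if not len then return false, i end
    i = i + len
  end
  return true
end

print(is_valid_utf8("\x61\xbf\x62"))  --> false   2
print(is_valid_utf8("a\xC2\xB5b"))    --> true
```

The second call shows how µ (U+00B5) is actually written in UTF-8: as
the two bytes C2 B5, not as the single byte B5.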

> 
> In the new version for Wiki UTF8 (english) it is really stated
> clearly, that ASCII bytes 0...0x7F must NOT be followed by a
> continuation character 0x80... 0xBF... . (I think this is quite new
> Wiki article, some months ago I did not recognize this at least when I
> looked at this UTF8 description there in more detail).

Even if the article you mentioned is new, the rule is old: see RFC 3629.
And if you want the rule from the Unicode Consortium itself, it goes
back to Unicode 4.0, which came out in 2003. There: Chapter 3.9, rule D36.

> 
> I just would be a bit anxious that many UTF8 encodings "running around
> in the web" would somehow ignore this rule, and just use such chars
> 0xA0...0xBF also as "single chars" for their Unicode equivalents (like
> µ, ...).

Apart from Lua's utf8.codes, I don't know of any language that ignores
this very old rule from 2003. BTW, Lua's utf8.len gets it right and
returns fail for the given examples.

> > Examples:
> > s = '\x61\xbf\x62'
> > s = '\x61\x80\x62'
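
A small sketch to compare the two functions on these examples (Lua 5.3
or later). utf8.len should report fail together with the position of the
first invalid byte. What utf8.codes does may depend on the Lua release
you run, so the sketch wraps it in pcall and assumes neither outcome.

```lua
-- The two example strings from this thread:
-- 'a', a lone continuation byte, 'b'.
local samples = { "\x61\xbf\x62", "\x61\x80\x62" }

for _, s in ipairs(samples) do
  -- utf8.len returns fail plus the position of the first invalid byte.
  local n, badpos = utf8.len(s)
  print("utf8.len:", n, badpos)

  -- utf8.codes either iterates silently or raises an error, so guard it.
  local ok, err = pcall(function()
    for pos, cp in utf8.codes(s) do
      print("utf8.codes:", pos, cp)
    end
  end)
  if not ok then
    print("utf8.codes raised:", err)
  end
end
```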

Bye
C. Ludwig

