Re: utf8.codes ignores spurious continuation bytes

lua-l archive


Subject: Re: utf8.codes ignores spurious continuation bytes
From: bil til <biltil52@...>
Date: Mon, 19 Sep 2022 08:07:46 +0200

Thanks, this is interesting info.

I would have thought that single bytes in the range 0x80...0xBF, as in
your two example sequences, would be allowed and would then correspond
to the standard Unicode value (the range 0xA0...0xBF contains really
"nice chars" like µ ² ±).
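To make the byte-level picture concrete, here is a small sketch, assuming Lua 5.3 or later (where the standard utf8 library exists). It shows that a code point such as U+00B5 (µ) takes two bytes in UTF-8, while the lone byte 0xB5 is only a continuation byte and is rejected by utf8.len:

```lua
-- Code points U+00A0..U+00BF need two bytes in UTF-8; a lone byte in
-- 0x80..0xBF is a continuation byte, not a character of its own.
local micro = utf8.char(0xB5)                         -- U+00B5 MICRO SIGN
print(#micro)                                         --> 2
print(string.format("%02X %02X", micro:byte(1, -1)))  --> C2 B5

-- A lone 0xB5 byte is how Latin-1 encodes µ; utf8.len does not accept it
-- and returns a false value plus the position of the offending byte.
print(utf8.len("\xB5"))
```

So the "nice chars" are available in UTF-8, just not as single bytes.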

The current English Wikipedia article on UTF-8 states clearly that
ASCII bytes 0...0x7F must NOT be followed by a continuation byte
0x80...0xBF. (I think this is a fairly new version of the article; at
least I did not notice this rule some months ago, when I looked at the
UTF-8 description there in more detail.)

I would just be a bit anxious that many UTF-8 encodings "running around
on the web" somehow ignore this rule and use such bytes 0xA0...0xBF as
"single chars" for their Unicode equivalents (like µ, ...).
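One way to check whether a given string follows this rule is to run it through utf8.len, which validates strictly. The following is only a sketch, assuming Lua 5.3 or later; the helper name first_invalid is made up for illustration and is not part of any library. It is applied to the two sequences quoted below and to a correctly encoded variant:

```lua
-- Hypothetical helper: returns the byte position of the first invalid
-- UTF-8 sequence in s, or nil if the whole string is valid.
local function first_invalid(s)
  local n, pos = utf8.len(s)
  if n then return nil end
  return pos
end

print(first_invalid("\x61\xbf\x62"))      --> 2   (stray continuation byte)
print(first_invalid("\x61\x80\x62"))      --> 2   (stray continuation byte)
print(first_invalid("\x61\xC2\xB5\x62"))  --> nil ("a", U+00B5, "b" is valid)
```

Input that fails such a check is most likely Latin-1 (or similar) text mislabeled as UTF-8, rather than a different flavor of UTF-8.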

On Sun, 18 Sep 2022 at 22:31, Christian Ludwig <cl@exomail.to> wrote:
> Examples:
> s = '\x61\xbf\x62'
> s = '\x61\x80\x62'

