- Subject: utf8.codes ignores spurious continuation bytes
- From: Christian Ludwig <cl@...>
- Date: Sun, 18 Sep 2022 18:14:39 +0200
Hello Lua-Community,
I have the following question:
Lua 5.4.4 Copyright (C) 1994-2022 Lua.org, PUC-Rio
> for pos, cp in utf8.codes('in\xbfvalid') do print(pos, cp) end
1 105
2 110
4 118
5 97
6 108
7 105
8 100
The stray continuation byte at position 3 is silently skipped: utf8.codes
ignores any spurious continuation bytes.
https://www.lua.org/manual/5.4/manual.html#pdf-utf8.codes
says: "It raises an error if it meets any invalid byte sequence."
But in the source
https://www.lua.org/source/5.4/lutf8lib.c.html
this looks deliberate to me. In iter_aux:
  if (n < len) {
    while (iscont(s + n)) n++;  /* skip continuation bytes */
  }
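To illustrate why this loop cannot tell a stray continuation byte from a
legitimate one, here is a minimal standalone sketch. It is not the code
from lutf8lib.c: is_cont mirrors the bit test I assume the iscont macro
performs (a byte of the form 10xxxxxx), and skip_from is a hypothetical
helper that applies the same loop to the 1-based position of the previous
character.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed equivalent of the iscont test: true for bytes of the form 10xxxxxx. */
static int is_cont(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper, not part of Lua. 'n' is the 1-based position of the
   previous character, which is also the 0-based offset of the byte right
   after that character's first byte. Returns the 0-based offset at which
   the next decode would start. Relies on 's' being NUL-terminated, so the
   loop stops at s[len] at the latest. */
static size_t skip_from(const char *s, size_t len, size_t n) {
    assert(s != NULL);
    if (n < len) {
        /* Skips every continuation byte, whether or not it belongs
           to the previous character. */
        while (is_cont((unsigned char)s[n])) n++;
    }
    return n;
}
```

For the 8-byte string "in\xbfvalid", skip_from(s, 8, 2) returns 3: the
previous character 'n' is a single byte, yet the loop still steps over
\xbf, so the next character is reported at position 4, as in the session
above.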
Is this intentional? Is utf8.codes supposed to behave like this?
If this is done on purpose, then I misread the manual. Sorry.
If it's not on purpose, then iter_aux has to be changed: for example,
delete the 3 lines above and, a few lines below, use the "next" pointer
returned by utf8_decode (instead of n+1) to update "n".
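A minimal sketch of the behaviour I would expect after such a change,
assuming a simplified decoder. decode_one and first_invalid are
hypothetical names, not functions from lutf8lib.c, and decode_one
deliberately omits further validity checks (such as rejecting overlong
forms). The point is only that advancing by the decoded length, with no
skipping, leaves a stray continuation byte to be rejected where it stands.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified decoder: returns the number of bytes consumed,
   or 0 if the sequence starting at s is invalid or truncated. */
static size_t decode_one(const unsigned char *s, size_t avail,
                         unsigned long *cp) {
    size_t need, i;
    unsigned char c;
    if (avail == 0) return 0;
    c = s[0];
    if (c < 0x80) { *cp = c; return 1; }                    /* ASCII */
    if ((c & 0xE0) == 0xC0)      { need = 1; *cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { need = 2; *cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { need = 3; *cp = c & 0x07; }
    else return 0;  /* stray continuation byte or invalid lead byte */
    if (need >= avail) return 0;                            /* truncated */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* missing continuation */
        *cp = (*cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    return need + 1;
}

/* Returns the 0-based offset of the first invalid byte, or len if the
   whole string decodes. The offset advances by the decoded length only. */
static size_t first_invalid(const char *s, size_t len) {
    size_t n = 0;
    assert(s != NULL);
    while (n < len) {
        unsigned long cp;
        size_t used = decode_one((const unsigned char *)s + n, len - n, &cp);
        if (used == 0) return n;  /* an iterator would raise an error here */
        n += used;
    }
    return len;
}
```

With this sketch, first_invalid("in\xbfvalid", 8) returns 2, the 0-based
offset of the stray byte (position 3 in Lua terms), while a well-formed
string such as "h\xc3\xa9llo" (6 bytes) yields 6.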
Bye
C. Ludwig