On 06/09/2021 15:50, Roberto Ierusalimschy wrote:
The whole mess of UB is just that: people think "most implementations won't
do something silly in this case", then you find the "right" compiler
switch, the "right" compiler version, the "right" DLL linked-in, the
"right" C-lib version and some years down the road something goes horribly
wrong.

If you follow this line up to its logical end, it becomes impossible
to program in C.

A small illustration: as far as I can find, the standard says nothing
about the possibility of a stack overflow due to too many pending
calls. There is no way to check it, there is no ensured minimum,
it is not defined as undefined, nothing. I see two ways to interpret
this.


Well, I agree that a standard cannot cover absolutely everything, and
C doesn't even try. In fact everything the standard says is about the abstract machine, not a real, physical machine, as stated in "5.1.2.3 Program execution".

However they went a long way to define what is "UB(TM)" versus what "common people" call "undefined behavior" (in the sense that no-one has defined it). Can we call it "plain UB"?

So I guess the committee explicitly marked as "UB(TM)" those areas where they deemed that giving an implementation absolute freedom would leave room for foreseeable optimizations, or areas where defining a behavior would have been too burdensome for compiler makers.

So, AFAIU, the committee defined as "UB(TM)" only those things that can actually be avoided by a programmer (although sometimes with extreme care). In fact they stated that if a program contains even one instance of "UB(TM)" the program is considered erroneous, unless that "UB(TM)" has been defined by the implementation as an extension, in which case the program is declared "non-portable".

In the case of "plain UB", i.e. those cases which could wreak havoc but about which the standard is silent, then I assume they all fall under the "implementation detail" hat.

So your counterargument is right if we consider "plain UB". However, if we stick to just avoiding "UB(TM)", then it must be possible (by definition), otherwise the committee would be implicitly declaring every C program as erroneous because of this purported impossibility.

As I said, the standard terminology choice is unfortunate in that it gives an extremely precise meaning (UB(TM)) to a general term used in programming (plain UB). They could have chosen other terms, but alas we are stuck with that.

In particular, see the definition in C99 (N1256 draft):

-----------------------------------------------------
3.4.3

1 undefined behavior

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

2 NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

-----------------------------------------------------

So a case of UB(TM) is NOT necessarily plain UB (ugh!), but is a term used to flag erroneous or nonportable constructs or data.
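
As a minimal sketch of what "avoidable with care" can look like: overflow in signed integer arithmetic is UB(TM) (C99 6.5), yet a program can rule it out by testing the operands before adding. The helper name checked_add is made up for this illustration and is not taken from any library.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper, for illustration only.
 * Adds a and b without ever evaluating an overflowing signed addition.
 * Returns true and stores the sum in *out when it is representable as an
 * int; returns false and leaves *out untouched otherwise. */
bool checked_add(int a, int b, int *out)
{
    if (b > 0 && a > INT_MAX - b)
        return false;   /* a + b would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false;   /* a + b would fall below INT_MIN */
    *out = a + b;       /* the sum is known to fit, so no overflow occurs */
    return true;
}
```

The checks themselves cannot overflow: INT_MAX - b is only evaluated for positive b, and INT_MIN - b only for negative b.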


Option 1: As the standard never mentions that a function call can
go wrong due to "stack overflow" (too many pending calls), then all
function calls should work as described, no matter how many pending
calls there are in the execution. As they don't, it follows that
all compilers we know about are badly buggy.

Option 2: We accept that the number of pending function calls has some
implicit limit, and once a program crosses that limit we have some
undefined behavior. As the standard does not set a minimum for this
limit (which does not even exist, according to the standard), it can
be any value. A single call to 'printf' in helloword.c can legitimately
cause a stack overflow and therefore undefined behavior. (The standard
also offers no way to check this limit.) If we cannot accept UB, no
matter what, then we should never call any functions in our programs.
It doesn't matter that such calls always worked in all compilers
we ever used; some years down the road something can go horribly
wrong, and we have only ourselves to blame.

-- Roberto



Your example about the stack depth limit is not covered by the standard because the abstract machine doesn't even have a stack concept.
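
Since the standard gives nothing to query here, one common defensive habit is a self-imposed cap on recursion depth. The following is only a sketch under that assumption: MAX_DEPTH, struct node and tree_height are hypothetical names, and the value 200 is an arbitrary guess that would have to be tuned for each real machine.

```c
#include <stddef.h>

/* Hypothetical cap chosen for illustration. The standard supplies neither a
 * value nor a way to discover the real limit, so this is a per-target guess. */
#define MAX_DEPTH 200

struct node {
    struct node *left;
    struct node *right;
};

/* Returns the height of the subtree rooted at n, or -1 if walking it would
 * need more than MAX_DEPTH nested calls. */
static int height_at(const struct node *n, int depth)
{
    if (n == NULL)
        return 0;
    if (depth >= MAX_DEPTH)
        return -1;                      /* refuse to recurse any deeper */

    int l = height_at(n->left, depth + 1);
    if (l < 0)
        return -1;                      /* propagate the failure upwards */
    int r = height_at(n->right, depth + 1);
    if (r < 0)
        return -1;

    return 1 + (l > r ? l : r);
}

int tree_height(const struct node *root)
{
    return height_at(root, 0);
}
```

This does not make the call stack any more visible to the abstract machine; it only turns an unbounded number of pending calls into a bounded one that can be checked against what the real machine is known to tolerate.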

FWIW, the abstract machine doesn't even have the concept of different address spaces, so accessing data in different address spaces, e.g. in the flash memory of a MCU instead of its RAM, usually uses non-portable syntax that is compiler-specific.

So programming in C on a real machine requires knowledge of BOTH the abstract machine AND the real machine. The standard only requires so much from an implementation and hopefully defines every relevant aspect of the abstract machine, so that any UB(TM) can be avoided (possibly with great effort by the programmer).

Once you get rid of all UB(TM), is your program necessarily correct? No, because it could be non-erroneous from the standard's perspective, but still buggy because you didn't take into consideration the limits or the capabilities of the real machine, about which the standard doesn't give a damn. [1]

As I said, it took me literally years to grasp the "UB(TM)" meaning (and sometimes I'm still puzzled), putting together pieces of information found in lots of articles read here and there.

BTW, here's a nice article (by the renowned John Regehr and Pascal Cuoq) about the problems of detecting and getting rid of UB in C and C++ programs. Bottom line: sure it's (sometimes very) hard, but not impossible in principle.

https://blog.regehr.org/archives/1520

It ends with this:

"Unfortunately, C and C++ are mostly taught the old way, as if programming in them isn’t like walking in a minefield. Nor have the books about C and C++ caught up with the current reality. These things must change.

Good luck, everyone."


Cheers!

-- Lorenzo



[1] It is a common complaint from embedded system programmers that C doesn't allow one to define the exact sequence of some operations as they are performed when translated to machine code, which forces the use of assembly snippets in critical code paths.

For example, assuming x and y are 16 bit quantities on an 8 bit MCU,
if you write:

x = <expr1>;
y = <expr2>;

there is no way in C99 to ensure that the updating of x happens completely before the updating of y (the upper 8 bits and the lower 8 bits of each can be modified in any order, usually for optimization purposes).

If x and y are HW registers that need to be accessed in a specific order, you HAVE to use assembly. And declaring x and y volatile doesn't help. This atomic updating problem is addressed only in a later standard (C11, IIRC), where some atomic types are introduced.

Failing to get the sequencing right could bring the system to a halt or generate a HW exception (maybe depending on the timing of some external event), for example, and this is clearly a "plain UB", but absolutely not a UB(TM), since the abstract machine state is not concerned by what x and y are mapped to.
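
For reference, the atomic types of the later standard look roughly like this with C11's <stdatomic.h>. This is a sketch of the language facility only: x and y are ordinary objects standing in for the registers, the function names are made up, and whether any of this does something useful for real memory-mapped hardware on an 8-bit target is an assumption to check against the compiler's documentation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-ins for x and y. On a real MCU they would be mapped to hardware by
 * compiler-specific means, which this sketch does not attempt. */
static _Atomic uint16_t x;
static _Atomic uint16_t y;

/* Hypothetical helper: stores vx to x, then vy to y. Each store is
 * indivisible as observed by other threads, and atomic_store uses
 * sequentially consistent ordering, so the store to x is ordered before
 * the store to y. */
void update_pair(uint16_t vx, uint16_t vy)
{
    atomic_store(&x, vx);
    atomic_store(&y, vy);
}

/* Hypothetical accessors, present so the effect can be observed. */
uint16_t read_x(void) { return atomic_load(&x); }
uint16_t read_y(void) { return atomic_load(&y); }
```

These guarantees are still stated in terms of the abstract machine (threads and their view of memory), not in terms of bus cycles. C11 also provides atomic_is_lock_free to ask whether an implementation handles a given atomic object without a hidden lock.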