What Every C Programmer Should Know About Undefined Behavior

b3morales · on Sept 8, 2021

And the counterpoint, [What Every Compiler Writer Should Know About Programmers][0] (PDF only, unfortunately).

Also for further reading see much of John Regehr's blog, e.g. [Taming Undefined Behavior in LLVM][1].

[1]:https://blog.regehr.org/archives/1496/

deaddabe · on Sept 8, 2021

Nice read, I recommend all three articles in the series.

One thing that I find artificial in the examples is the use of signed integers for loops. I have always seen unsigned integers in practice. In general I avoid signed integers like the plague for storing large values and only use them for error propagation, with negative values for errors, as the libc does.

Someone · on Sept 8, 2021

I don't know whether the standard guarantees you can index into arrays with large unsigned ints (it says "integer type". I don't know whether that means it will support both without silently converting them, or whether that means an implementation can pick one).

Also, I don't know whether it is (still?) valid C, but I've used negative indexes in the past in graphics applications, where it is convenient to have an 'array' with indices that don't start at zero. One-dimensional example:

   #define N 100
   char a[2 * N + 1];
   char * b = a + N;  /* b[0] is the middle element of a */
   ...
   for(int x = -N; x <= N; ++x) b[x] = 42;

avianes · on Sept 10, 2021

There is an example of negative array indexing within standard lib implementations.

Functions like isprint(int c), which checks whether c is a printable character, are often implemented so that they tolerate inputs between -128 and 127 as well as between 0 and 255 (strictly, the standard only requires them to handle values representable as unsigned char, plus EOF).

To achieve this, implementations usually rely on a single array indexable from -128 to 255 (a lookup table).

Example from newlib: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=new...

And the array came (indirectly) from: https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=new...
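
For illustration, here is a minimal sketch of that offset-table trick. It is not newlib's actual code: the names `table_storage`, `table`, `init_table` and `my_isprint` are made up, and it assumes 8-bit chars and plain ASCII for the printable range.

```c
/* Hypothetical layout: 128 slots for the negative indices -128..-1,
   followed by 256 slots for the indices 0..255. */
enum { NEG_SLOTS = 128, TABLE_SIZE = NEG_SLOTS + 256 };

static unsigned char table_storage[TABLE_SIZE];

/* Points at the slot for index 0, so table[-128] .. table[255] are all
   inside table_storage. */
static unsigned char *const table = table_storage + NEG_SLOTS;

/* Fill the table once. A negative index is treated as the same byte
   viewed through a signed char, so -128 maps to byte 128, -1 to 255. */
void init_table(void)
{
    for (int c = -NEG_SLOTS; c <= 255; ++c) {
        int byte = c < 0 ? c + 256 : c;
        table[c] = (unsigned char)(byte >= 0x20 && byte <= 0x7e);
    }
}

/* Lookup with a possibly negative index; the caller must keep c
   within -128..255. */
int my_isprint(int c)
{
    return table[c];
}
```

The negative index is fine here because `table + c` never leaves the underlying array. Indexing outside -128..255 would still be undefined behavior.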

tpoacher · on Sept 9, 2021

omg that looks horrible! don't do that! xD

junon · on Sept 8, 2021

The benefit of using signed integers these days is that compilers can make signed integer overflow trap, which is helpful in debugging. They are allowed to do that because signed overflow is undefined behavior, whereas unsigned overflow is well defined.
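
On GCC and Clang the trapping is opt-in, for example via `-ftrapv` or `-fsanitize=signed-integer-overflow`. If you would rather detect the overflow yourself without ever executing the undefined operation, a portable sketch could look like this (`checked_add` is a made-up helper name, not a standard function):

```c
#include <limits.h>
#include <stdbool.h>

/* Stores a + b in *out and returns true when the sum fits in an int.
   Returns false, leaving *out untouched, when the addition would
   overflow. The check happens before the addition, so the overflowing
   operation is never actually performed. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```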

flqn · on Sept 8, 2021

Signed integer operations are often faster too, since the well-definedness of unsigned overflow can require additional instructions, especially if the width of the unsigned type is not the same as the architecture's word width.

deaddabe · on Sept 8, 2021

Interesting, so that would mean that `int` would be faster than `unsigned int` in tight loops because the compiler would not have to check for overflow and do the wrapping (if the upper bound is unknown so that it cannot be optimized out).
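
A minimal pair for trying this out in Compiler Explorer (the function names are made up for illustration). Both functions return the same result; any difference would only show up in the generated code, and whether it shows up at all depends on the compiler, the flags and the target:

```c
/* Index is a signed int: the compiler may assume ++i never overflows,
   because signed overflow would be undefined behavior. */
long sum_int_index(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

/* Index is an unsigned int: arithmetic on i is defined to wrap modulo
   2^N, and the compiler has to preserve that behavior wherever it
   could be observed. */
long sum_unsigned_index(const int *a, unsigned n)
{
    long s = 0;
    for (unsigned i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}
```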

I think that in Rust the unsigned types do not wrap by default, so that the compiler does not have to introduce these checks at runtime. If one wants wrapping, they have to directly use the wrap types instead which would be more instruction-heavy.

I wonder if there is a way in C to reproduce this mechanism in order to not generate any check instructions as well for unsigned types.

junon · on Sept 8, 2021

> I wonder if there is a way in C to reproduce this mechanism in order to not generate any check instructions as well for unsigned types.

Correct me if I'm wrong (I probably am, don't trust me) but simply using `unsigned` as the type gives you the most optimal unsigned integer type for the system.

dumael · on Sept 8, 2021

'signed' and 'unsigned' on their own act as shorthand for 'signed int' and 'unsigned int' in C and C++.

Note that the size of an 'int' is dependent on the "data model"[1]. Whether it's the most optimal is far too context-dependent to say.

A data model that defaults 'int's to 32 bits on today's (and yesterday's) architectures is fine in many cases as the range of that type is acceptable for most usages without excessive wastage.

Certain data models do specify that 'int' is 64 bits, which can break some programmers' assumptions and also lead to wasted space, as a struct member or a stack slot has to have 64 bits allocated for it on paper.

Data models are part of the ABI your program uses, so it's not necessarily optimal for any given system.

[1] Wikipedia has a table summarizing some of the differences: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_m...
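
If you want to see which data model your own toolchain uses, a small sketch like the following reports the widths (the helper names are made up). On LP64, typical for 64-bit Linux, you would expect 32/64/64 for int/long/pointer; on LLP64, used by 64-bit Windows, 32/32/64; an ILP64 system would report 64 for int.

```c
#include <limits.h>
#include <stddef.h>

/* Storage width in bits of int under the current ABI. */
size_t int_bits(void)
{
    return sizeof(int) * CHAR_BIT;
}

/* Storage width in bits of long under the current ABI. */
size_t long_bits(void)
{
    return sizeof(long) * CHAR_BIT;
}

/* Storage width in bits of an object pointer under the current ABI. */
size_t pointer_bits(void)
{
    return sizeof(void *) * CHAR_BIT;
}
```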

celegans25 · on Sept 8, 2021

I am not sure if this generalizes to non-64-bit platforms, but in my experience using size_t gives you unsigned integer indexing without requiring overflow checks (as it's normally a 64-bit quantity).

deaddabe · on Sept 8, 2021

This would make sense, since this type is almost always used for iterations and sizes. Anyways I have played with Compiler Explorer and could not spot the "overflow checks" inside some dumb code: https://godbolt.org/z/4znWG8GoP

Maybe this is because the x86_64 instruction set already wraps unsigned integers natively, and maybe most instruction sets do too.
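
For reference, the wrapping itself is guaranteed by the C standard: unsigned arithmetic is performed modulo 2^N, where N is the width of the type. When the type matches the register width, a plain add or subtract instruction already behaves that way, which would explain why no extra checks appear. A tiny sketch (function names made up):

```c
#include <limits.h>

/* UINT_MAX + 1 wraps around to 0; this is well defined for unsigned. */
unsigned max_plus_one(void)
{
    return UINT_MAX + 1u;
}

/* 0 - 1 wraps around to UINT_MAX. */
unsigned zero_minus_one(void)
{
    return 0u - 1u;
}
```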