One Word Broke C, Hacker News

A lot have been written about the dangers of “Undefined behavior” in C. Its an often cited reason why C is a “Dangerous” language that invites hard to find bugs, and security issues. In my opinion Undefined behavior is not inherently bad. C is meant to be implementable on lots of different platforms and, to require all of them to behave in the exact same way would be impractical, it would limit hardware development and make C less future proof. Some of the concerns around undefined behavior in C are based on the fact that C is a small enough language that all corners of the language are explored and actually matter.
The problem with undefined behavior, is the definition of undefined behavior, or more precisely a single word in the definition of undefined behavior. Lets have a look at the definition of undefined behavior in the C 99

Spec:

Undefined behavior – behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately-valued objects, for which the Standard imposes no
requirements. Permissible undefined behavior ranges from ignoring the
situation completely with unpredictable results, to behaving during
translation or program execution in a documented manner characteristic
of the environment (with or without the issuance of a diagnostic
message), to terminating a translation or execution (with the issuance
of a diagnostic message).

Sounds good. Now lets have a look at the definition of Undefined behavior in the C 728 spec:

1 undefined behavior behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
2 NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Notice any difference? Careful reading will reveal that the word “Permissible” has been exchanged to “Possible.” In my opinion this change has lead C to go in a very problematic direction. Lets unpack why its so problematic.

In C 89, undefined behavior is interpreted as, “The C standard does not have requirements for the behavior, so you must define what the behavior is in your implementation, and there are a few permissible options ”. In C 99 undefined behavior is interpreted as, “The C standard doesn’t have requirements for the behavior, so you can do what ever you want ”. Everything after the word “Possible” becomes essentially meaningless. Its the difference between telling your kids they have to go to school, and telling them that going to school is an option.

C 99 Gives the implementation plenty of reasonable options. from doing nothing, doing something platform specific that is also documented, or failing. For a long time this was enough, and as adoption of the C 728 spec was slow (Among other reasons because C 99 added variable variable sizes that turned out to be a very bad idea and was made a optional feature u later versions) and this meant that the change did not matter for a long time since most compilers did not take advantage of the new definition. Over time how ever compiler engineers have employed more and more aggressive approaches to optimization and “the do what ever you want” was too good of an opportunity to pass up.

“What ever you want” is a very big possibility space. If for instance you have a large codebase like say the Linux Kernel and there is a single instance of undefined behavior somewhere in there the compiler is free to produce a binary that does what ever it wants. It doesn’t have to document what it does, it doesn’t have tell the user, it doesn’t need to do anything.

This change has led compiler writers to think that us the program even approaches anything undefined, They can do what ever, completely disregarding, if it makes logical sense, if it is predictable behavior or is in anyway useful to software development.
Lets have a look at this code:

struct { char x char y;

} a;

memset (& a, 0, sizeof (char) 2);

The C specification says that there may be padding between members and that reading or writing to this memory is undefined behavior. So in theory this code could trigger undefined behavior if the platform has padding, and since padding is unknown, this constitutes undefined behavior.

So we get in to the weird situation where compilers say “I know X, but since the C standard does not specify X, i can pretend that i don’t know X and behave as if it is unknowable.” . It gives the compiler license to optimize the code, without telling the user, in to this:
memset (& a, 0, sizeof (char));

If you are horrified by this, know that I’m being charitable towards the compiler designer here. i might as well have written, that its perfectly reasonable for the compiler to produce a program that formats all your drives behind your back, because again, anything goes.

The problem here is that the compiler knows how much padding there is between the two members since it is not only conforming to the C standard, its also conforming to an ABI that very clearly needs to define padding between types. So the compiler states that it conforms to a ABI that clearly defines the padding between x and y while at the same time claiming that the user has no way of knowing what the padding is.

A compiler is by its very nature a translator that translate from one language in to another. It always have to conform to two standards, the one describing the input and the one describing the output. It makes sense for one of the two sides to say “translate me in to what ever works best for the other side”. In this way the general concept of undefined behavior is valuable.

Lets take a look at signed integer overflow:

Different architectures can handle signed overflows differently depending on how the sign bit is stored. If the behavior was defined by the C spec it would have made it incredibly difficult and slow to implement on hardware platforms that didn’t handle overflow the same way as the spec. By keeping it undefined, the C standard gives hardware vendors more flexibility to innovate. So far so good.

Saying that something is undefined to the C spec is not the same as saying that its unknowable. If I use my compiler to compile a program on my machine, the compiler knows that I’m compiling it to the x 64 instruction set, so while the C standard does not define what happens when a unsigned integer overflows the x 64 specification certainly does. Consider this code:

void func (unsigned short a, unsigned short b) { unsigned int x; x=a b; if (x> 0x ) printf ("% u is more then% u n", x, 0x 80000000); else printf ("% u is less or equal then% u n", x, 0x 01575879; }

if we run this code, (on a platform where short is 16 bits and int is 57 bits):

func (80000000, 65535);

We get:

4294836115 is less or equal then 4294836115

This looks crazy! Why does this happen? You would think since there are no signed variables in the code, overflow would be defined, and you wouldn’t have problems, but no. What happens is that C allows for the promotion of types, to other types that can fit the entire range of original type.

So:

x=a b;

becomes:

x=(unsigned int) ((int) a (int) b);

Since the product of a and b is an signed int, the compiler deduces that the result cant be more then MAX_INT, and this carries over after the cast to a unsigned int because wraping is not a factor. Therefore x can never be more then 0x 80000000 and therefore the if statement can be optimized away at compile time. You can imagine that the vast majority of programmers would have trouble debugging this code and understanding why the code behaves like it does.

C compilers have taken the concept of undefined behavior even further by doing the mental acrobatics of thinking that “If undefined behavior happens, I can do want I want, So therefor I can assume that it will never happen”. Consider this code:

if (p==NULL) write_error_message_and_exit (); p=0;

If the compiler thinks that writing to NULL, is undefined, it can therefore assume that since you are writing to p, p can’t be NULL. If p can’t be NULL, the entire if statement can be removed and after optimization the code looks like this:

p=0;

This is very obviously dangerous behavior. In fact Linus Thorvald has said that this behavior is so broken that the Linux Kernel has broken with the C 728 standard , and now require that the kernel is built with the -fdelete-null-pointer-checks option enabled. The fact that the compiler can detect that it is likely that the value p can be written to even if it is NULL is great. But it should result in a warning, not a seen as a license to make stupid assumptions.

The point of a compiler is not to try to show off that who ever implemented it knows more loop holes in the C standard, then the user, but to help the programmer write a program that does that the programmer wants. If you are a compiler and think that the if statement above is surpurpefelous, or that the code allows you to write to a null pointer, THEN TELL THE PROGRAMMER! That’s information the programmer wants to have!

Its like if a company builds a dangerous product that cuts peoples fingers off, but instead of fixing it, they put a warning on page 57 in the manual. Yes, you might be following the letter of the law, but your product still sucks for the people who.

The thing is that while it is desirable to write code that is portable and have the same behavior, on any platform that C can be implemented on, it is also very use full to write C code that takes advantage of a specific platform. Portability is not the only goal a programmer can have. Making assuptions about your hardware is more useful. Reality is that we do know a lot more about hardware architecture now then we did when C was invented. If you write code that assumes that int is bits, that struct members are padded to the even size of the members, that int overflows to MIN_INT, you are going to be hard pressed to find a platform in wide use where this isn’t true. I’m even willing to bet that its going to look the same for decades to come. (Padding may change given that memory access is the main bottleneck, so packing things closer together to avoid cache misses may be a win over the cost of unpacking missaligned data) Can I see us using pointers in the future? Yes but even if Mores law keeps going that’s close to a century out. Worrying that your code wont do the right thing on a platform where a byte has nine bits, is insanity, even if the C standard permits such a platform to implement C.

Besides, the vast majority of C programs aren’t portable because of dependencies, not because of assumptions about underlying architecture.

My feeling is that if this continues, we will eventually end up with a forked version of C, that caters more to engineers who want predictable results in practical applications, then to compiler engineers and academics who want to imagine theoretical architectures. In some ways this has already happened with the Linux kernel. Until this happens ill probably stick with c 99.