2015-09-20
7 Scandalous Weird Old Things About The C Preprocessor
Introduction
A scandal is defined as something involving questionable moral principles that causes public outrage. This would make the word 'scandal' an excellent qualifier for describing the behaviour of the C preprocessor.
I became aware of many of these features of the C preprocessor while working on my C compiler. The preprocessor is not 100% complete, but it supports recursive function macros. Specifically, it is currently able to successfully preprocess a third-party implementation of a Brainfuck interpreter that was written in the C preprocessor. You can try out the demo for yourself.
Here are 7 scandalous things I discovered about the C preprocessor when working on my compiler:
1) No Comprehensive Standard
The C99 standard is about 500 pages, but only 19 of them are dedicated to describing how the C preprocessor should work. Most of the spec consists of high-level qualitative descriptions that are intended to leave lots of freedom to the compiler implementor. It is likely that the vagueness in the specification was intentional, so that it would not cause existing mainstream compilers (and code) to become non-conforming. Design freedom is good for allowing people to create novel optimizations too, but too much freedom can lead to competing interpretations.
Bjarne Stroustrup even points out that the standard isn't clear about what should happen in function macro recursion[1]. With reference to a specific example he says "The question is whether the use of NIL in the last line of this sequence qualifies for non-replacement under the cited text. If it does, the result will be NIL(42). If it does not, the result will be simply 42.". In 2004, a decision was made to leave the standard in its ambiguous state: "The committee's decision was that no realistic programs "in the wild" would venture into this area, and trying to reduce the uncertainties is not worth the risk of changing conformance status of implementations or programs."
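For reference, the example in question looks roughly like this (reconstructed from the description in the quote, so treat the exact spelling as an approximation):
#define NIL(xxx) xxx
#define G_0(arg) NIL(G_1)(arg)
#define G_1(arg) NIL(arg)
G_0(42) /* An implementation may arguably produce either NIL(42) or 42 */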
2) Context Free, Just Kidding!
At first glance the C preprocessor looks like you could use context free grammars to describe the language:
#define foo(x) x
foo(foo(foo(a)))
But this quickly falls apart when you consider '#' directives:
#include "somefile.h" /* This works */
(#include "somefile.h") /* This does not */
Preprocessor directives are very line sensitive. The first non-whitespace, non-comment character of a directive must be a '#' character. The directive always ends at the next newline (or end of file). It is possible to write a preprocessor directive that spans multiple lines, but you must use a line continuator to do so. This works because line continuators are processed before include directives are considered:
#\
i\
n\
c\
l\
ude <stdi\
o\
.\
h\
>
Many languages are first tokenized, and then the list of tokens doesn't change throughout further processing of the program. In the C preprocessor, new tokens can be created during preprocessing itself! This makes it impossible to build a parse tree ahead of time, because you don't know which tokens will end up in the final tree. For example:
#define function() 123
#define concat(a,b) a ## b
concat(func,tion)()
This will preprocess to:
123
What happened here is that
concat(func,tion)()
was replaced by
function()
which gives us the '123'.
In addition, individual tokens can be passed as parameters and later used to construct different 'code':
#define boo() 123
#define foo(y) boo y )
#define open (
foo(open)
This will preprocess to:
123
What happened here is that
foo(open)
is defined as
boo y )
where y is whatever parameter was passed (which happens to be a '(' character), so you get
boo ( )
which is just a regular function macro call that will give '123'.
3) Whitespace Insensitive, Just Kidding!
In the example below, the extra whitespace completely changes the meaning of the definition of function_mac:
/* This is a function macro */
#define function_mac() something
/* This is actually an object macro with value '() something' */
#define function_mac () something
But later, when we want to invoke function_mac the existence of whitespace doesn't matter at all:
/* Extra whitespace does not matter for function macro invocation. */
function_mac ()
I also encountered whitespace inconsistencies in gcc's preprocessor (and to a lesser degree clang's) that I was never fully able to reverse-engineer. For example, here is a case where gcc deletes the whitespace between tokens in a macro's definition. As far as I can tell, gcc only deletes this whitespace when the identifier being passed in names a function macro, and in that case it also seems to use the whitespace that was passed in as part of the argument. This seems to contradict the gcc documentation[1], which says "Leading and trailing whitespace in each argument is dropped, and all whitespace between the tokens of an argument is reduced to a single space."
#define stringify_indirect(x) #x
#define stringify(x) stringify_indirect(x)
#define put_side_by_side(x,y,z) x y z
#define a(x) x
int main(void){
printf(stringify(put_side_by_side(a, a,a)));
printf(stringify(put_side_by_side(a,a,a)));
printf(stringify(put_side_by_side(b, b,b)));
printf(stringify(put_side_by_side(b,b,b)));
}
$ gcc -E main.c
int main(void){
printf("a aa");
printf("aaa");
printf("b b b");
printf("b b b");
}
$ clang -E main.c
int main(void){
printf("a a a");
printf("a a a");
printf("b b b");
printf("b b b");
}
4) Function Argument Pre-expansion
Function macros in the C preprocessor are unlike functions in C, or in most other languages you might encounter. One of the striking differences is that you can't reason about the result of a function macro by simply evaluating an inner macro's result and then substituting that result into the outer call (like you could in C).
In general, when evaluating a function macro body, you need to consider both the pre-expanded version of the arguments, and the untouched tokens that were passed for that argument. This behaviour is unlike how C arguments and functions are evaluated, because in C you can always replace an argument that's described by an expression with the result of that expression and have the same meaning (ignoring any side effects).
For example, the output of
#define boo() 123
#define foo(x) x #x
foo(boo())
is
123 "boo()"
If we first evaluated 'boo()' and substituted it like this:
#define boo() 123
#define foo(x) x #x
foo(123)
we would get
123 "123"
which is wrong. This is also important when considering the token concatenation operator (##).
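The operands of '##' are likewise taken from the unexpanded argument tokens. Here is a minimal sketch of my own (the macro names are made up for illustration):
#define A 1
#define B 2
#define AB 99
#define concat2(x,y) x ## y
concat2(A, B)
This pastes the unexpanded tokens 'A' and 'B' to form 'AB', which then expands to '99'. If the arguments were pre-expanded first, the result would have been '12' instead.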
Keeping track of these two different contexts for every function macro call can get difficult to follow mentally, since the C preprocessor evaluates from the outside in during the argument pre-expansion phase, then from the inside out as function arguments are substituted, all the while referencing the non-pre-expanded arguments for stringification and token concatenation.
5) Function Macros And Recursion
The main source of complexity in the C preprocessor comes from function macros, specifically when dealing with recursion. The C preprocessor does not allow unbounded recursion, so once we've encountered a given function macro invocation for the second time within a recursion, those tokens are 'disabled' from future expansion. This extra rule for disabling macros and tokens adds even more complexity to the recursion.
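As a minimal illustration of this bounded recursion (a sketch of my own, not part of the test case below):
#define f(x) x + f(x)
f(1)
This preprocesses to '1 + f(1)'. While the replacement of 'f' is being rescanned, the macro 'f' is disabled, so the nested 'f(1)' is left alone instead of expanding forever.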
The following example is a minimal test case for a bug I found in my preprocessor involving macro disabling. This example should illustrate how difficult it can be to reason about function macros in practice. I've skipped many of the most detailed steps like all of the macro disables, re-enables, and argument pre-expansion phases for trivial function macro calls.
#define recur4(C, T, E) C-T-E
#define recur3(X) [ X ]
#define recur2(C, X) recur4(C(X), recur4(C(X), ,),) |C|
#define recur1(F, X) F(recur3, X)
recur1(recur2, recur1(recur2, 1))
Let's go through the expansion of this step by step:
So we want to evaluate:
(1) recur1(recur2, recur1(recur2, 1))
The first thing we do is collect the tokens that are inside of recur1(...). This gives us:
(2) recur2, recur1(recur2, 1)
Next we perform argument pre-expansion on these tokens to evaluate any macros that are inside of the function call. In general, we need to take note of exactly which tokens we pre-expand, because the function macro body can make use of the fully argument pre-expanded version, or it can make use of the literal tokens that were passed in. For example, the stringify operator (#) will create a string literal of the tokens that were passed as a function macro argument, without pre-expanding them first.
When evaluating
(2) recur2, recur1(recur2, 1)
for macros, we can see that recur2 represents the name of a function macro. Since the token 'recur2' is not followed by a parenthesis, this does not represent a call to a function macro, so we just leave this token alone. Next we can consider
(3) recur1(recur2, 1)
which is another function macro call.
Inspecting the contents of this function macro invocation, we see that it contains these tokens:
recur2, 1
Since 'recur2' is an identifier for a function macro (and not an object macro), the fully argument pre-expanded arguments for (3) are simply:
(4) recur2 /* For arg 1 */
(5) 1 /* For arg 2 */
We can now substitute (4) and (5) into the definition of recur1, as it was called in (3). Also note that because we were able to finish the argument pre-scan for (3), we now disable the macro 'recur1' as we evaluate it. This gives us
(6) recur2(recur3, 1) *
* 'recur1' currently disabled. When the preceding token is consumed, we can re-enable this macro.
This again gives us something that needs to be macro-expanded. The pre-expanded arguments in this case are
(7) recur3 /* For arg 1 */
(8) 1 /* For arg 2 */
Substituting these into the definition of recur2 gives
(9) recur4(recur3(1), recur4(recur3(1), ,),) |recur3| * **
* 'recur1' currently disabled. When the preceding token is consumed, we can re-enable this macro.
** 'recur2' currently disabled. When the preceding token is consumed, we can re-enable this macro.
I'll talk more about the above disablings later. Searching for macros to expand in the result requires that we expand
(10) recur4(recur3(1), recur4(recur3(1), ,),)
Therefore, we must argument pre-expand
(11) recur3(1), recur4(recur3(1), ,),
Starting with the first argument in (11), the result will be
(12) [ 1 ]
The second argument in (11) requires the evaluation of
(13) recur4(recur3(1), ,)
Performing argument pre-expansion of the first argument of (13) gives
(14) [ 1 ]
Therefore, the pre-expanded arguments to the function macro call in (13) are
(14) [ 1 ] /* First arg to (13) */
(15) /* Second arg to (13) */
(16) /* Third arg to (13) */
Using the definition of 'recur4' we can evaluate (13):
(17) [ 1 ]- -
We can now state the entire pre-expanded argument list from (11) using (12) and (17):
(18) [ 1 ],[ 1 ]- -,
Therefore, the pre-expanded arguments for the function macro invocation in (10) are as follows:
(19) [ 1 ] /* First arg */
(20) [ 1 ]- - /* Second arg */
(21) /* Third arg */
Substituting (19), (20) and (21) into the definition of 'recur4' gives:
(22) [ 1 ]-[ 1 ]- - -
which is the result of (10). Substituting (10) into (9) gives:
(23) [ 1 ]-[ 1 ]- - - |recur3|
(23) is also the fully macro-expanded version of (6), which can be substituted into (2):
(24) recur2, [ 1 ]-[ 1 ]- - - |recur3|
We're halfway there now that we've fully performed the argument pre-scan of (1) and we're working with the equivalent of:
(25) recur1(recur2, [ 1 ]-[ 1 ]- - - |recur3|)
Using the definition of 'recur1' we end up with:
(26) recur2(recur3, [ 1 ]-[ 1 ]- - - |recur3|)
Using the definition of 'recur2' we end up with:
(27) recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), ,),) |recur3|
Now we re-scan the replacement for macros we can evaluate, and we find only this one:
(28) recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), ,),)
We need to apply our good old friend argument pre-expansion to the tokens inside 'recur4':
(29) recur3([ 1 ]-[ 1 ]- - - |recur3|), recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), ,),
Looking at (29) we see that there are 3 arguments to evaluate to get the result of (28):
(30) recur3([ 1 ]-[ 1 ]- - - |recur3|) /* Arg 1 */
(31) recur4(recur3([ 1 ]-[ 1 ]- - - |recur3|), ,) /* Arg 2 */
(32) /* Arg 3 */
Starting with (30) we get:
(33) [ [ 1 ]-[ 1 ]- - - |recur3| ]
Now evaluating (31), we must perform argument pre-expansion on these 3 arguments
(34) recur3([ 1 ]-[ 1 ]- - - |recur3|) /* Arg 1 */
(35) /* Arg 2 */
(36) /* Arg 3 */
Evaluating (34) gives
(37) [ [ 1 ]-[ 1 ]- - - |recur3| ]
Since (35) and (36) are empty we can now evaluate (31) using the definition of 'recur4' to the following:
(38) [ [ 1 ]-[ 1 ]- - - |recur3| ]- -
Since (38) is the evaluated form of (31), we can now evaluate (28) using the definition of 'recur4':
(39) [ [ 1 ]-[ 1 ]- - - |recur3| ]-[ [ 1 ]-[ 1 ]- - - |recur3| ]- - -
Since (28) was the only macro inside of (27), we can substitute the evaluated form of (28) into (27):
(40) [ [ 1 ]-[ 1 ]- - - |recur3| ]-[ [ 1 ]-[ 1 ]- - - |recur3| ]- - - |recur3|
(40) is the evaluated form of (27), which is the evaluated form of (26), which is the evaluated form of (25), which is the evaluated form of (1). Therefore, the macro expansion of
recur1(recur2, recur1(recur2, 1))
is
[ [ 1 ]-[ 1 ]- - - |recur3| ]-[ [ 1 ]-[ 1 ]- - - |recur3| ]- - - |recur3|
but that was probably obvious.
The bug in my preprocessor for this test case turned out to be in the evaluation of the argument pre-scanned tokens that were passed to a function macro invocation. If an argument contained a token with the same identifier as the function macro being invoked, but that token didn't actually call the macro, my preprocessor would incorrectly disable that token. Combined with another bug where tokens were copied by reference, disabling one token could disable it in other places, so the outer, valid macro invocation's token would be disabled and could never expand again. The bug showed up around step (34).
I should point out that in the above example, I showed evaluation of inner macros and then 'substituted' them into the outer function calls, but this is only correct for this specific example. In general, when evaluating a function macro body, you need to consider both the pre-expanded version of the arguments, and the untouched tokens that were passed for that argument. This is explained above in more detail in the section on argument pre-expansion.
One final note: in the above example, I made a couple of references to macros being 'disabled'. In fact, there were many cases where macros were disabled and re-enabled, but I didn't show this. When we finish the argument pre-expansion phase for evaluating a function macro, we then disable that macro. This is done to prevent unbounded recursion: if we end up recursively calling the same function macro again, we will see that it is disabled. If we observe a token corresponding to a function macro that is disabled, and then try to use that token to call that function macro, we must disable that token. (Note that disabling tokens and disabling macros are two different things.)
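Here is a small sketch of my own (separate from the test case above) showing both kinds of disabling at work:
#define f(x) g(x)
#define g(x) f(x) end
f(start)
This preprocesses to 'f(start) end'. While g's replacement is being rescanned, the macro 'f' is still disabled (we are inside f's own expansion), so the resulting 'f' token is itself disabled and will never be expanded again, even though the macro 'f' is re-enabled once the whole expansion finishes.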
6) Digraphs and Trigraphs
Digraphs and trigraphs offer an alternative way to type in certain characters like '{', '}', '[', etc. There are a few esoteric reasons why you would want to do this (like working in a restricted character set, or with a keyboard that doesn't have these keys). Digraphs and trigraphs rarely show up in practice, but compilers usually do support them.
For example, you could re-write this program
int main(void){
int arr[1] = {1};
return arr[0];
}
using digraphs
int main(void)<%
int arr<:1:> = <%1%>;
return arr<:0:>;
%>
or trigraphs
int main(void)??<
int arr??(1??) = ??<1??>;
return arr??(0??);
??>
and they all mean the same thing. In gcc you'll need to use the '-trigraphs' flag to compile the last example.
Trigraphs are a seldom used feature today, and they have been proposed for removal from C++17[1]. Their presence can lead to bugs, since trigraph replacement is the very first preprocessing step done on a file (even before line continuators are considered)[1]. For example, a trigraph can alter the meaning of a comment in a way that affects the program's execution:
#include <stdio.h>
int main(void){
// Print some stuff??/
printf("abc\n");
printf("def\n");
return 0;
}
compiling and running using gcc -trigraphs produces:
$ gcc main.c -trigraphs && ./a.out
def
This is because the trigraph replacement results in
#include <stdio.h>
int main(void){
// Print some stuff\
printf("abc\n");
printf("def\n");
return 0;
}
where the comment ends in a line continuator '\', which gets further preprocessed to
#include <stdio.h>
int main(void){
// Print some stuff printf("abc\n");
printf("def\n");
return 0;
}
so only one print statement actually gets seen by the compiler.
7) '#' Directive Syntax
As mentioned in section 2, preprocessing directives are line sensitive, in that each preprocessing directive consists of a single (logical) line. This behaviour is unlike any other use of the C preprocessor (function macros can be invoked across multiple lines) or any syntax found in C (although individual tokens in C may not be split across multiple lines).
The syntax of the various '#' directives like 'include', 'define', and 'ifdef' that you've seen in practice is likely fairly straightforward. For the 'include' directive, you've pretty much only got two choices. Something like this for implementation-dependent include searches:
#include <something.h>
or something like this for relative includes:
#include "something.h"
Looks like it would be pretty easy to parse, right? Well, let me destroy your world with this example:
#define foo <stdio.h>
/* foo? */ # /* abc */ include /* lolololol */ foo
int main(void){
printf("Using stdio.h\n");
return 0;
}
This compiles just fine, with no warnings, using 'clang -Weverything main.c'.
If you look at some of the early submissions to the Obfuscated C Contest[1], it appears that the compilers of the time even allowed you to do things like this:
#define d define
#d defined_macro_1 123
#d defined_macro_2 123
This behaviour has been disallowed by the standard for some time now, as is stated in C89 Section 3.8.3.4 Rescanning and further replacement:
"The resulting completely macro-replaced preprocessing token sequence is not processed as a preprocessing directive even if it resembles one."
Conclusion
Hopefully, you've learned something new from this article. If you found it interesting, you might want to check out the demo of my C preprocessor.