Pointers Are More Abstract Than You Might Expect in C
A pointer references a location in memory and dereferencing a pointer refers to the lookup of the value of the memory location the pointer references.
The value of a pointer is a memory address.
The C standard does not define the representation of a memory address.
This is crucial since not every architecture makes use of the same memory addressing paradigm.
Most modern architectures make use of a linear address space or something similar.
Still, even this is not precise enough since you might want to talk about physical or virtual addresses.
Some architectures make even use of non-numeric addresses.
For example, the Symbolics Lisp Machine makes use of tuples of the form (object, offset)
as addresses.
The representation of a pointer is not defined by the C standard. However, operations involving pointers are defined—at least more or less. In the following we will have a look at these operations and how they are defined. Lets start with an introductory example:
#include <stdio.h>
int main(void) {
int a, b;
int *p = &a;
int *q = &b + 1;
printf("%p %p %d\n", (void *)p, (void *)q, p == q);
return 0;
}
If compiled with , then a run of the program on a x86-64 Linux system prints:
0x7fff4a35b19c 0x7fff4a35b19c 0
Note that the pointers p
and q
point to the same memory address.
Still the expression p == q
evaluates to false
which is very surprising at first.
Wouldn’t one expect that if two pointers point to the same memory address, then they should compare equal?
The C standard defines the behavior for comparing two pointers for equality as follows:
C11 § 6.5.9 paragraph 6
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
The first question which probably comes up is: What is an object
?
Since we consider the language C it has certainly nothing to do with objects as known from object oriented programming languages like C++.
The C standard defines an object rather informally as:
C11 § 3.15
object
region of data storage in the execution environment, the contents of which can represent values
NOTE When referenced, an object may be interpreted as having a particular type; see 6.3.2.1.
Lets be nit-picky. A 16 bit integer variable in memory is a data storage and can represent 16 bit integer values. Therefore it is an object. Should two pointers compare equal if the first pointer points to the first byte of the integer and the second pointer to the second byte of the integer? Of course this is not what the language committee intended. But at that point we should note that the language is not formally defined and we have to start guessing what the intention of the language committee was.
When the Compiler Gets Into Your Way
Lets get back to our introductory example.
Pointer p
is derived from object a
and pointer q
is derived from object b
.
The latter involves pointer arithmetics and this is defined for the operators plus and minus as follows:
C11 § 6.5.6 paragraph 7
For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.
Since every pointer which points to a non-array object is virtually lifted to a pointer of type array of length one, the C standard only defines pointer arithmetics for pointers of array types which is finally given in paragraph 8. The interesting part for our case is:
C11 § 6.5.6 paragraph 8
That means, the expression &b + 1
should evaluate to an address without any problem.
Hence p
and q
should be valid pointers.
Recap what the C standard defines for comparing two pointers: Two pointers compare equal if and only if […] one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space
(C11 § 6.5.9 paragraph 6).
This is exactly the case in our example.
Pointer q
points one past the end of object b
which immediately follows object a
to which p
points.
Is this a bug in GCC?
The finding has been reported in 2014 as bug #61502 and so far the GCC people argue that this is not a bug and therefore won’t fix it.
The Linux people ran into a similar problem in 2016. Consider the following code:
extern int _start[];
extern int _end[];
void foo(void) {
for (int *i = _start; i != _end; ++i) { /* ... */ }
}
The symbols _start
and _end
are used to span a memory region.
Since the symbols are externalized, the compiler does not know where the arrays are actually allocated in memory.
Therefore, the compiler must be conservative at this point and assume that they may be allocated next to each other in the address space.
Unfortunately GCC compiled the loop condition into the constant true rendering the loop into an endless loop as described in this LKML post where they make use of a similar code snippet.
It looks like that GCC changed its behavior according to this problem.
At least I couldn’t reconstruct the behavior with GCC version 7.3.1 on x86_64 Linux.
Defect Report #260 to the Rescue?
Defect report #260 may apply in our case. The topic of the report is more about indeterminate values, however, there is one interesting response from the committee:
Implementations […] may also treat pointers based on different origins as distinct even though they are bitwise identical.
If we take this literally, then it is sound that p == q
evaluates to false, since p
and q
are derived from distinct objects that are in no relation to each other.
It looks like we are getting closer and closer to the truth, or do we?
So far we only considered operators for equality but what about relational operators?
Relational Operators to the Final Rescue?
An interesting point is made while defining the semantics of the relational operators <
, <=
, >
, and >=
, in order to compare pointers:
C11 § 6.5.8 paragraph 5
When two pointers are compared, the result depends on the relative locations in the
address space of the objects pointed to. If two pointers to object types both point to the
same object, or both point one past the last element of the same array object, they
compare equal. If the objects pointed to are members of the same aggregate object,
pointers to structure members declared later compare greater than pointers to members
declared earlier in the structure, and pointers to array elements with larger subscript
values compare greater than pointers to elements of the same array with lower subscript
values. All pointers to members of the same union object compare equal. If the
expression P
points to an element of an array object and the expression Q
points to the
last element of the same array object, the pointer expression Q+1
compares greater than
P
. In all other cases, the behavior is undefined.
According to this definition comparing pointers is only defined behavior if the pointers are derived from the same object. Lets demonstrate the idea of this by two examples.
int *p = malloc(64 * sizeof(int));
int *q = malloc(64 * sizeof(int));
if (p < q) // undefined behavior
foo();
In this example the pointers p
and q
point into two different objects which are not related to each other.
Hence comparing them is undefined behavior.
Whereas in the following example
int *p = malloc(64 * sizeof(int));
int *q = p + 42;
if (p < q)
foo();
the pointer p
and q
point into the same object and are therefore related.
Hence it is sound to compare them—assuming that malloc
does not return null.
Wrap-up
With respect to pointer comparison the C11 standard is not stringent. The most problematic part we stumbled across was in § 6.5.9 paragraph 6 where it is explicitly allowed to compare two pointers from two different array objects. This contradicts to the statement made in defect report #260. However, the overall topic in DR#260 is about indeterminate values. Therefore, I do not feel comfortable to build up may chain of arguments just of this single sentence and interpret it in a somewhat different context. If we lookup the definition of the relational operators w.r.t. pointer comparison, we observe a slightly different wording than for the equality operators. There, the standard only defines them if both operands are derived from the very same object.
If we step back from the standard and ask our self does it make sense to compare two pointers which are derived from two completely unrelated objects?
The answer is probably always no.
The introductory example is rather an academic problem.
Since the variables a
and b
have automatic storage duration you typically do not want to make assumptions about their memory arrangement.
It might work out in this or that case, but such code is definitely non-portable and you never ever can say what the meaning of the program is prior you compile and run/disassemble the code—which is against any meaningful programming paradigm.
Still the overall feeling regarding the wording of the C11 standard is unsatisfactory and since several people already stumbled across this problem, the question which is left is: Why not make the wording more precise?
Addendum
Pointers One Past the Last Element of an Array
If we lookup the C11 standard and read about pointer arithmetics and comparison we find exceptions for pointers which point one past the last element of an array all over the place. Assume it wouldn’t be allowed to compare two pointers derived from the same array object where at least one pointer points one element past of the array, then code like this
const int num = 64;
int x[num];
for (int *i = x; i < &x[num]; ++i) { /* ... */ }
would not work.
Via the loop we iterate over the array x
consisting of 64 elements, i.e., the loop body should be evaluated exactly 64 times.
However, the loop condition gets evaluated 65 times—once more than we have array elements.
In the first 64 evaluations, the pointer i
always points into the array x
whereas the expression &x[num]
always points one element past the array.
In the 65th iteration the pointer i
also points one element past the array x
rendering the condition of the loop false.
This is a convenient way to iterate over an array which makes the exception for arrays feasible.
Note, the standard only defines the behavior of comparing such pointer—dereferencing pointer is another topic.
Can we change the example such that no pointer points one past the last element of array x
?
Well, the solution to that is not straight forward.
We have to change the loop condition and also make sure that at the end of the loop we do not increment i
anymore.
const int num = 64;
int x[num];
for (int *i = x; i <= &x[num-1]; ++i) {
/* ... */
if (i == &x[num-1]) break;
}
This code is rather cluttered with technical details which we do not want to deal with and which distracts us from the actual job we want to accomplish. Despite that it also contains one additional branch inside the loop body. Hence, I think it is reasonable to have exceptions for pointers one past the last element of an array.