(cache)The context sensitivity of C's grammar

Context free grammars (CFGs) are a valuable theoretical tool on which the modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC, generates parsers for CFGs. What most people don’t know¹ is that the vast majority of programming languages have grammars that are not context free.

C is a very good example, because it is one of the most popular languages in use and because its grammar is so almost context free that it serves as a good model to demonstrate what I’m talking about.

Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.

Consider this code:

{
  T (x);
  ...
}

Believe it or not, but given that T is a type, this is actually a valid declaration of x of the type T in C. However, if T is not a known type, this is a call to the function T with the argument x. How can the C parser know which way to parse without knowing whether T was previously defined by a typedef ?

I can hear you say “but this is contrived, who ever writes code like that ?”. OK, something more standard:

{
  T * x;
  ...
}

What is this, a declaration of x as a pointer to T, or a void multiplication of the variables T and x ? There is no way to know without having the table of types defined by typedef in the memory, and parsers aren’t built to do that – this is context sensitive information.

Here’s another example:

func((T) * x);

If T is a type, the result of dereferencing x is cast to T and passed to func. If T isn’t a type, the multiplication of T and x is passed to func.

In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar without mixing in some context sensitive information. This actually has a name in the compilation / C community – the “typedef-name: identifier” problem. Even K&R2³ has something to say about it, when presenting the grammar for C in the appendix:

With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.

So, as you see, C is very close to having a CFG, but isn’t quite there. Fortunately, this problem is very simple to fix. All that’s needed is keeping a symbol table of types defined by typedef as the parsing goes. Whenever a new identifier is recognized in the lexer, it checks if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that’s left is updating the symbol table whenever a successful parse of a typedef statement completes. To show better how this works, I’ll show the relevant portions of the C parser and lexer from c2c’s code. Here is a portion of the Lex file:

identifier ([a-zA-Z_][0-9a-zA-Z_]*)

<INITIAL,C>{identifier} 
  { 
    GetCoord(&yylval.tok);  
    yylval.n = MakeIdCoord(UniqueString(yytext), 
                           yylval.tok);
    if (IsAType(yylval.n->u.id.text))
      RETURN_TOKEN(TYPEDEFname);
    else 
      RETURN_TOKEN(IDENTIFIER); 
  }

Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the TYPEDEFname token is returned. Otherwise, IDENTIFIER is returned. For the Yacc grammar, these two are separate terminals.

¹ To be more precise, “most people” don’t even care for things like this. By people I here refer to those who are interested in programming and computer science.

² YACC only accepts CFGs, since in every production rule V -> w, V is a single nonterminal symbol.

³ “The ANSI C programming language, 2nd edition” by Kernighan and Ritchie