Context free grammars (CFGs) are a valuable theoretical tool on which the modern compilation theory relies for parsing the code of programming languages. For example, the most popular tool used for parsing – YACC, generates parsers for CFGs. What most people don’t know1 is that the vast majority of programming languages have grammars that are not context free.
C is a very good example, because it is one of the most popular languages in use and because its grammar is so almost context free that it serves as a good model to demonstrate what I’m talking about.
Now, a CFG has several definitions in relation to formal languages and programming languages. I don’t want to delve too deep into the nomenclature here, but here is a discussion by a bunch of clever guys picking nits off this matter. When I say that the grammar of C is not a CFG, what I mean is that a grammar given to YACC[2] is not enough to parse C correctly, without referring to some context information that comes from elsewhere. It’s time for some examples.
Consider this code:{ T (x); ... }
Believe it or not, but given that T
is a type, this is actually a valid declaration of x
of the type T
in C. However, if T
is not a known type, this is a call to the function T
with the argument x
. How can the C parser know which way to parse without knowing whether T
was previously defined by a typedef
?
I can hear you say “but this is contrived, who ever writes code like that ?”. OK, something more standard:
{ T * x; ... }
What is this, a declaration of x
as a pointer to T
, or a void multiplication of the variables T
and x
? There is no way to know without having the table of types defined by typedef
in the memory, and parsers aren’t built to do that – this is context sensitive information.
Here’s another example:
func((T) * x);
If T
is a type, the result of dereferencing x
is cast to T
and passed to func
. If T
isn’t a type, the multiplication of T
and x
is passed to func
.
In all these examples, the parser would be lost without having some information gathered on the code before the problematic statement is reached. Therefore, C cannot be parsed with a YACC grammar without mixing in some context sensitive information. This actually has a name in the compilation / C community – the “typedef-name: identifier” problem. Even K&R23 has something to say about it, when presenting the grammar for C in the appendix:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
So, as you see, C is very close to having a CFG, but isn’t quite there. Fortunately, this problem is very simple to fix. All that’s needed is keeping a symbol table of types defined by typedef
as the parsing goes. Whenever a new identifier is recognized in the lexer, it checks if this identifier is a defined type, and returns the correct token to the parser. As far as the parser is concerned, it has two distinct terminals – an identifier and a defined type. All that’s left is updating the symbol table whenever a successful parse of a typedef statement completes. To show better how this works, I’ll show the relevant portions of the C parser and lexer from c2c’s code. Here is a portion of the Lex file:
identifier ([a-zA-Z_][0-9a-zA-Z_]*) <INITIAL,C>{identifier} { GetCoord(&yylval.tok); yylval.n = MakeIdCoord(UniqueString(yytext), yylval.tok); if (IsAType(yylval.n->u.id.text)) RETURN_TOKEN(TYPEDEFname); else RETURN_TOKEN(IDENTIFIER); }
Without getting too much into the syntax of Lex here, what this basically says is that whenever an identifier is found, it is tested for being a type. If it is, the TYPEDEFname
token is returned. Otherwise, IDENTIFIER
is returned. For the Yacc grammar, these two are separate terminals.
1 To be more precise, “most people” don’t even care for things like this. By people I here refer to those who are interested in programming and computer science.
2 YACC only accepts CFGs, since in every production rule V -> w
, V
is a single nonterminal symbol.
3 “The ANSI C programming language, 2nd edition” by Kernighan and Ritchie