Introduction
You can find the GitHub repository for this project here. All the source code will be updated as we go on with the project, so feel free to go back through the commit history to grab the revision you wish.
This project is aimed at people with not too much experience in creating programming languages. I will try to cover as much ground as possible, but if you find anything you would like improved/changed, or explained in more detail, please send me an email at nevetia.vedant@gmail.com. I have to point out before we begin that I myself am not an expert at programming languages. I find that it can be a mammoth task searching for a good resource to learn the basics of programming languages, and that books are (at least for me) an overwhelming resource for someone with close to zero experience. My aim with these set of posts is to try and provide a good foundation for people to begin building their own programming languages. I am also learning a lot in the process, so if anyone with more experience than me spots any mistakes, please do let me know via email, or better yet, send me a PR on GitHub with the change. I will cover the core basics before we begin writing any code, so please bare with me.
I don’t expect you to understand all the concepts below at once, so please take your time and re-read the information provided until you get a good idea of the concept. Also, Google is your friend. If my attempt at explaining to concept was still lacking and you’re still having problems understanding the concept, please do Google said concept and look for a decent explanation. I’ve found Wikipedia to be a good resource too.
Prerequisites
I expect you to have a computer with Linux/Mac installed. Windows would work too but can be harder to set up. That said, Linux is a free distribution, and is handy if you’re into a lot of programming, so feel free go grab a copy and install it on a virtual machine or a fresh partition if you’re feeling up to it.
You will need a compiler to compile the C code that we write. On Mac, downloading the Command Line Tools gives you access to Clang/GCC and their C++ variants. We’re going to be using Clang. On Linux, considering apt is your package manager, run sudo apt-get install clang to get the Clang compiler installed.
Next up, you need a decent text editor to use. I personally prefer a combination of Emacs and TextMate (on Mac). You can use one of your choice. I don’t recommend an IDE, for various reasons. If you’re on Linux, you can use Emacs, or Sublime Text if you don’t find yourself comfortable using Emacs.
Let’s get on with some actual learning now…
The Basics
What is a programming language?
A programming language is actually just a set of rules to manipulate a computer. Really, that’s all it is. Its a bunch of words and symbols in a given sequence which is defined by you, that makes the computer perform the given instruction. However, programming languages can be interpreted by a computer to give meaningful output. This interpretation is done by a compiler. A compiler is a program that takes this file that contains the instructions you provided and tries to get the computer to understand it. Now, you probably already know that a computer only understands 0’s and 1’s, then how will it understand this human-readable gibberish we just fed it? Again, a compiler takes these instructions and tries to convert it into the closest form of those 0’s and 1’s. In the end, everything is like a mathematical calculation.
When you write C code and call the compiler (GCC/Clang) on it, it takes this code and produces an object file, also known as machine code. If you’ve ever opened up this object file, you’ll just see a bunch of meaningless numbers. However, a machine understands this and is able to do what you asked it to.
Types of programming languages: (i will link their wikis as they’re a good read for understanding the differences, there would be no point in me explaining it all over again)
Basically, procedural programming langauges are based on on method calls in a given structure. Best example of a procedural language is C. You have a main function, in which you define a particular order in which other functions are called and what they’re supposed to do.
Object oriented programming languages rely on the concept of message passing between objects. They revolve around the concept that everything in a program is an object that can be interacted with in order to perform a particular task. An example of this is Java.
Functional programming languages are based on the concept of mathematical functions. Every function always returns a value and operates on a fixed data set. An example of this is Scheme.
Compilers and their types
A compiler can be implemented in many ways, although finally it just makes the computer understand and implement what you instructed it to do.
The first is a normal compiler - it takes the source code file you passed to it and generates machine code. It reads the entire source file, scanning for errors, etc. and, if none, generates the machine code and produces an executable that you can run to see your desired output. GCC/Clang and Go’s compiler are two examples of this.
The next is an interpreter - this is a different approach to executing programs. An interpreter is usually coupled with a virtual machine (more about this later) to execute instructions line-by-line, i.e. at runtime. Every instruction is read and executed simultaneously, throwing errors (if any) during execution. Python and Ruby are examples of this. Interpreted languages need a virtual machine to run but languages using virtual machines do not need to be interpreted. An example of this is Java. It is a compiled programming language, however it compiles its programs to bytecode, that its virtual machine understands and executes. Java bytecode is the equivalent of machine code, except that it targets the JVM (Java Virtual Machine) rather than the actual user’s host machine.
There are other types of compilers out there, but they usually borrow from these two main types.
Virtual Machines
Think of a virtual machine to be a computer itself. A very simple, very small computer, residing in your computer. Its made programatically, so there’s no actual computer sitting in there. This machine is usually given a simple instruction set to execute, ergo you can make your own programming language compile down to the instructions this machine understands, which can make things a tad bit easier. We’ll see how to implement this later.
Parts of a compiler
Every programming language’s compiler will consist of a few parts that work together to collectively be called a compiler. These parts are essential to the working of the compiler.
- Lexer
- Parser
- Semantic Analyser
- Code Generator
A lexer is a program that takes the input source file and tokenizes it - it takes each word and assigns to it a token type. This token type denotes the type of the toke, namely whether its a keyword or an identifier or a number or a bracket. Why does it do this? Well, a compiler needs to take in some input data and it doesn’t understand something like int x = 5; right off the bat- it needs to know what int is and what x is and what to do when it encounters these. This is done by the lexer, which assigns to each token a type that gives the compiler an idea of what its working with. Its also used by the parser to generate an Abstract Syntax Tree, or AST.
The parser, as mentioned above, is used to generate an abstract syntax tree. An Abstract Syntax Tree, or AST for short, is a tree-like representation of the programming language’s structure. Its got several advantages over other methods. So why do we need an AST? Well, to understand that we must understand what a grammar is. Every programming language has a grammar (rules) representing the way in which syntax in that language is to be structured/represented. The most common type of grammar is a context-free grammar. It is widely used to describe the grammar for programming languages because it makes it easy to write a fairly accurate grammar for any ambiguous programming language. Grammars are essential to any programming language and form the base of the parser. You always need to define a grammar in order to have a fully functional parser.
We now need a way to denote the context free grammar, and we’re going to be using the Backus-Naur form, or BNF. There is also EBNF (Extended Backus-Naur Form) which is used for more subtle grammar definitions.
These are widely used grammar notations for a variety of programming languages. For example, an if statement in C has the form:
if (expression) statement else statement
Here we see that expression is a placeholder for any valid expression. If you notice, we’re not being explicit about a particular type of expression. We’ve defined a generic structure for an if statement that provides a valid working structure that the parser must follow when parsing, and throw errors if the structure of a given if statement does not match the current structure.
In simple words, a grammar helps define a generic structure that every statement must follow for the parser to flag it as valid.
Now back to what an AST is and what it has to do with this.
Here’s an example AST for the sample program in the image. It seems really complex at first but its not. Everything in an AST is a node. An interconnection of nodes in a particular manner is what makes an AST. The top-most node is called the root node. A parent node is a node that can be a child to a root node (or other parent nodes) but itself has one or more children (hence the name, parent). In the image above, Program is the root node, :=, ... (which just means its the same as the previous one, :=) and while are the children of the Program node but parent to their respective nodes.
The first := is parent to the x node and the + node. The + node is the parent of the nodes a and b. The same applies for the second statement.
In the third statement, Block and > are children of the while node and so on…
You should have probably gotten the point by now.
Easy enough I hope?
The final tree that is formed is called the parse tree. This tree consists of nodes, each of which represent a certain programming element. The parser uses this tree to get an idea of the actual structure of the program that you’ve written. Its a very efficient way of parsing. There are again, many methods of parsing- recursive descent parsing, bottom-up parsing, top-down parsing are a few that immediately come to mind. Recursive descent is the most popular for modern languages, but there are languages that still use bottom-up and top-down approach to parsing. I won’t go into too much detail as these topics can be quite long to explain. However, each of the links above links to the Wikipedia page of that particular topic, and is a really good resource for understanding each method of parsing, as well as its advantages and disadvantages. Please do read that before proceeding.
The next step, semantic analysis, is where the source code is taken and skimmed for errors. This helps rule out all compile time errors.
The final step, code generation, is where the parsed source code is taken and converted into machine executable code. There are many ways to go about this. Some consider generating Assembly code that can be then compiled into an executable. Some prefer executing instructions through a VM that they built specifically built for the language, in which case they generate bytecode for that particular VM. More recently, it has also become quite popular for languages to generate C code, which can then be further compiled down to an executable using a C compiler. This makes it fairly easy to implement code generation, while still maintaining C-like performance. An example of a language that does this is Nim.
This is all for the first part. Please stay tuned for part 2, where we actually get into the implementation and get rid of the boring theory. I hope you found this useful!