Parsing a simple imperative language
Based on original Parsec tutorial

Last updated on January 1, 2017

This tutorial will present how to parse a subset of a simple imperative programming language called WHILE (introduced in the book “Principles of Program Analysis” by Nielson, Nielson and Hankin). It includes only a few statements and basic boolean/arithmetic expressions, which makes it nice material for a tutorial.

Imports

First let’s import the necessary libraries:

module Main (main) where

import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.Expr
import Text.Megaparsec.String -- input stream is of type ‘String’
import qualified Text.Megaparsec.Lexer as L

The language

The grammar for expressions is defined as follows:

a   ::= x | n | - a | a opa a
b   ::= true | false | not b | b opb b | a opr a
opa ::= + | - | * | /
opb ::= and | or
opr ::= > | <

Note that we have three groups of operators — arithmetic, boolean and relational ones.

And now the definition of statements:

S ::= x := a | skip | S1; S2 | ( S ) | if b then S1 else S2 | while b do S

We probably want to parse that into some internal representation of the language (an abstract syntax tree). Therefore we need to define the data structures for the expressions and statements.

Data structures

We need to take care of boolean and arithmetic expressions and the appropriate operators. First let’s look at the boolean expressions:

data BExpr
  = BoolConst Bool
  | Not BExpr
  | BBinary BBinOp BExpr BExpr
  | RBinary RBinOp AExpr AExpr
  deriving (Show)

Binary boolean operators:

data BBinOp = And | Or deriving (Show)

Relational operators:

data RBinOp = Greater | Less deriving (Show)

Now we define the types for arithmetic expressions:

data AExpr
  = Var String
  | IntConst Integer
  | Neg AExpr
  | ABinary ABinOp AExpr AExpr
  deriving (Show)

And arithmetic operators:

data ABinOp
  = Add
  | Subtract
  | Multiply
  | Divide
  deriving (Show)

Finally let’s take care of the statements:

data Stmt
  = Seq [Stmt]
  | Assign String AExpr
  | If BExpr Stmt Stmt
  | While BExpr Stmt
  | Skip
  deriving (Show)

Lexer

Having all the data structures we can go on with writing the code to do the actual parsing. Here we will define lexemes of our language. When writing a lexer for a language it’s always important to define what counts as whitespace and how it should be consumed. space from Text.Megaparsec.Lexer module can be helpful here:

sc :: Parser ()
sc = L.space (void spaceChar) lineCmnt blockCmnt
  where lineCmnt  = L.skipLineComment "//"
        blockCmnt = L.skipBlockComment "/*" "*/"

sc stands for “space consumer”. space takes three arguments: a parser that parses single whitespace character, a parser for line comments, and a parser for block (multi-line) comments. skipLineComment and skipBlockComment help with quickly creating parsers to consume the comments. (If our language didn’t have block comments, we could pass empty from Control.Applicative as the third argument of space.)

Next, we will use a strategy where whitespace will be consumed after every lexeme automatically, but not before it. Let’s define a wrapper to achieve this:

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

Perfect. Now we can wrap any parser in lexeme and it will consume any trailing whitespace with sc.

Since we often want to parse some “fixed” string, let’s define one more parser called symbol. It will take a string as argument and parse this string and whitespace after it.

symbol :: String -> Parser String
symbol = L.symbol sc

With these tools we can create other useful parsers:

-- | 'parens' parses something between parenthesis.

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")

-- | 'integer' parses an integer.

integer :: Parser Integer
integer = lexeme L.integer

-- | 'semi' parses a semicolon.

semi :: Parser String
semi = symbol ";"

Great. To parse various operators we can just use symbol, but reserved words and identifiers are a bit trickier. There are two points to note:

Parsers of reserved words should check that the parsed reserved word is not a prefix of an identifier.
Parsers of identifiers should check that parsed identifier is not a reserved word.

Let’s express it in code:

rword :: String -> Parser ()
rword w = string w *> notFollowedBy alphaNumChar *> sc

rws :: [String] -- list of reserved words
rws = ["if","then","else","while","do","skip","true","false","not","and","or"]

identifier :: Parser String
identifier = (lexeme . try) (p >>= check)
  where
    p       = (:) <$> letterChar <*> many alphaNumChar
    check x = if x `elem` rws
                then fail $ "keyword " ++ show x ++ " cannot be an identifier"
                else return x

identifier may seem complex, but it’s actually simple. We just parse a sequence of characters where first character is a letter and the rest is several characters where every one of them can be either letter or number. Once we have parsed such string, we check if it’s in list of reserved words, fail with informative message if it is, and return the result otherwise.

Note the use of try in identifier. This is necessary to backtrack to beginning of the identifier in cases when fail is evaluated. Otherwise things like many identifier would fail on such identifiers instead of just stopping.

And that’s it, we have just written lexer for our language, now we can start writing parser.

Parser

As already mentioned, a program in this language is simply a statement, so the main parser should basically only parse a statement. But remember to take care of initial whitespace — our parsers only get rid of whitespace after the tokens!

whileParser :: Parser Stmt
whileParser = between sc eof stmt

Now because any statement might be actually a sequence of statements separated by semicolon, we use sepBy1 to parse at least one statement. The result is a list of statements. We also allow grouping statements by parentheses, which is useful, for instance, in the while loop.

stmt :: Parser Stmt
stmt = parens stmt <|> stmtSeq

stmtSeq :: Parser Stmt
stmtSeq = f <$> sepBy1 stmt' semi
  -- if there's only one stmt return it without using ‘Seq’
  where f l = if length l == 1 then head l else Seq l

Now a single statement is quite simple, it’s either an if conditional, a while loop, an assignment or simply a skip statement. We use <|> to express choice. So a <|> b will first try parser a and if it fails (but without actually consuming any input) then parser b will be used. Note: this means that the order is important.

stmt' :: Parser Stmt
stmt' = ifStmt <|> whileStmt <|> skipStmt <|> assignStmt

If you have a parser that might fail after consuming some input, and you still want to try the next parser, you should take a look at the try combinator. For instance try p <|> q will try parsing with p and if it fails, even after consuming the input, the q parser will be used as if nothing has been consumed by p.

Now let’s define the parsers for all the possible statements. This is quite straightforward as we just use the parsers from the lexer and then use all the necessary information to create appropriate data structures.

ifStmt :: Parser Stmt
ifStmt = do
  rword "if"
  cond  <- bExpr
  rword "then"
  stmt1 <- stmt
  rword "else"
  stmt2 <- stmt
  return (If cond stmt1 stmt2)

whileStmt :: Parser Stmt
whileStmt = do
  rword "while"
  cond <- bExpr
  rword "do"
  stmt1 <- stmt
  return (While cond stmt1)

assignStmt :: Parser Stmt
assignStmt = do
  var  <- identifier
  void (symbol ":=")
  expr <- aExpr
  return (Assign var expr)

skipStmt :: Parser Stmt
skipStmt = Skip <$ rword "skip"

Expressions

What’s left is to parse the expressions. Fortunately Megaparsec provides an easy way to do that. Let’s define the arithmetic and boolean expressions:

aExpr :: Parser AExpr
aExpr = makeExprParser aTerm aOperators

bExpr :: Parser BExpr
bExpr = makeExprParser bTerm bOperators

Now we have to define the lists with operator precedence, associativity and what constructors to use in each case.

aOperators :: [[Operator Parser AExpr]]
aOperators =
  [ [Prefix (Neg <$ symbol "-") ]
  , [ InfixL (ABinary Multiply <$ symbol "*")
    , InfixL (ABinary Divide   <$ symbol "/") ]
  , [ InfixL (ABinary Add      <$ symbol "+")
    , InfixL (ABinary Subtract <$ symbol "-") ]
  ]

bOperators :: [[Operator Parser BExpr]]
bOperators =
  [ [Prefix (Not <$ rword "not") ]
  , [InfixL (BBinary And <$ rword "and")
    , InfixL (BBinary Or <$ rword "or") ]
  ]

In case of prefix operators it is enough to specify which one should be parsed and what is the associated data constructor. Infix operators are defined similarly, but there are several variants of infix constructors for various associativity. Note that the operator precedence depends only on the order of the elements in the list.

Finally we have to define the terms. In case of arithmetic expressions, it is quite simple:

aTerm :: Parser AExpr
aTerm = parens aExpr
  <|> Var      <$> identifier
  <|> IntConst <$> integer

However, the term in a boolean expression is a bit more tricky. In this case, a term can also be an expression with relational operator consisting of arithmetic expressions.

bTerm :: Parser BExpr
bTerm =  parens bExpr
  <|> (rword "true"  *> pure (BoolConst True))
  <|> (rword "false" *> pure (BoolConst False))
  <|> rExpr

Therefore we have to define a parser for relational expressions:

rExpr :: Parser BExpr
rExpr = do
  a1 <- aExpr
  op <- relation
  a2 <- aExpr
  return (RBinary op a1 a2)

relation :: Parser RBinOp
relation = (symbol ">" *> pure Greater)
  <|> (symbol "<" *> pure Less)

And that’s it. We have a quite simple parser which is able to parse a few statements and arithmetic/boolean expressions.

Notes

If you want to experiment with the parser inside GHCi, these functions might be handy:

parseTest p input applies parser p on input input and prints results.

Original Parsec tutorial in Haskell Wiki:

https://wiki.haskell.org/Parsing_a_simple_imperative_language

(Psst! Looking for source code for this tutorial? It's here.)

Parsing a simple imperative language Based on original Parsec tutorial