Representing syntax

Published: 2017-03-06
Words: 1731

When implementing small compilers or domain-specific languages, you will invariably need source locations, and a way to propagate them so they're at hand when it comes time to report an error to the user.

It is not immediately obvious how to do this well, especially when the language goes through several stages of transformation. I've arrived at a technique that I'm reasonably satisfied with, in which all ASTs are parameterised by an annotation, and derive Functor. At each stage of processing, we add further information to this annotation.

This post works through a small concrete example of this technique. It's a little heavy on code and light on insight, but hopefully really spells out a decent starting point, and will lead someone to publicly correct me with some better ideas.

A simple expression language

The language we'll implement for demonstration purposes is something like a simple calculator with infix operators and macros. It looks like this:

suc x = x + 1
add x y = x + y
mul x y = x * y
life = suc (40 + 1)
life * (suc 0) / (add 10 20)
55

The supported operators are +, *, and /. Names must start with a lowercase character. One definition is permitted per line. Lines without an = are expressions, to be computed and printed. Macro application associates like function application in Haskell.

Source positions

I suspect everyone who implements toy languages will have a module they cart around from project to project to deal with source location. Mine is pretty short and uninteresting: we need some notion of a point in a file, and some notion of a range in a file. The range structure should form a semigroup, so we can easily smash a list of consecutive ranges together.

import           Data.Semigroup

data SrcPosition = SrcPosition {
    posFile :: FilePath
  , posLine :: Integer
  , posColumn :: Integer
  } deriving (Eq, Ord, Show)

data SrcRange = SrcRange SrcPosition SrcPosition
  deriving (Eq, Ord, Show)

instance Semigroup SrcRange where
  SrcRange a _ <> SrcRange _ b = SrcRange a b
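
With this instance, collapsing a non-empty list of consecutive ranges into a single spanning range is just sconcat. A tiny helper (the name spanRanges is mine, not from any library):

-- Spans from the start of the first range to the end of the last.
-- NonEmpty comes from Data.List.NonEmpty.
spanRanges :: NonEmpty SrcRange -> SrcRange
spanRanges = sconcat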

There are likely a bunch of modules supporting this kind of operation online, though the only one I know of is Trevor Elliott's located, which is worth a look.

Lexing expressions

A lexer / scanner takes in plain text and produces a stream of tokens with position. Its errors are in terms of characters in the source text.

While there are some standalone lexer generators out there for Haskell, the user experience varies wildly. It is reasonable to use a parser combinator library like Parsec for this task.

Let's get some imports and pragmas out of the way. We'll be needing some GHC extensions to derive our Functor instance, as well as a few other useful typeclasses. Extensions related to the derivation mechanism do not undermine the safety of our Haskell. They generate ordinary code that must typecheck under the usual rules.

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveTraversable #-}
module Lexer where

import           Control.Applicative (empty, many)
import           Data.Text (Text, pack)

import           Text.Parsec.Text (Parser)
import qualified Text.Parsec as P

-- (plus the SrcPosition and SrcRange definitions from the previous section)

We'll also need to be able to transform Parsec's locations into our own, and a combinator to attach position information to parsers:

withPosition :: Functor f => Parser (f a) -> Parser (f SrcRange)
withPosition f =
  (\a g b -> (SrcRange (fromSrcPos a) (fromSrcPos b)) <$ g)
    <$> P.getPosition
    <*> f
    <*> P.getPosition

fromSrcPos :: P.SourcePos -> SrcPosition
fromSrcPos p =
  SrcPosition (P.sourceName p) (fromIntegral (P.sourceLine p)) (fromIntegral (P.sourceColumn p))

withPosition expects a Functor, and indeed we've got a few more to derive. Here are our possible tokens, plus room for annotations:

data Token a =
    Var a Text
  | Lit a Integer
  | Equals a
  | LParen a
  | RParen a
  | OpPlus a
  | OpMul a
  | OpDiv a
  | Newline a
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

... and the actual lexer, which I've very helpfully left mostly-unimplemented. I try to avoid using the Monad instance when writing a lexer; most of the time you can boil the lexical grammar all the way down to Applicative, with a call to many sitting at the top.

tokens :: Parser [Token SrcRange]
tokens =
  many (withPosition token)

token :: Parser (Token ())
token =
  {- ... -}
  empty
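
Filled in, token might look something like the sketch below. It's applicative-only, per the advice above; the exact name rule is my assumption, and I've ignored horizontal whitespace entirely (in a real lexer you'd skip spaces between tokens, say in tokens itself):

token :: Parser (Token ())
token =
  P.choice
    [ Lit () . read <$> P.many1 P.digit
    , Var () . pack <$> ((:) <$> P.lower <*> P.many P.alphaNum)
    , Equals () <$ P.char '='
    , LParen () <$ P.char '('
    , RParen () <$ P.char ')'
    , OpPlus () <$ P.char '+'
    , OpMul () <$ P.char '*'
    , OpDiv () <$ P.char '/'
    , Newline () <$ P.char '\n'
    ]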

Parsing expressions

A parser takes in a stream of tokens with position, and produces a syntax tree. Its errors are in terms of expected / unexpected tokens and parser rules.

The choice of tool here is again a little tortured if you're writing Haskell. Happy, the venerable parser generator, can do a great job, but it has its flaws. Parser combinators will work fine if you can left-factor your grammar first. Lately I've been making very productive use of the Earley library, which lets us specify our grammar directly in Haskell, with no Template Haskell or preprocessing step, and which can parse any context-free grammar thanks to Earley's algorithm. None of this really matters for this post, because the parser will be totally elided, ha ha!

Let's define a datatype for our concrete syntax, again a Functor:

data Syntax a =
    SLit a Integer
  | SVar a Text
  | SOp a Op
  | SApp a (Syntax a) (Syntax a)
  | SDecl a Text [Text] (Syntax a)
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

data Op =
    Plus
  | Mul
  | Div
  deriving (Eq, Ord, Show)

I strongly recommend having an intermediate datatype for your concrete syntax, rather than parsing straight into your core language: it keeps parse errors in terms of what the user actually wrote, and makes desugaring a separate pass you can test in isolation.

When parsing, we simply pull the SrcRange from the first and last token in each production, and combine them with (<>), storing the combined range as the annotation. For instance, we might write the SApp production like so:

sApp :: Parser (Syntax SrcRange)
sApp =
  (\e1 e2 -> SApp (extract e1 <> extract e2) e1 e2)
    <$> sExpr
    <*> sExpr

(I'll usually define Comonad or write a custom per-type extract function, plus a few helpers to fold the annotations over lists etc.)
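
For this Syntax type, a hand-rolled extract is mechanical:

extract :: Syntax a -> a
extract s =
  case s of
    SLit a _ -> a
    SVar a _ -> a
    SOp a _ -> a
    SApp a _ _ -> a
    SDecl a _ _ _ -> a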

Desugaring with annotations

We've now lexed and parsed our concrete syntax, and would like to transform it into a simple lambda calculus for typechecking and evaluation.

newtype Name = Name Text
  deriving (Eq, Ord, Show)

data Expr a =
    Lit a Integer
  | Var a Name
  | Op a Op
  | App a (Expr a) (Expr a)
  | Lam a Name (Expr a)
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

This core language should be completely invisible to the user: it's important that any type errors be reported in terms of the concrete syntax.

To this end, we can add to our annotations as we desugar. We'll need something to wrap SrcRange with this new information. This could be a simple tuple, or we could just define yet another functor:

data Annotation a =
    ALit a Integer
  | AVar a Text
  | AApp a
  | ADec a Text
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

... which gives our desugaring routine a type signature like this:

desugar :: Syntax a -> Map Text (Expr (Annotation a))
desugar syn =
  {- ... -}
  empty
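
Filled in, it might look something like the sketch below. The helpers (desugarExpr, opText), the choice to annotate operators with AVar, and the ADec annotation on each lambda are all my assumptions rather than anything forced by the types:

import           Data.Map (Map)
import qualified Data.Map as Map

desugar :: Syntax a -> Map Text (Expr (Annotation a))
desugar syn =
  case syn of
    -- 'add x y = x + y' becomes a binding to nested lambdas.
    SDecl a name args body ->
      Map.singleton name
        (foldr (\arg e -> Lam (ADec a name) (Name arg) e) (desugarExpr body) args)
    -- Bare expressions would be collected separately for evaluation.
    _ ->
      Map.empty

desugarExpr :: Syntax a -> Expr (Annotation a)
desugarExpr syn =
  case syn of
    SLit a x ->
      Lit (ALit a x) x
    SVar a x ->
      Var (AVar a x) (Name x)
    SOp a o ->
      Op (AVar a (opText o)) o
    SApp a e1 e2 ->
      App (AApp a) (desugarExpr e1) (desugarExpr e2)
    SDecl _ _ _ _ ->
      error "declarations only occur at the top level"

opText :: Op -> Text
opText o =
  case o of
    Plus -> "+"
    Mul -> "*"
    Div -> "/"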

Reporting errors

We've now very carefully tracked the provenance of every term in our program. We've been able to use these annotations at each stage to report intermediate errors.

We're now free to do whatever we need to with the core, like reduce it, perform substitutions, typecheck it, rewrite it. If we're careful to preserve the annotations at each stage and to add useful information as we go, it's hard to lose!

All we need to do is pretty-print our errors. You'll get a long way with something very basic like this:

renderAnnotation :: (a -> Text) -> Annotation a -> Text
renderAnnotation loc ann =
  case ann of
    ALit a x ->
      loc a <> " in the integer literal " <> T.pack (show x)
    AVar a x ->
      loc a <> " in the variable invocation '" <> x <> "'"
    AApp a ->
      loc a <> " in a function application"
    ADec a x ->
      loc a <> " in the declaration of '" <> x <> "'"

... though you may also get good value out of an annotated pretty-printing library, like annotated-wl-pprint.
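
Either way, the loc argument is just whatever renders your annotation; for the plain SrcRange case, perhaps (helper name is mine):

renderRange :: SrcRange -> Text
renderRange (SrcRange s _) =
  T.pack (posFile s)
    <> ":" <> T.pack (show (posLine s))
    <> ":" <> T.pack (show (posColumn s))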

Since we have the source location, we're also free to go back to the file and pull out the program text if need be. That's not so great if you always want to include the source text alongside your errors; in that case, you could carry the relevant source text inside your SrcRange datatype.
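
That variant might look like this (field names are mine):

data SrcRange = SrcRange {
    rangeStart :: SrcPosition
  , rangeEnd :: SrcPosition
  , rangeText :: Text
  } deriving (Eq, Ord, Show)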

Type errors

Type checking for this core language is a little bit out of scope for this post. This is just an aside to suggest propagating your syntax annotations into your types during checking, so we can throw errors that look like this:

data TypeError a =
    Mismatch (Type, a) (Type, a)
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)
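
Rendering such an error can then point at both offending pieces of syntax. A sketch, assuming some renderType :: Type -> Text exists:

renderTypeError :: (a -> Text) -> TypeError a -> Text
renderTypeError loc err =
  case err of
    Mismatch (t1, a1) (t2, a2) ->
      "Couldn't match " <> renderType t1 <> " (" <> loc a1 <> ")"
        <> " with " <> renderType t2 <> " (" <> loc a2 <> ")"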

Tracking the provenance of each type constraint like this is the style of Lennart Augustsson. If you haven't seen his talk on the subject, it's all of two minutes long; check it out.

This is nice - we know which piece of concrete syntax gave us each and every type in the program!

Other uses for the Functor

I find myself using the type parameter on expressions and concrete syntax all the time. A few uses off the top of my head:

- Erasing annotations with void (i.e. (() <$)), so trees can be compared structurally in tests without positions getting in the way.
- Storing the results of typechecking, annotating each term with its inferred Type.
- Layering on extra provenance as the program is transformed, as we did with Annotation while desugaring.
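
The first of these is particularly handy in parser tests; void comes from Data.Functor:

-- Strip positions so we can assert on tree structure alone.
parsesTo :: Syntax SrcRange -> Syntax () -> Bool
parsesTo actual expected =
  void actual == expected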

Alternatives and variations

There are plenty of ways to represent syntax, propagate source information and present errors. This post is meant to present where I'm currently at, in the hope that it might give someone else a decent starting point, or attract the attention of someone who knows an even better way.

Here are a few alternatives and variations on the above. Each has a different set of trade-offs, and each is well worth thinking about: