Some progressive Haskell hackers may wish to switch from Parsec to Megaparsec. This tutorial explains the practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in a better form.
- Imports
- Renamed things
- Removed things
- Completely changed things
- Other
- Character parsing
- Expression parsing
- What happened to
Text.Parsec.Token
? - What’s next?
Imports
You’ll mainly need to replace “Parsec” part in your imports with “Megaparsec”. That’s pretty simple. Typical import section of module that uses Megaparsec looks like this:
-- this module contains commonly useful tools:
import Text.Megaparsec
-- this module depends on type of data you want to parse, you only need to
-- import one of these:
import Text.Megaparsec.String -- if you parse ‘String’
import Text.Megaparsec.ByteString -- if you parse strict ‘ByteString’
import Text.Megaparsec.ByteString.Lazy -- if you parse lazy ‘ByteString’
import Text.Megaparsec.Text -- if you parse strict ‘Text’
import Text.Megaparsec.Text.Lazy -- if you parse lazy ‘Text’
-- if you need to parse permutation phrases:
import Text.Megaparsec.Perm
-- if you need to parse expressions:
import Text.Megaparsec.Expr
-- if you need to parse languages:
import qualified Text.Megaparsec.Lexer as L
So, the only noticeable difference that Megaparsec has no Text.Megaparsec.Token
module which is replaced with Text.Megaparsec.Lexer
, see about this in section “What happened to Text.Parsec.Token
”.
Renamed things
Megaparsec introduces a more consistent naming scheme, so some things are called differently, but renaming functions is a very easy task, you don’t need to think. Here are renamed items:
many1
→some
(re-exported fromControl.Applicative
)skipMany1
→skipSome
tokenPrim
→token
optionMaybe
→optional
(re-exported fromControl.Applicative
)permute
→makePermParser
buildExpressionParser
→makeExprParser
Character parsing:
alphaNum
→alphaNumChar
digit
→digitChar
endOfLine
→eol
hexDigit
→hexDigitChar
letter
→letterChar
lower
→lowerChar
octDigit
→octDigitChar
space
→spaceChar
†spaces
→space
†upper
→upperChar
† — pay attention to these, since space
parses many spaceChar
s, including zero, if you write something like many space
, your parser will hang. So be careful to replace many space
with either many spaceChar
or spaces
.
Removed things
Parsec also has many names for the same or similar things. Megaparsec usually has one function per task that does its job well. Here are the items that were removed in Megaparsec and reasons of their removal:
parseFromFile
— from file and then parsing its contents is trivial for every instance ofStream
and this function provides no way to use newer methods for running a parser, such asrunParser'
.getState
,putState
,modifyState
— ad-hoc backtracking user state has been eliminated.unexpected
,token
andtokens
, now there is a bit different versions of these functions under the same name.Reply
andConsumed
are not public data types anymore, because they are low-level implementation details.runPT
andrunP
were essentially synonyms forrunParserT
andrunParser
respectively.chainl
,chainl1
,chainr
, andchainr1
— useText.Megaparsec.Expr
instead.
Completely changed things
In Megaparsec 5 the modules Text.Megaparsec.Pos
and Text.Megaparsec.Error
are completely different from those found in Parsec and Megaparsec 4. Take some time to look at documentation of the modules if your use-case requires operations on error messages or positions. You may like the fact that we have well-typed and extensible error messages now.
Other
The
Stream
type class now haveupdatePos
method that gives precise control over advancing of textual positions during parsing.Note that argument order of
label
has been flipped (the label itself goes first now), so you can write now:myParser = label "my parser" $ …
.Don’t use the
label ""
(or the… <?> ""
) idiom to “hide” some “expected” tokens from error messages, usehidden
.New
token
parser is more powerful, its first argument provides full control over reported error message while its second argument allows to specify how to report missing token in case of empty input stream.Now
tokens
parser allows to control how tokens are compared (yes, we have case-insensitivestring
calledstring'
).The
unexpected
parser allows to specify precisely what is unexpected in well-typed manner.Tab width is not hard-coded anymore, use
getTabWidth
andsetTabWidth
to change it. Default tab width isdefaultTabWidth
.Now you can reliably test error messages, equality for them is now defined properly (in Parsec
Expect "foo"
is equal toExpect "bar"
), error messages are also well-typed and customizeable.To render a error message, apply
parseErrorPretty
on it.count' m n p
allows you to parse fromm
ton
occurrences ofp
.Now you have
someTill
andeitherP
out of the box.token
-based combinators likestring
andstring'
backtrack by default, so it’s not necessary to usetry
with them (beginning from4.4.0
). This feature does not affect performance.The new
failure
combinator allows to fail with an arbitrary error message, it even allows to use your own data types.
Character parsing
New character parsers in Text.Megaparsec.Char
may be useful if you work with Unicode:
asciiChar
charCategory
controlChar
latin1Char
markChar
numberChar
printChar
punctuationChar
separatorChar
symbolChar
Ever wanted to have case-insensitive character parsers? Here you go:
char'
oneOf'
noneOf'
string'
Expression parsing
makeExprParser
has flipped order of arguments: term parser first, operator table second. To specify associativity of infix operators you use one of the three Operator
constructors:
InfixN
— non-associative infixInfixL
— left-associative infixInfixR
— right-associative infix
What happened to Text.Parsec.Token
?
That module was extremely inflexible and thus it has been eliminated. In Megaparsec you have Text.Megaparsec.Lexer
instead, which doesn’t impose anything on user but provides useful helpers. The module can also parse indentation-sensitive languages.
Let’s quickly describe how you go about writing your lexer with Text.Megaparsec.Lexer
. First, you should import the module qualified, we will use L
as its synonym here.
White space
Start writing your lexer by defining what counts as white space in your language. space
, skipLineComment
, and skipBlockComment
can be helpful:
sc :: Parser () -- ‘sc’ stands for “space consumer”
sc = L.space (void spaceChar) lineComment blockComment
where lineComment = L.skipLineComment "//"
blockComment = L.skipBlockComment "/*" "*/"
This is generally called space consumer, often you’ll need only one space consumer, but you can define as many of them as you want. Note that this new module allows you avoid consuming newline characters automatically, just use something different than void spaceChar
as first argument of space
. Even better, you can control what white space is on per-lexeme basis:
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: String -> Parser String
symbol = L.symbol sc
Monad transformers
Note that all tools in Megaparsec work with any instance of MonadParsec
. All commonly useful monad transformers like StateT
and WriterT
are instances of MonadParsec
out of the box. For example, what if you want to collect contents of comments, (say, they are documentation strings of a sort), you may want to have backtracking user state were you put last encountered comment satisfying some criteria, and then when you parse function definition you can check the state and attach doc-string to your parsed function. It’s all possible and easy with Megaparsec:
import Control.Monad.State.Lazy
…
type MyParser = StateT String Parser
skipLineComment' :: MyParser ()
skipLineComment' = …
skipBlockComment' :: MyParser ()
skipBlockComment' = …
sc :: MyParser ()
sc = space (void spaceChar) skipLineComment' skipBlockComment'
Indentation-sensitive languages
Parsing of indentation-sensitive language deserves its own tutorial, but let’s take a look at basic tools upon which you can build. First of all you should work with space consumer that doesn’t eat newlines automatically. This means you’ll need to pick them up manually.
Main helper is called indentGuard
. It takes parser that will be used to consume white space (indentation) and a predicate of type Int -> Bool
. If after running the given parser column number does not satisfy given predicate, the parser fails with message “incorrect indentation”, otherwise it returns current column number.
In simple cases you can explicitly pass around value returned by indentGuard
, i.e. current level of indentation. If you prefer to preserve some sort of state you can achieve backtracking state combining StateT
and ParsecT
, like this:
StateT Int Parser a
Here we have state of type Int
. You can use get
and put
as usual, although it may be better to write modified version of indentGuard
that could get current indentation level (indentation level on previous line), then consume indentation of current line, perform necessary checks, and put new level of indentation.
Later update: now we have full support for indentation-sensitive parsing, see nonIndented
, indentBlock
, and lineFold
in the Text.Megaparsec.Lexer
module.
Character and string literals
Parsing of string and character literals is done a bit differently than in Parsec. You have the single helper charLiteral
, which parses a character literal. It does not parse surrounding quotes, because different languages may quote character literals differently. Purpose of this parser is to help with parsing of conventional escape sequences (literal character is parsed according to rules defined in Haskell report).
charLiteral :: Parser Char
charLiteral = char '\'' *> charLiteral <* char '\''
Use charLiteral
to parse string literals. This is simplified version that will accept plain (not escaped) newlines in string literals (it’s easy to make it conform to Haskell syntax, this is left as an exercise for the reader):
stringLiteral :: Parser String
stringLiteral = char '"' >> manyTill L.charLiteral (char '"')
I should note that in charLiteral
we use built-in support for parsing of all the tricky combinations of characters. On the other hand Parsec re-implements the whole thing. Given that it mostly has no tests at all, I cannot tell for sure that it works.
Numbers
Parsing of numbers is easy:
integer :: Parser Integer
integer = lexeme L.integer
float :: Parser Double
float = lexeme L.float
number :: Parser Scientific
number lexeme L.number -- similar to ‘naturalOrFloat’ in Parsec
Note that Megaparsec internally uses standard Haskell functions to parse floating point numbers, thus no precision loss is possible (and it’s tested). On the other hand, Parsec again re-implements the whole thing. Approach taken by Parsec authors is to parse the numbers one by one and then re-create the floating point number by means of floating point arithmetic. Any professional knows that this is not possible and the only way to parse floating point number is via bit-level manipulation (it’s usually done on OS level, in C libraries). Of course results produced by Parsec built-in parser for floating point numbers are incorrect. This is a known bug now, but it’s been a long time till we “discovered” it, because again, Parsec has no test suite. (Update: it took one year but Parsec’s maintainer has recently merged a pull request that seems to fix that and released Parsec 3.1.11.)
Hexadecimal and octal numbers do not parse “0x” or “0o” prefixes, because different languages may have other prefixes for this sort of numbers. We should parse the prefixes manually:
hexadecimal :: Parser Integer
hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal
octal :: Parser Integer
octal = lexeme $ char '0' >> char' 'o' >> L.octal
Since Haskell report says nothing about sign in numeric literals, basic parsers like integer
do not parse sign. You can easily create parsers for signed numbers with help of signed
:
signedInteger :: Parser Integer
signedInteger = L.signed sc integer
signedFloat :: Parser Double
signedFloat = L.signed sc float
signedNumber :: Parser Scientific
signedNumber = L.signed sc number
And that’s it, shiny and new, Text.Megaparsec.Lexer
is at your service, now you can implement anything you want without the need to copy and edit entire Text.Parsec.Token
module (people had to do it sometimes, you know).
What’s next?
Changes you may want to perform may be more fundamental than those described here. For example, previously you may have to use a workaround because Text.Parsec.Token
was not sufficiently flexible. Now you can replace it with a proper solution. If you want to use the full potential of Megaparsec, take time to read about its features, they can help you improve your parsers.