21 minutes
Parsing Programming Languages with FParsec
Over the past few months, after learning about parser combinators in Joachim Breitner’s Haskell course, I’ve found myself needing or wanting to approach a problem by writing a parser more often than I had ever expected to. Each time, I’m amazed and satisfied by the simplicity and declarativity of parser-combinators. Not only does the functional pattern of combinator functions result in clean and readable code (except for mystical operators here and there), I find that it also makes solving problems much more interesting and much less repetitive.
My first use-case for parser-combinators outside of the aforementioned Haskell course was parsing a “pseudocode” language used by my university to teach the first few weeks of our intro CS course. The irony in this “pseudocode” was that its syntax was so strictly defined that it sort of transcended its label, and I figured I’d be able to create a parser that successfully transformed correct “pseudocode” into an AST. Once I finished writing a parser, I figured it wouldn’t be too difficult to implement an interpreter as well, and indeed this too was within reach. I enjoyed the project a lot, and in this post and the post that will follow (in which we will implement an interpreter), I wanted to share the process as a more exciting and slightly more challenging alternative to the typical JSON parsing tutorial that most parser-combinators already have.
Note that the language we will be writing a parser for in this post is a language of our own creation! If you disagree with me on how the syntax of the language should appear and feel confident enough in your understanding of parser-combinators, feel free to experiment and create a language syntax that’s more akin to your preferences. Another thing to note is that this tutorial is not quite suited to those who are unfamiliar with F# and functional programming. I would also recommend doing some more basic reading on parser-combinators before approaching this post, as I try not to spend time on what they are but rather on explaining how they can be used in a cool way. The FParsec documentation’s tutorial is very good and very helpful, as is Scott Wlaschin’s series on Understanding Parser Combinators.
Representing an AST
The goal of a parser is to give structure and meaning to the information it receives. When dealing with source code for programming languages, we often want to give structure and meaning to code by translating it into an “abstract syntax tree.” Modelling an AST in F# is pretty easy using Algebraic Data Types. We simply need to create types and constructors that correspond to every type of thing that the source code for a program in our language might contain.
A good place to start in representing the syntax of our language is values, such as integers, strings, and booleans. In this series of posts, we will make it our goal to be able to parse and eventually interpret an implementation of FizzBuzz in our language, but not much more. For this reason, we will not be concerned with parsing floating point numbers, arrays, etc. Integers, strings, and booleans will do.
type Value =
| Int of int
| Str of string
| Bool of bool
From this point on, we are “thinking upwards.” Once we have succeeded in representing one foundational structure, we can ask ourselves what kinds of structures we’ve enabled ourselves to represent by defining this one. Now that we have a way to represent values in our AST, we might want to represent expressions as well. The various forms that expressions take in most programming languages tend to be:
- Literals (values)
- Variables (which evaluate down to values)
- Operations (which chain together expressions and ultimately evaluate down to values)
Defining values gives us a way to represent both literals and variables, but to represent operations we need a way to represent operators as well. Before creating an Expression type, we will define an Operator type like so:
type Operator =
| Add // Arithmetic operators
| Sub
| Mult
| Div
| Mod
| Gt // Comparison operators
| Lt
| Gte
| Lte
| Eq // Equality operators
| Neq
| And // Boolean operators
| Or
| Sconcat // String concatenation
And now, we can define an Expr
type:
type Expr =
| Literal of Value
| Variable of name:string
// The Expr type is recursive, as operations
// can consist of expressions
| Operation of (Expr * Operator * Expr)
Believe it or not, we are almost done building types to represent our AST. All that’s left is a type to represent statements. Because all we are aiming to do with this language is parse and interpret FizzBuzz, we will avoid implementing functions in our language and will only implement some of the most basic imperative constructs:
- Printing output
- Assigning values to variables
- If/Else conditionals
- While loops
This will make for a very simple language, but we will be able to do some fun things with it once we’re done. We can represent statements by pairing them with the information they need to execute. E.g. Print requires an expression to output, Set (as in set equal to) requires a variable name and an expression to pair that name with, etc.
type Statement =
| Print of Expr
| Set of name:string * value:Expr
| If of condition:Expr * body:Block * else:Block option
| While of condition:Expr * body:Block
and Block = Statement list
We have defined a Block type in a mutually recursive fashion with the Statement type to represent lists of statements, which constitute the bodies of ifs and whiles in our AST, as well as our program as a whole.
With these types defined, we can begin writing parsers that will output values of these types, and eventually we can write a program that executes certain instructions based on data that comes in the form of an AST.
Parsing
Values
Now that we have a meaningful way to represent parsed source code, we can begin writing parsers that output values of the types in our AST representation.
I like to start by defining some convenient combinators which I use frequently that FParsec doesn’t provide on its own.
let pword s = pstring s .>> spaces
// A combinator that transforms a parser by requiring that it
// be wrapped in parentheses
let parens p = between (pword "(") (pword ")") p
The .>>
operator returns the result of the parser on the left and discards the result of the parser on the right.
Now we can slowly begin implementing parsers for all the things in our AST, beginning with Values which may be integers, strings, or booleans.
let pbool: Parser<Value, Unit> =
pword "true" <|> pword "false"
|>> function
| "true" -> Bool (true)
| "false" -> Bool (false)
// FParsec defines the pint32 parser.
// We simply cast its result to an int
// then construct an Integer Value from it
let pint: Parser<Value, Unit> = pint32 |>> int |>> Int
The <|>
operator is the “or” operator. If the parser on the left fails, it will use the parser on the right instead. If both
fail, we get a parsing error. Whether the parsed string is “true” or “false”, we can easily pass it to our value constructor to
convert it to a Bool Value. When we want to try several parsers, using the <|>
operator can become inconvenient. In those
situations, the choice
combinator exists, and operates on a list of parsers of the same type. We will see how it’s useful later,
but know that we could have just as well done choice [ pword "true" ; pword "false" ]
to achieve the same effect as using the
“or” operator here.
The |>>
operator lets us pipe the result of our parser to a function that normally operates on the type of our result,
then returns whatever value it produces into a Parser value. In this case, we have to pipe the result of parsing the string
“true” or “false” to a pattern match where we match each string to a corresponding Bool Value.
Note that when we use parser-combinators, we do not actually receive values in their original or “regular” types until we finally run our
parser with a variant of the run
function defined by FParsec. Until then, you will notice that our values are always in
the form Parser<Type, Unit>
. This is done in accordance with a functional design pattern that
Scott Wlaschin explains more carefully on his site than I can in
this post.
let pstringliteral: Parser<Value, Unit> =
// This line returns a list of chars, which we have to
// turn into a string before turning into a Str Value
pchar '\"' >>. manyCharsTill anyChar (pchar '\"')
|>> string |>> Str
// Discard the spaces at the end
.>> spaces
In order to parse any sequence of characters between a pair of quotes, we first parse the opening quote and discard it, then
use the manyCharsTill
combinator to parse all the remaining characters up until the closing quote (which is discarded).
Note that the >>.
operator is very similar to the .>>
operator. In this case, the value on the left of the operator is
discarded. A good way to think about it is that whatever is on the side of the .
is what gets kept. The operators >>
and
.>>.
also exist, for when you only want to discard twice in a row or when you want to keep two parsed values in a row, in which
case you’ll receive a tuple.
manyCharsTill
takes two parsers. The former is run until the latter is eventually matched, at which point it stops.
Now that we have defined parsers for the three types of values that might be represented in our AST, we want to create
a pvalue
parser that combines all three of them. To do this, we will use choice
:
let pvalue: Parser<Value, Unit> =
choice [
pint
pstringliteral
pbool
]
Running and Testing Parsers
As we work on our parser, it’s a good idea to test the parsers you build individually as you progress. The easiest way
to do this is to create a console application that references FParsec and uses the provided run
function on a parser
and a string. Be aware that it returns a result type, so you can define a test function like this:
let test parser strInput =
match run parser strInput with
// Assuming your parser returns something
// that can be printed. For our purposes,
// %O is usually enough.
| Success (result, _, _) -> printfn "%O" result
| Failure (error, _, _) -> printfn "%s" error
Also be aware that FParsec’s parsing errors are extremely informative and helpful!
Expressions
Going in the same order in which we defined types in our AST representation, we now want to implement expression parsers that make use of our value parser.
We can easily start by writing a parser for Literal
s, since all they require is a Value
which we can already parse.
let pliteral: Parser<Expr, Unit> = pvalue |>> Literal
Writing a variable parser is slightly more difficult, but the implementation is largely up to us. I have chosen to make
valid variable names begin with letters and contain either letters or numbers. We can implement this rule using the
many1Satisfy2
combinator which takes two predicates and parses as many characters as possible so long as
- The first char must satisfy the first predicate
- The remaining chars must satisfy the second predicate
Using this convenient combinator, we can write a variable parser that implements the rule above like so:
let pvariable: Parser<Expr, Unit> =
many1Satisfy2 (System.Char.IsLetter) (System.Char.IsLetterOrDigit)
|>> Variable
.>> spaces
The most difficult and convoluted parser to implement is the operation parser. FParsec provides some convenient utilities
to make this task less difficult, but ultimately it is still rather tedious. FParsec includes the
OperatorPrecedenceParser
class. This allows us to define a parser for each “term” of an operation, define operators themselves according to their symbol,
associativity, precedence, and how the parser should handle them, and in return get an operation parser handed to us for a given
type. This is extremely convenient, whereas implementing operator parsing on our own would’ve been rather difficult.
Let’s begin configuring an OperatorPrecedenceParser
for integers, beginning by initialising one:
let intOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()
Every OperatorPrecedenceParser has an expression parser that gets updated as we further define the OperatorPrecedenceParser by giving it a term parser and operators to operate on those terms. If at any point we want to parse an operation between two integers, we will use this parser.
let intExpr = intOperatorParser.ExpressionParser
We define the parser for an int “term”, that is, something that lies on either side of an integer operation (which may be an integer operation itself, which is why we refer to the intexpr parser in our term parser)
let intTerm = choice [
pint .>> spaces |>> Literal <|> pvariable
parens intExpr
]
// Assign the term parser we designed to the
// OperatorPrecedenceParser instance
intOperatorParser.TermParser <- intTerm
The OperatorPrecedenceParser
class has an AddOperator
method to which we can pass an InfixOperator
object (a class also
defined by FParsec) that takes as input the following:
- A string symbol representing the operator
- A parser to run after consuming the operator (usually a white-space parser)
- The relative precedence level of the operator as an integer
- The associativity of the operator (FParsec provides an Associativity type with constructors Left and Right)
- A function that takes two terms and returns a term
('TTerm -> 'TTerm -> 'TTerm)
It can be a little difficult to know how to create every operator we need using this constructor. One thing to note is that
the function that the InfixOperator
constructor takes, known as a mapping, will not actually apply an operation to two
terms in our case. Since we are only parsing here, we will take two Expr
s in our mapping and return an Operation
, which
works because Operation
s are Expr
s, as are variables and literals. This is, in fact, so simple that it does not need
to change much at all between operators. We can write a simple function:
let createOperation op x y = Operation (x, op, y)
Which we can later partially apply to an Operator
to define our mapping. Here is an example of how creating an InfixOperator
might work:
let addition = InfixOperator("+", spaces, 2, Associativity.Left, createOperation Add)
intOperatorParser.AddOperator(addition)
Now, when the intExpr
parser is applied to a string like 1 + 3
, the AST representation will look something like this:
Operation (
Literal (Int 1),
Add,
Literal (Int 3)
)
Which we can easily work with later in the interpreting phase.
In order to minimise code repetition, we can employ the following pattern to quickly create and add all the operators we need for integers:
type OperatorDetails = { Symbol: string;
Precedence: int;
Operator: Operator }
let intOperators = [
{Symbol = "+"; Precedence = 2; Operator = Add}
{Symbol = ">"; Precedence = 1; Operator = Gt}
{Symbol = "<"; Precedence = 1; Operator = Lt}
{Symbol = ">="; Precedence = 1; Operator = Gte}
{Symbol = "<="; Precedence = 1; Operator = Lte}
{Symbol = "=="; Precedence = 1; Operator = Eq}
{Symbol = "!="; Precedence = 1; Operator = Neq}
{Symbol = "+"; Precedence = 2; Operator = Add}
{Symbol = "-"; Precedence = 2; Operator = Sub}
{Symbol = "*"; Precedence = 3; Operator = Mult}
{Symbol = "/"; Precedence = 3; Operator = Div}
{Symbol = "%"; Precedence = 3; Operator = Mod}
]
let addOperators precedenceParser operatorTable =
operatorTable
|> List.iter (fun details ->
let operator =
InfixOperator(
details.Symbol,
spaces,
details.Precedence,
Associativity.Left,
createOperation details.Operator
)
precedenceParser.AddOperator(operator))
addOperators intOperatorParser intOperators
We create a small table of operators for integers, then define a function that iterates over the table and adds each operator
to its OperatorPrecedenceParser
. This is a relatively neat way to do what becomes rather ugly otherwise. We need to do the same
thing for strings and booleans, but they have far fewer operations defined over them, so we have slightly less to worry about.
// Define similar structures for booleans and strings
let boolOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()
let boolExpr = boolOperatorParser.ExpressionParser
let boolTerm = choice [
pboolval .>> spaces |>> Literal
pvariable
parens boolExpr
]
boolOperatorParser.TermParser <- boolTerm
let strOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()
let strExpr = strOperatorParser.ExpressionParser
// We want to make sure we can concatenate
// non-string values with strings, so we
// accept any literal or variable
let strTerm = choice [
pliteral
pvariable
intExpr
boolExpr
parens strExpr
]
strOperatorParser.TermParser <- strTerm
Now we define their operator tables and add those operators with List.iter
as we did for integers:
let boolOperators = [
{Symbol = "and"; Precedence = 2; Operator = And}
{Symbol = "or"; Precedence = 1; Operator = Or}
]
let stringOperators = [
{Symbol = "++"; Precedence = 1; Operator = Sconcat}
]
addOperators boolOperatorParser boolOperators
addOperators strOperatorParser stringOperators
Now we have a parser for every type of operation, we can define an operation parser much like we defined a value parser earlier:
let poperation = choice [
intExpr
boolExpr
strExpr
]
And now, since we have a literal parser, a variable parser, and an operation parser, we can define an expression parser:
let pexpression = choice [
poperation
pliteral
pvariable
]
We can finally move on to parsing statements.
Statements
Something to note before we get started implementing statement parsers is that the Statement
type is recursive. The
Expr
type is also recursive, but we handled that with the OperatorPrecedenceParser
class given to us by FParsec.
The Statement
class is recursive because If
blocks and While
loops must contain statements themselves to be useful.
We will find that before we can create a pstatement
parser the way we did with pvalue
and poperation
, we will need
access to a pstatement
parser for use in defining the parsers for if and while. FParsec offers a solution for this as well,
as recursive parsers aren’t awfully uncommon.
The solution is the createParserForwardedToRef
function, which takes advantage of F#’s multi-paradigm nature to simplify
the handling of recursive parsers. Of course, it’s all abstracted rather well, so we don’t need to worry too much about how
it’s implemented to make good use of it. Before we begin implementing statement parsers, we will create the pstatement
parser like so:
let pstatement, pstatementref = createParserForwardedToRef<Statement, Unit>()
Later on, when we wish to parse a statement within a statement parser, we can use pstatement
.
Let’s parse all our statements in order, starting with print:
let pprint: Parser<Statement, Unit> =
pword "print"
>>. parens pexpression
|>> Print
We’re able to use the >>.
operator in conjunction with the parens
combinator we defined earlier to parse out the expression
within a print(...)
statement and pass it to the Print
constructor.
Next we’ll implement Set
which parses variable assignment:
let pset: Parser<Statement, Unit> =
let identifier =
many1Skip2 (System.Char.IsLetter) (System.Char.IsLetterOrDigit)
.>> pword "="
identifier .>>. pexpression
|>> Set
Here we define an identifier
parser that parses valid names of variables before an equals sign (which gets discarded),
then make use of the .>>.
operator to parse both an identifier and an expression and pass the resulting tuple to the Set
constructor.
Before we move on, we’ll define a parser called pblock
which parses blocks of code between curly braces:
let pblock: Parser<Statement list, Unit> =
between (pword "{") (pword "}") (many pstatement)
Now we can define parsers for if and while:
let pif: Parser<Statement, Unit> =
let condition = pword "if" >>. pexpression
let inner = pblock
let elseBlock = pword "else" >>. pblock |> opt
pipe3 condition inner elseBlock (fun condition inner elseBlock -> If (condition, inner, elseBlock))
In the if parser, we first parse a condition by parsing out the “if” then disposing of it, and keeping the valid expression
that follows. We use pblock
to parse the body of the if, and we use FParsec’s opt
to optionally parse an else block as well.
If there is no else block, this parser returns None
which we pass to the If
constructor. The parser gets a little ugly
toward the end because we need to use pipe3
. Unfortunately there is no operator in FParsec that conveniently performs
the same task as .>>.
but returning a triple rather than a tuple. Instead, we must use pipe3
which applies 3 given
parsers in order, and then pipes their results to a function that takes 3 inputs. Here we use pipe3
to pipe the condition,
if body, and potentially existing else block to the If
constructor.
Thankfully, the while parser is much nicer:
let pwhile: Parser<Statement, Unit> =
let condition = pword "while" >>. pexpression
condition .>>. pblock
|>> While
We use the same parsing strategy we employed for parsing the condition in the if parser here, while using the same .>>.
strategy
for piping to the constructor that we used for pset
.
We have now defined all of our statement parsers, but before we can progress, we must complete one last step required by the
pstatement
parser that was forwarded to a reference cell. Similar to how we created the pvalue
and pexpression
parsers,
we assign a choice
of the statement parsers we’ve written to the pstatementref
that was created when we initialised the
pstatement
parser:
do pstatementref := choice [
pprint
pif
pwhile
pset
]
Do not worry if this step confuses you. This quote from the FParsec documentation’s JSON parsing tutorial helps demystify it a little bit:
The grammar rules for JSON lists and objects are recursive, because any list or object can contain itself any kind of JSON value. Hence, in order to write parsers for the list and object grammar rules, we need a way to refer to the parser for any kind of JSON value, even though we haven’t yet constructed this parser. Like it is so often in computing, we can solve this problem by introducing an extra indirection:
let jvalue, jvalueRef = createParserForwardedToRef<Json, unit>()
As you might have guessed from the name,
createParserForwardedToRef
creates a parser (jvalue) that forwards all invocations to the parser in a reference cell (jvalueRef). Initially, the reference cell holds a dummy parser, but since the reference cell is mutable, we can later replace the dummy parser with the actual value parser, once we have finished constructing it.
You can read the full tutorial here.
Tying it All Together
At this point we have implemented parsers for everything that appears in the set of types we created to represent our language’s AST. This means that we can write programs in the language we specified and run them through our parser to get an AST as output. Later on, we can use this generated AST to map code in the source language to F# code to implement an interpreter. For now, let’s write a testing function that will allow us to see how far we’ve come.
As explained at an earlier point, FParsec provides a run
function and variants of it for running parsers. We specifically
want to run a parser on a file, and what we want to parse is as many statements as we can. FParsec provides a runParserOnFile
function, and we have built a statement parser which we can combine with the many
combinator to achieve the effect we’re looking
for:
let parseSourceFile fpath =
match runParserOnFile (many pstatement) () fpath System.Text.Encoding.UTF8 with
| Success (result, _, _) -> printfn "%A" result
| Failure (error, _, _) -> printfn "%s" error
Now we need a file to test this on. As promised, a FizzBuzz implementation will do:
i = 1
while i <= 100 {
if i % 15 == 0 {
print("fizzbuzz")
} else {
if i % 3 == 0 {
print("fizz")
} else {
if i % 5 == 0 {
print("buzz")
} else {
print(i)
}
}
}
i = i + 1
}
Running your parser on this file should result in something like the following:
[Set ("i", Literal (Int 1));
While
(Operation (Variable "i", Lte, Literal (Int 100)),
[If
(Operation
(Operation (Variable "i", Mod, Literal (Int 15)), Eq, Literal (Int 0)),
[Print (Literal (Str "fizzbuzz"))],
Some
[If
(Operation
(Operation (Variable "i", Mod, Literal (Int 3)), Eq,
Literal (Int 0)), [Print (Literal (Str "fizz"))],
Some
[If
(Operation
(Operation (Variable "i", Mod, Literal (Int 5)), Eq,
Literal (Int 0)), [Print (Literal (Str "buzz"))],
Some [Print (Variable "i")])])]);
Set ("i", Operation (Variable "i", Add, Literal (Int 1)))])]
Excitingly enough, this is an AST that works for us. It properly represents what’s going on in the code, and next we can map it to real F# code so that it can properly function.
Conclusion
This blog post was written for F# Advent 2020 and it marks my first time participating in what I believe is a wonderful tradition. I hope that those who read it find it helpful and interesting, and that those who are itching for more will anticipate the next post in the series in which we’ll be implementing an interpreter to go along with this parser. If you’d like to see me implement the interpreter live, I regularly stream what I’m programming to Twitch with nothing but videogame music as background noise, so you can easily tune in while doing other things. I appreciate the F# community and the work they’ve put into making the language and its ecosystem better, and I’m more than happy to be contributing to it with this post!
Errata
It’s more than likely that I’ve made an error here or there throughout this post, and if you find one I encourage you to reach out to me so I can fix it and prevent future readers from being frustrated or mislead! You can contact me at ambika.eshwar@rutgers.edu or create an issue on the GitHub repository containing the source code for this tutorial, which will be available soon! Thanks to those of you who take the time to read this post, and thanks again to the wonderful F# community who make F# nicer to use by the day by organising communal efforts like F# advent!