Parsing Programming Languages with FParsec

Over the past few months, after learning about parser combinators in Joachim Breitner’s Haskell course, I’ve found myself needing or wanting to approach a problem by writing a parser more often than I had ever expected to. Each time, I’m amazed and satisfied by the simplicity and declarativity of parser-combinators. Not only does the functional pattern of combinator functions result in clean and readable code (except for mystical operators here and there), I find that it also makes solving problems much more interesting and much less repetitive.

My first use-case for parser-combinators outside of the aforementioned Haskell course was parsing a “pseudocode” language used by my university to teach the first few weeks of our intro CS course. The irony in this “pseudocode” was that its syntax was so strictly defined that it sort of transcended its label, and I figured I’d be able to create a parser that successfully transformed correct “pseudocode” into an AST. Once I finished writing a parser, I figured it wouldn’t be too difficult to implement an interpreter as well, and indeed this too was within reach. I enjoyed the project a lot, and in this post and the post that will follow (in which we will implement an interpreter), I wanted to share the process as a more exciting and slightly more challenging alternative to the typical JSON parsing tutorial that most parser-combinators already have.

Note that the language we will be writing a parser for in this post is a language of our own creation! If you disagree with me on how the syntax of the language should appear and feel confident enough in your understanding of parser-combinators, feel free to experiment and create a language syntax that’s more akin to your preferences. Another thing to note is that this tutorial is not quite suited to those who are unfamiliar with F# and functional programming. I would also recommend doing some more basic reading on parser-combinators before approaching this post, as I try not to spend time on what they are but rather on explaining how they can be used in a cool way. The FParsec documentation’s tutorial is very good and very helpful, as is Scott Wlaschin’s series on Understanding Parser Combinators.

Representing an AST

The goal of a parser is to give structure and meaning to the information it receives. When dealing with source code for programming languages, we often want to give structure and meaning to code by translating it into an “abstract syntax tree.” Modelling an AST in F# is pretty easy using Algebraic Data Types. We simply need to create types and constructors that correspond to every type of thing that the source code for a program in our language might contain.

A good place to start in representing the syntax of our language is values, such as integers, strings, and booleans. In this series of posts, we will make it our goal to be able to parse and eventually interpret an implementation of FizzBuzz in our language, but not much more. For this reason, we will not be concerned with parsing floating point numbers, arrays, etc. Integers, strings, and booleans will do.

type Value =
    | Int of int
    | Str of string
    | Bool of bool

From this point on, we are “thinking upwards.” Once we have succeeded in representing one foundational structure, we can ask ourselves what kinds of structures we’ve enabled ourselves to represent by defining this one. Now that we have a way to represent values in our AST, we might want to represent expressions as well. The various forms that expressions take in most programming languages tend to be:

Literals (values)
Variables (which evaluate down to values)
Operations (which chain together expressions and ultimately evaluate down to values)

Defining values gives us a way to represent both literals and variables, but to represent operations we need a way to represent operators as well. Before creating an Expression type, we will define an Operator type like so:

type Operator =
    | Add // Arithmetic operators
    | Sub
    | Mult
    | Div
    | Mod
    | Gt // Comparison operators
    | Lt
    | Gte
    | Lte
    | Eq // Equality operators
    | Neq
    | And // Boolean operators
    | Or
    | Sconcat // String concatenation

And now, we can define an Expr type:

type Expr =
    | Literal of Value
    | Variable of name:string
    // The Expr type is recursive, as operations
    // can consist of expressions
    | Operation of (Expr * Operator * Expr)

Believe it or not, we are almost done building types to represent our AST. All that’s left is a type to represent statements. Because all we are aiming to do with this language is parse and interpret FizzBuzz, we will avoid implementing functions in our language and will only implement some of the most basic imperative constructs:

Printing output
Assigning values to variables
If/Else conditionals
While loops

This will make for a very simple language, but we will be able to do some fun things with it once we’re done. We can represent statements by pairing them with the information they need to execute. E.g. Print requires an expression to output, Set (as in set equal to) requires a variable name and an expression to pair that name with, etc.

type Statement =
    | Print of Expr
    | Set of name:string * value:Expr
    | If of condition:Expr * body:Block * else:Block option
    | While of condition:Expr * body:Block
and Block = Statement list

We have defined a Block type in a mutually recursive fashion with the Statement type to represent lists of statements, which constitute the bodies of ifs and whiles in our AST, as well as our program as a whole.

With these types defined, we can begin writing parsers that will output values of these types, and eventually we can write a program that executes certain instructions based on data that comes in the form of an AST.

Parsing

Values

Now that we have a meaningful way to represent parsed source code, we can begin writing parsers that output values of the types in our AST representation.

I like to start by defining some convenient combinators which I use frequently that FParsec doesn’t provide on its own.

let pword s = pstring s .>> spaces

// A combinator that transforms a parser by requiring that it
// be wrapped in parentheses
let parens p = between (pword "(") (pword ")") p

The .>> operator returns the result of the parser on the left and discards the result of the parser on the right.

Now we can slowly begin implementing parsers for all the things in our AST, beginning with Values which may be integers, strings, or booleans.

let pbool: Parser<Value, Unit> =
    pword "true" <|> pword "false"
    |>> function
        | "true" -> Bool (true)
        | "false" -> Bool (false)

// FParsec defines the pint32 parser.
// We simply cast its result to an int
// then construct an Integer Value from it
let pint: Parser<Value, Unit> = pint32 |>> int |>> Int

The <|> operator is the “or” operator. If the parser on the left fails, it will use the parser on the right instead. If both fail, we get a parsing error. Whether the parsed string is “true” or “false”, we can easily pass it to our value constructor to convert it to a Bool Value. When we want to try several parsers, using the <|> operator can become inconvenient. In those situations, the choice combinator exists, and operates on a list of parsers of the same type. We will see how it’s useful later, but know that we could have just as well done choice [ pword "true" ; pword "false" ] to achieve the same effect as using the “or” operator here.

The |>> operator lets us pipe the result of our parser to a function that normally operates on the type of our result, then returns whatever value it produces into a Parser value. In this case, we have to pipe the result of parsing the string “true” or “false” to a pattern match where we match each string to a corresponding Bool Value.

Note that when we use parser-combinators, we do not actually receive values in their original or “regular” types until we finally run our parser with a variant of the run function defined by FParsec. Until then, you will notice that our values are always in the form Parser<Type, Unit>. This is done in accordance with a functional design pattern that Scott Wlaschin explains more carefully on his site than I can in this post.

let pstringliteral: Parser<Value, Unit> =
    // This line returns a list of chars, which we have to
    // turn into a string before turning into a Str Value
    pchar '\"' >>. manyCharsTill anyChar (pchar '\"')
    |>> string |>> Str
    // Discard the spaces at the end
    .>> spaces

In order to parse any sequence of characters between a pair of quotes, we first parse the opening quote and discard it, then use the manyCharsTill combinator to parse all the remaining characters up until the closing quote (which is discarded).

Note that the >>. operator is very similar to the .>> operator. In this case, the value on the left of the operator is discarded. A good way to think about it is that whatever is on the side of the . is what gets kept. The operators >> and .>>. also exist, for when you only want to discard twice in a row or when you want to keep two parsed values in a row, in which case you’ll receive a tuple.

manyCharsTill takes two parsers. The former is run until the latter is eventually matched, at which point it stops.

Now that we have defined parsers for the three types of values that might be represented in our AST, we want to create a pvalue parser that combines all three of them. To do this, we will use choice:

let pvalue: Parser<Value, Unit> =
    choice [
        pint
        pstringliteral
        pbool
    ]

Running and Testing Parsers

As we work on our parser, it’s a good idea to test the parsers you build individually as you progress. The easiest way to do this is to create a console application that references FParsec and uses the provided run function on a parser and a string. Be aware that it returns a result type, so you can define a test function like this:

let test parser strInput =
    match run parser strInput with
    // Assuming your parser returns something
    // that can be printed. For our purposes,
    // %O is usually enough. 
    | Success (result, _, _) -> printfn "%O" result
    | Failure (error, _, _) -> printfn "%s" error

Also be aware that FParsec’s parsing errors are extremely informative and helpful!

Expressions

Going in the same order in which we defined types in our AST representation, we now want to implement expression parsers that make use of our value parser.

We can easily start by writing a parser for Literals, since all they require is a Value which we can already parse.

let pliteral: Parser<Expr, Unit> = pvalue |>> Literal

Writing a variable parser is slightly more difficult, but the implementation is largely up to us. I have chosen to make valid variable names begin with letters and contain either letters or numbers. We can implement this rule using the many1Satisfy2 combinator which takes two predicates and parses as many characters as possible so long as

The first char must satisfy the first predicate
The remaining chars must satisfy the second predicate

Using this convenient combinator, we can write a variable parser that implements the rule above like so:

let pvariable: Parser<Expr, Unit> =
    many1Satisfy2 (System.Char.IsLetter) (System.Char.IsLetterOrDigit)
    |>> Variable
    .>> spaces

The most difficult and convoluted parser to implement is the operation parser. FParsec provides some convenient utilities to make this task less difficult, but ultimately it is still rather tedious. FParsec includes the OperatorPrecedenceParser class. This allows us to define a parser for each “term” of an operation, define operators themselves according to their symbol, associativity, precedence, and how the parser should handle them, and in return get an operation parser handed to us for a given type. This is extremely convenient, whereas implementing operator parsing on our own would’ve been rather difficult.

Let’s begin configuring an OperatorPrecedenceParser for integers, beginning by initialising one:

let intOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()

Every OperatorPrecedenceParser has an expression parser that gets updated as we further define the OperatorPrecedenceParser by giving it a term parser and operators to operate on those terms. If at any point we want to parse an operation between two integers, we will use this parser.

let intExpr = intOperatorParser.ExpressionParser

We define the parser for an int “term”, that is, something that lies on either side of an integer operation (which may be an integer operation itself, which is why we refer to the intexpr parser in our term parser)

let intTerm = choice [
    pint .>> spaces |>> Literal <|> pvariable
    parens intExpr
]
// Assign the term parser we designed to the
// OperatorPrecedenceParser instance
intOperatorParser.TermParser <- intTerm

The OperatorPrecedenceParser class has an AddOperator method to which we can pass an InfixOperator object (a class also defined by FParsec) that takes as input the following:

A string symbol representing the operator
A parser to run after consuming the operator (usually a white-space parser)
The relative precedence level of the operator as an integer
The associativity of the operator (FParsec provides an Associativity type with constructors Left and Right)
A function that takes two terms and returns a term ('TTerm -> 'TTerm -> 'TTerm)

It can be a little difficult to know how to create every operator we need using this constructor. One thing to note is that the function that the InfixOperator constructor takes, known as a mapping, will not actually apply an operation to two terms in our case. Since we are only parsing here, we will take two Exprs in our mapping and return an Operation, which works because Operations are Exprs, as are variables and literals. This is, in fact, so simple that it does not need to change much at all between operators. We can write a simple function:

let createOperation op x y = Operation (x, op, y)

Which we can later partially apply to an Operator to define our mapping. Here is an example of how creating an InfixOperator might work:


let addition = InfixOperator("+", spaces, 2, Associativity.Left, createOperation Add)
intOperatorParser.AddOperator(addition)

Now, when the intExpr parser is applied to a string like 1 + 3, the AST representation will look something like this:

Operation (
    Literal (Int 1),
    Add,
    Literal (Int 3)
)

Which we can easily work with later in the interpreting phase.

In order to minimise code repetition, we can employ the following pattern to quickly create and add all the operators we need for integers:

type OperatorDetails = {  Symbol: string;
                          Precedence: int;
                          Operator: Operator }

let intOperators = [
    {Symbol = "+"; Precedence = 2; Operator = Add}
    {Symbol = ">"; Precedence = 1; Operator = Gt}
    {Symbol = "<"; Precedence = 1; Operator = Lt}
    {Symbol = ">="; Precedence = 1; Operator = Gte}
    {Symbol = "<="; Precedence = 1; Operator = Lte}
    {Symbol = "=="; Precedence = 1; Operator = Eq}
    {Symbol = "!="; Precedence = 1; Operator = Neq}
    {Symbol = "+"; Precedence = 2; Operator = Add}
    {Symbol = "-"; Precedence = 2; Operator = Sub}
    {Symbol = "*"; Precedence = 3; Operator = Mult}
    {Symbol = "/"; Precedence = 3; Operator = Div}
    {Symbol = "%"; Precedence = 3; Operator = Mod}
]

let addOperators precedenceParser operatorTable =
    operatorTable
    |> List.iter (fun details ->
        let operator =
          InfixOperator(
              details.Symbol,
              spaces,
              details.Precedence,
              Associativity.Left,
              createOperation details.Operator
          )
        precedenceParser.AddOperator(operator))


addOperators intOperatorParser intOperators

We create a small table of operators for integers, then define a function that iterates over the table and adds each operator to its OperatorPrecedenceParser. This is a relatively neat way to do what becomes rather ugly otherwise. We need to do the same thing for strings and booleans, but they have far fewer operations defined over them, so we have slightly less to worry about.

// Define similar structures for booleans and strings
let boolOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()
let boolExpr = boolOperatorParser.ExpressionParser
let boolTerm = choice [
    pboolval .>> spaces |>> Literal
    pvariable
    parens boolExpr
]
boolOperatorParser.TermParser <- boolTerm

let strOperatorParser = OperatorPrecedenceParser<Expr, Unit, Unit>()
let strExpr = strOperatorParser.ExpressionParser
// We want to make sure we can concatenate
// non-string values with strings, so we
// accept any literal or variable
let strTerm = choice [
    pliteral
    pvariable
    intExpr
    boolExpr
    parens strExpr 
]
strOperatorParser.TermParser <- strTerm

Now we define their operator tables and add those operators with List.iter as we did for integers:

let boolOperators = [
    {Symbol = "and"; Precedence = 2; Operator =  And}
    {Symbol = "or"; Precedence = 1; Operator = Or}
]

let stringOperators = [
    {Symbol = "++"; Precedence = 1; Operator = Sconcat}
]

addOperators boolOperatorParser boolOperators
addOperators strOperatorParser stringOperators

Now we have a parser for every type of operation, we can define an operation parser much like we defined a value parser earlier:

let poperation = choice [
    intExpr
    boolExpr
    strExpr
]

And now, since we have a literal parser, a variable parser, and an operation parser, we can define an expression parser:

let pexpression = choice [
    poperation
    pliteral
    pvariable
]

We can finally move on to parsing statements.

Statements

Something to note before we get started implementing statement parsers is that the Statement type is recursive. The Expr type is also recursive, but we handled that with the OperatorPrecedenceParser class given to us by FParsec. The Statement class is recursive because If blocks and While loops must contain statements themselves to be useful. We will find that before we can create a pstatement parser the way we did with pvalue and poperation, we will need access to a pstatement parser for use in defining the parsers for if and while. FParsec offers a solution for this as well, as recursive parsers aren’t awfully uncommon.

The solution is the createParserForwardedToRef function, which takes advantage of F#’s multi-paradigm nature to simplify the handling of recursive parsers. Of course, it’s all abstracted rather well, so we don’t need to worry too much about how it’s implemented to make good use of it. Before we begin implementing statement parsers, we will create the pstatement parser like so:

let pstatement, pstatementref = createParserForwardedToRef<Statement, Unit>()

Later on, when we wish to parse a statement within a statement parser, we can use pstatement.

Let’s parse all our statements in order, starting with print:

let pprint: Parser<Statement, Unit> =
    pword "print"
    >>. parens pexpression
    |>> Print

We’re able to use the >>. operator in conjunction with the parens combinator we defined earlier to parse out the expression within a print(...) statement and pass it to the Print constructor.

Next we’ll implement Set which parses variable assignment:

let pset: Parser<Statement, Unit> =
    let identifier =
        many1Skip2 (System.Char.IsLetter) (System.Char.IsLetterOrDigit)
        .>> pword "="

    identifier .>>. pexpression
    |>> Set

Here we define an identifier parser that parses valid names of variables before an equals sign (which gets discarded), then make use of the .>>. operator to parse both an identifier and an expression and pass the resulting tuple to the Set constructor.

Before we move on, we’ll define a parser called pblock which parses blocks of code between curly braces:

let pblock: Parser<Statement list, Unit> =
    between (pword "{") (pword "}") (many pstatement)

Now we can define parsers for if and while:

let pif: Parser<Statement, Unit> =
    let condition = pword "if" >>. pexpression
    let inner = pblock
    let elseBlock = pword "else" >>. pblock |> opt
    
    pipe3 condition inner elseBlock (fun condition inner elseBlock -> If (condition, inner, elseBlock))

In the if parser, we first parse a condition by parsing out the “if” then disposing of it, and keeping the valid expression that follows. We use pblock to parse the body of the if, and we use FParsec’s opt to optionally parse an else block as well. If there is no else block, this parser returns None which we pass to the If constructor. The parser gets a little ugly toward the end because we need to use pipe3. Unfortunately there is no operator in FParsec that conveniently performs the same task as .>>. but returning a triple rather than a tuple. Instead, we must use pipe3 which applies 3 given parsers in order, and then pipes their results to a function that takes 3 inputs. Here we use pipe3 to pipe the condition, if body, and potentially existing else block to the If constructor.

Thankfully, the while parser is much nicer:

let pwhile: Parser<Statement, Unit> =
    let condition = pword "while" >>. pexpression
    
    condition .>>. pblock
    |>> While

We use the same parsing strategy we employed for parsing the condition in the if parser here, while using the same .>>. strategy for piping to the constructor that we used for pset.

We have now defined all of our statement parsers, but before we can progress, we must complete one last step required by the pstatement parser that was forwarded to a reference cell. Similar to how we created the pvalue and pexpression parsers, we assign a choice of the statement parsers we’ve written to the pstatementref that was created when we initialised the pstatement parser:

do pstatementref := choice [
    pprint
    pif
    pwhile
    pset
]

Do not worry if this step confuses you. This quote from the FParsec documentation’s JSON parsing tutorial helps demystify it a little bit:

The grammar rules for JSON lists and objects are recursive, because any list or object can contain itself any kind of JSON value. Hence, in order to write parsers for the list and object grammar rules, we need a way to refer to the parser for any kind of JSON value, even though we haven’t yet constructed this parser. Like it is so often in computing, we can solve this problem by introducing an extra indirection:
let jvalue, jvalueRef = createParserForwardedToRef<Json, unit>()
As you might have guessed from the name, createParserForwardedToRef creates a parser (jvalue) that forwards all invocations to the parser in a reference cell (jvalueRef). Initially, the reference cell holds a dummy parser, but since the reference cell is mutable, we can later replace the dummy parser with the actual value parser, once we have finished constructing it.

You can read the full tutorial here.

Tying it All Together

At this point we have implemented parsers for everything that appears in the set of types we created to represent our language’s AST. This means that we can write programs in the language we specified and run them through our parser to get an AST as output. Later on, we can use this generated AST to map code in the source language to F# code to implement an interpreter. For now, let’s write a testing function that will allow us to see how far we’ve come.

As explained at an earlier point, FParsec provides a run function and variants of it for running parsers. We specifically want to run a parser on a file, and what we want to parse is as many statements as we can. FParsec provides a runParserOnFile function, and we have built a statement parser which we can combine with the many combinator to achieve the effect we’re looking for:

let parseSourceFile fpath =
    match runParserOnFile (many pstatement) () fpath System.Text.Encoding.UTF8 with
    | Success (result, _, _) -> printfn "%A" result
    | Failure (error, _, _) -> printfn "%s" error

Now we need a file to test this on. As promised, a FizzBuzz implementation will do:

i = 1

while i <= 100 {
    if i % 15 == 0 {
        print("fizzbuzz")
    } else {
        if i % 3 == 0 {
            print("fizz")
        } else {
            if i % 5 == 0 {
                print("buzz")
            } else {
                print(i)
            }
        }
    }

    i = i + 1
}

Running your parser on this file should result in something like the following:

[Set ("i", Literal (Int 1));
 While
   (Operation (Variable "i", Lte, Literal (Int 100)),
    [If
       (Operation
          (Operation (Variable "i", Mod, Literal (Int 15)), Eq, Literal (Int 0)),
        [Print (Literal (Str "fizzbuzz"))],
        Some
          [If
             (Operation
                (Operation (Variable "i", Mod, Literal (Int 3)), Eq,
                 Literal (Int 0)), [Print (Literal (Str "fizz"))],
              Some
                [If
                   (Operation
                      (Operation (Variable "i", Mod, Literal (Int 5)), Eq,
                       Literal (Int 0)), [Print (Literal (Str "buzz"))],
                    Some [Print (Variable "i")])])]);
     Set ("i", Operation (Variable "i", Add, Literal (Int 1)))])]

Excitingly enough, this is an AST that works for us. It properly represents what’s going on in the code, and next we can map it to real F# code so that it can properly function.

Conclusion

This blog post was written for F# Advent 2020 and it marks my first time participating in what I believe is a wonderful tradition. I hope that those who read it find it helpful and interesting, and that those who are itching for more will anticipate the next post in the series in which we’ll be implementing an interpreter to go along with this parser. If you’d like to see me implement the interpreter live, I regularly stream what I’m programming to Twitch with nothing but videogame music as background noise, so you can easily tune in while doing other things. I appreciate the F# community and the work they’ve put into making the language and its ecosystem better, and I’m more than happy to be contributing to it with this post!

Errata

It’s more than likely that I’ve made an error here or there throughout this post, and if you find one I encourage you to reach out to me so I can fix it and prevent future readers from being frustrated or mislead! You can contact me at ambika.eshwar@rutgers.edu or create an issue on the GitHub repository containing the source code for this tutorial, which will be available soon! Thanks to those of you who take the time to read this post, and thanks again to the wonderful F# community who make F# nicer to use by the day by organising communal efforts like F# advent!