My book on iOS interface design, Design Teardowns: Step-by-step iOS interface design walkthroughs is now available!

Writing a parser in Swift (I)

Saturday, November 21, 2015

This is part two to Writing a lexer in Swift. If you haven't caught that, you might get more context following that first.

In this post, we'll look at how to build an Abstract Syntax Tree from a series of tokens.

Firstly, we'll look at how to construct a grammar that describes what is allowed in the language. One of the useful tools we have is the Context-free Grammar. This allows us to describe the language as a set of rules called productions. In fact, the Swift Reference describes the language in this form!

Here's an example:

Expression -> Identifier | Number

Production rules allow us to replace symbols on the left (which by definition are non-terminal) by a sequences of symbols of the right (which may or may not be terminal symbols.) This particular rule says that an Expression symbol can be replaced by either an Identifier symbol or a Number symbol. For our purposes, these symbols are equivalent to tokens returned from our lexer.

Of course, production rules in practice are never so simple. Recall the following expression from the previous post:

x + y * 2 + (4 + 5) / 3

There's no way we can describe this using the above production rule. Let's try a slightly more exhaustive version:

Expression -> Identifier | Number | ( Expression )

This version of the production rule lets us deal with parenthesised expressions, but we're still missing operators.

Expression ->  
  PrimaryExpression Operator Expression | PrimaryExpression
PrimaryExpression -> Identifier | Number | ( Expression )

Another feature of production rules in Context-free Grammars is that they can be recursive. This allows us to describe more complex language features (especially arithmetic) which we are familiar with in common programming languages.

We now have a production rule which describes a PrimaryExpression which can be either an Identifier, a Number or a parenthesised Expression. This is in turn used to describe Expression which may or may not include Operators with PrimaryExpressions.

This is probably enough theory. Let's take a look at how we can implement this parser in code.

class Parser {  
    let tokens: [Token]
    var index = 0

    init(tokens: [Token]) {
        self.tokens = tokens
    }

    var tokensAvailable: Bool {
        return index < tokens.count
    }

    func peekCurrentToken() -> Token {
        return tokens[index]
    }

    func popCurrentToken() -> Token {
        return tokens[index++]
    }
}

We initialise the parser with a list of Tokens. This essentially sets up the initial state. Next, we have a couple of convenient helper methods which allow us to inspect tokens. The method func peekCurrentToken() returns the next available token. This is commonly known as look-ahead in parser terminology.

The method func popCurrentToken() also returns the current token but moves the index pointer one over, "consuming" the token. As we build our AST, we will generate tree nodes and consume the corresponding tokens as we go.

Let's define our nodes:

protocol ExprNode: CustomStringConvertible {  
}

Firstly, we have a base protocol ExprNode which allows us to describe nodes in Expressions. For convenience, we also inherit CustomStringConvertible so we can have nice print()s when debugging.

struct NumberNode: ExprNode {  
    let value: Float
    var description: String {
        return "NumberNode(\(value))"
    }
}

struct VariableNode: ExprNode {  
    let name: String
    var description: String {
        return "VariableNode(\(name))"
    }
}

For our first 2 nodes, we have NumberNode and VariableNode. As you might expect, NumberNode corresponds with the Token.Number and VariableNode with Token.Identifier. We could have simply used our Token enumeration as nodes in our tree but you will soon see that tree nodes also include more context than there is in our lexical tokens.

struct BinaryOpNode: ExprNode {  
    let op: String
    let lhs: ExprNode
    let rhs: ExprNode
    var description: String {
        return "BinaryOpNode(\(op), lhs: \(lhs), rhs: \(rhs))"
    }
}

The next node, BinaryOpNode, describes the infix operation. It has lhs, rhs and op as properties which correspond to the left-hand side, right-hand side and operator in a binary operation. Now it's evident why we created the ExprNode protocol. Either sides of the operation can contain NumberNode, VariableNode or BinaryOpNode (or future nodes we might introduce.)

With this, we can describe the complex expression we had at the beginning (repeated below):

x + y * 2 + (4 + 5) / 3

Here's how we parse Token.Number and Token.Identifier:

enum ParseError: ErrorType {  
    case ExpectedNumber
    case ExpectedIdentifier
}

func parseNumber() throws -> ExprNode {  
    guard case let Token.Number(value) = popCurrentToken() else {
        throw ExpectedNumber
    }
    return NumberNode(value: value)
}

func parseIdentifier() throws -> ExprNode {  
    guard case let Token.Identifier(name) = popCurrentToken() else {
        throw ParseError.ExpectedIdentifier
    }
    return VariableNode(name: name)
}

We read the current token (consuming it), check if it's what we expect, and return the corresponding node.

Next, we'll first show the declaration for parsing the Expression symbol because we need to use this recursively (also known as Recursive Descent Parsing.) We'll fill this out later:

func parseExpression() throws -> ExprNode

Now let's try to parse the parenthesised Expression:

enum ParseError: ErrorType {  
    ...
    case ExpectedCharacter(Character)
}

func parseParens() throws -> ExprNode {  
    guard case Token.ParensOpen = popCurrentToken() else {
        throw ParseError.ExpectedCharacter("(")
    }

    let exp = try parseExpression()

    guard case Token.ParensClose = popCurrentToken() else {
        throw ParseError.ExpectedCharacter(")")
    }

    return exp
}

With this, we have all the ingredients we need to write func parsePrimary():

enum ParseError: ErrorType {  
    ...
    case ExpectedExpression
}

func parsePrimary() throws -> ExprNode {  
    switch (peekCurrentToken()) {
    case .Identifier:
        return try parseIdentifier()
    case .Number:
        return try parseNumber()
    case .ParensOpen:
        return try parseParens()
    default:
        throw ParseError.ExpectedExpression
    }
}

If we create a parse tree for the expression in terms of binary operations naively, you might notice we could potentially create different versions of the tree with any of the operators at the root node. Of course, mathematically not all versions would be deemed correct. We need a way to encode our mathematical rules about which operators act first (also known as the order of operations.)

This can be done with Operator Precedence Parsing, in which we give each operator a precedence (read priority level.)

let operatorPrecedence: [String: Int] = [  
    "+": 20,
    "-": 20,
    "*": 40,
    "/": 40
]

enum ParseError: ErrorType {  
    ...
    case UndefinedOperator(String)
}

func getCurrentTokenPrecedence() throws -> Int {  
    guard tokensAvailable else {
        return -1
    }

    guard case let Token.Other(op) = peekCurrentToken() else {
        return -1
    }

    guard let precedence = operatorPrecedence[op] else {
        throw ParseError.UndefinedOperator(op)
    }

    return precedence
}

Let's look at the code and then explain how it works:

enum ParseError: ErrorType {  
    ...
    case ExpectedOperator
}

func parseBinaryOp(node: ExprNode, exprPrecedence: Int = 0) throws -> ExprNode {  
    var lhs = node
    while true {
        let tokenPrecedence = try getCurrentTokenPrecedence()
        if tokenPrecedence < exprPrecedence {
            return lhs
        }

        guard case let Token.Other(op) = popCurrentToken() else {
            throw ParseError.ExpectedOperator
        }

        var rhs = try parsePrimary()
        let nextPrecedence = try getCurrentTokenPrecedence()

        if tokenPrecedence < nextPrecedence {
            rhs = try parseBinaryOp(rhs, exprPrecedence: tokenPrecedence+1)
        }
        lhs = BinaryOpNode(op: op, lhs: lhs, rhs: rhs)
    }
}

We begin with a node which is potentially the beginning of a binary operation and repeat the following:

Read the next token;
If it's not an operator, we're done
If it has lower precedence than the current one, we're done (it would have been handled earlier in the call stack)
Consume and read the next PrimaryExpression as rhs
If the operator after has a higher precedence, recursively make a sub tree starting at rhs at the next precedence level
Form a sub-tree with the current operator

The intuition behind this is that lower precedence operators tend to appear closer to the root of the tree because higher precedence operators form sub-trees with ever higher precedence operators.

Finally, we can finish func parseExpression() and you will see it's rather trivial:

func parseExpression() throws -> ExprNode {  
    let node = try parsePrimary()
    return try parseBinaryOp(node)
}

Now we can make a call to our parser:

let parser = Parser(tokens: tokens)  
do {  
    print(try parser.parseExpression())
}
catch {  
    print(error)
}

And we should get the following output (with indention added):

BinaryOpNode(  
  +, 
  lhs: BinaryOpNode(
    +, 
    lhs: VariableNode(x), 
    rhs: BinaryOpNode(
      *, 
      lhs: VariableNode(y), 
      rhs: NumberNode(2.0)
      )
    ), 
  rhs: BinaryOpNode(
    /, 
    lhs: BinaryOpNode(
      +, 
      lhs: NumberNode(4.0), 
      rhs: NumberNode(5.0)
      ), 
    rhs: NumberNode(3.0)
    )
  )

Here's a more beautiful version of that tree with ASCII art:

        +
     ___|___
   /         \
  +          (/)
 / \         / \
x   *       +   3  
   / \     / \
  y   2   4   5

Once again here's what we were trying to parse:

x + y * 2 + (4 + 5) / 3

Because this post is getting rather long, we'll examine other nodes in the AST (such as function calls and declarations) in the next post! You can find all this code on Github. Stay tuned...