My book on iOS interface design, Design Teardowns: Step-by-step iOS interface design walkthroughs is now available!
Writing a parser in Swift (I)
This is part two to Writing a lexer in Swift. If you haven't caught that, you might get more context following that first.
In this post, we'll look at how to build an Abstract Syntax Tree from a series of tokens.
Firstly, we'll look at how to construct a grammar that describes what is allowed in the language. One of the useful tools we have is the Context-free Grammar. This allows us to describe the language as a set of rules called productions. In fact, the Swift Reference describes the language in this form!
Here's an example:
Expression -> Identifier | Number
Production rules allow us to replace symbols on the left (which by definition are non-terminal) by a sequences of symbols of the right (which may or may not be terminal symbols.) This particular rule says that an Expression
symbol can be replaced by either an Identifier
symbol or a Number
symbol. For our purposes, these symbols are equivalent to tokens returned from our lexer.
Of course, production rules in practice are never so simple. Recall the following expression from the previous post:
x + y * 2 + (4 + 5) / 3
There's no way we can describe this using the above production rule. Let's try a slightly more exhaustive version:
Expression -> Identifier | Number | ( Expression )
This version of the production rule lets us deal with parenthesised expressions, but we're still missing operators.
Expression ->
PrimaryExpression Operator Expression | PrimaryExpression
PrimaryExpression -> Identifier | Number | ( Expression )
Another feature of production rules in Context-free Grammars is that they can be recursive. This allows us to describe more complex language features (especially arithmetic) which we are familiar with in common programming languages.
We now have a production rule which describes a PrimaryExpression
which can be either an Identifier
, a Number
or a parenthesised Expression
. This is in turn used to describe Expression
which may or may not include Operator
s with PrimaryExpression
s.
This is probably enough theory. Let's take a look at how we can implement this parser in code.
class Parser {
let tokens: [Token]
var index = 0
init(tokens: [Token]) {
self.tokens = tokens
}
var tokensAvailable: Bool {
return index < tokens.count
}
func peekCurrentToken() -> Token {
return tokens[index]
}
func popCurrentToken() -> Token {
return tokens[index++]
}
}
We initialise the parser with a list of Token
s. This essentially sets up the initial state. Next, we have a couple of convenient helper methods which allow us to inspect tokens. The method func peekCurrentToken()
returns the next available token. This is commonly known as look-ahead in parser terminology.
The method func popCurrentToken()
also returns the current token but moves the index
pointer one over, "consuming" the token. As we build our AST, we will generate tree nodes and consume the corresponding tokens as we go.
Let's define our nodes:
protocol ExprNode: CustomStringConvertible {
}
Firstly, we have a base protocol ExprNode
which allows us to describe nodes in Expression
s. For convenience, we also inherit CustomStringConvertible
so we can have nice print()
s when debugging.
struct NumberNode: ExprNode {
let value: Float
var description: String {
return "NumberNode(\(value))"
}
}
struct VariableNode: ExprNode {
let name: String
var description: String {
return "VariableNode(\(name))"
}
}
For our first 2 nodes, we have NumberNode
and VariableNode
. As you might expect, NumberNode
corresponds with the Token.Number
and VariableNode
with Token.Identifier
. We could have simply used our Token
enumeration as nodes in our tree but you will soon see that tree nodes also include more context than there is in our lexical tokens.
struct BinaryOpNode: ExprNode {
let op: String
let lhs: ExprNode
let rhs: ExprNode
var description: String {
return "BinaryOpNode(\(op), lhs: \(lhs), rhs: \(rhs))"
}
}
The next node, BinaryOpNode
, describes the infix operation. It has lhs
, rhs
and op
as properties which correspond to the left-hand side, right-hand side and operator in a binary operation. Now it's evident why we created the ExprNode
protocol. Either sides of the operation can contain NumberNode
, VariableNode
or BinaryOpNode
(or future nodes we might introduce.)
With this, we can describe the complex expression we had at the beginning (repeated below):
x + y * 2 + (4 + 5) / 3
Here's how we parse Token.Number
and Token.Identifier
:
enum ParseError: ErrorType {
case ExpectedNumber
case ExpectedIdentifier
}
func parseNumber() throws -> ExprNode {
guard case let Token.Number(value) = popCurrentToken() else {
throw ExpectedNumber
}
return NumberNode(value: value)
}
func parseIdentifier() throws -> ExprNode {
guard case let Token.Identifier(name) = popCurrentToken() else {
throw ParseError.ExpectedIdentifier
}
return VariableNode(name: name)
}
We read the current token (consuming it), check if it's what we expect, and return the corresponding node.
Next, we'll first show the declaration for parsing the Expression
symbol because we need to use this recursively (also known as Recursive Descent Parsing.) We'll fill this out later:
func parseExpression() throws -> ExprNode
Now let's try to parse the parenthesised Expression
:
enum ParseError: ErrorType {
...
case ExpectedCharacter(Character)
}
func parseParens() throws -> ExprNode {
guard case Token.ParensOpen = popCurrentToken() else {
throw ParseError.ExpectedCharacter("(")
}
let exp = try parseExpression()
guard case Token.ParensClose = popCurrentToken() else {
throw ParseError.ExpectedCharacter(")")
}
return exp
}
With this, we have all the ingredients we need to write func parsePrimary()
:
enum ParseError: ErrorType {
...
case ExpectedExpression
}
func parsePrimary() throws -> ExprNode {
switch (peekCurrentToken()) {
case .Identifier:
return try parseIdentifier()
case .Number:
return try parseNumber()
case .ParensOpen:
return try parseParens()
default:
throw ParseError.ExpectedExpression
}
}
If we create a parse tree for the expression in terms of binary operations naively, you might notice we could potentially create different versions of the tree with any of the operators at the root node. Of course, mathematically not all versions would be deemed correct. We need a way to encode our mathematical rules about which operators act first (also known as the order of operations.)
This can be done with Operator Precedence Parsing, in which we give each operator a precedence (read priority level.)
let operatorPrecedence: [String: Int] = [
"+": 20,
"-": 20,
"*": 40,
"/": 40
]
enum ParseError: ErrorType {
...
case UndefinedOperator(String)
}
func getCurrentTokenPrecedence() throws -> Int {
guard tokensAvailable else {
return -1
}
guard case let Token.Other(op) = peekCurrentToken() else {
return -1
}
guard let precedence = operatorPrecedence[op] else {
throw ParseError.UndefinedOperator(op)
}
return precedence
}
Let's look at the code and then explain how it works:
enum ParseError: ErrorType {
...
case ExpectedOperator
}
func parseBinaryOp(node: ExprNode, exprPrecedence: Int = 0) throws -> ExprNode {
var lhs = node
while true {
let tokenPrecedence = try getCurrentTokenPrecedence()
if tokenPrecedence < exprPrecedence {
return lhs
}
guard case let Token.Other(op) = popCurrentToken() else {
throw ParseError.ExpectedOperator
}
var rhs = try parsePrimary()
let nextPrecedence = try getCurrentTokenPrecedence()
if tokenPrecedence < nextPrecedence {
rhs = try parseBinaryOp(rhs, exprPrecedence: tokenPrecedence+1)
}
lhs = BinaryOpNode(op: op, lhs: lhs, rhs: rhs)
}
}
We begin with a node which is potentially the beginning of a binary operation and repeat the following:
- Read the next token;
- If it's not an operator, we're done
- If it has lower precedence than the current one, we're done (it would have been handled earlier in the call stack)
- Consume and read the next
PrimaryExpression
asrhs
- If the operator after has a higher precedence, recursively make a sub tree starting at
rhs
at the next precedence level - Form a sub-tree with the current operator
The intuition behind this is that lower precedence operators tend to appear closer to the root of the tree because higher precedence operators form sub-trees with ever higher precedence operators.
Finally, we can finish func parseExpression()
and you will see it's rather trivial:
func parseExpression() throws -> ExprNode {
let node = try parsePrimary()
return try parseBinaryOp(node)
}
Now we can make a call to our parser:
let parser = Parser(tokens: tokens)
do {
print(try parser.parseExpression())
}
catch {
print(error)
}
And we should get the following output (with indention added):
BinaryOpNode(
+,
lhs: BinaryOpNode(
+,
lhs: VariableNode(x),
rhs: BinaryOpNode(
*,
lhs: VariableNode(y),
rhs: NumberNode(2.0)
)
),
rhs: BinaryOpNode(
/,
lhs: BinaryOpNode(
+,
lhs: NumberNode(4.0),
rhs: NumberNode(5.0)
),
rhs: NumberNode(3.0)
)
)
Here's a more beautiful version of that tree with ASCII art:
+
___|___
/ \
+ (/)
/ \ / \
x * + 3
/ \ / \
y 2 4 5
Once again here's what we were trying to parse:
x + y * 2 + (4 + 5) / 3
Because this post is getting rather long, we'll examine other nodes in the AST (such as function calls and declarations) in the next post! You can find all this code on Github. Stay tuned...