
In this series, we'll attempt to write parts of a compiler for a simplified language, Kaleidoscope. If you're interested in following along in C++ or Objective Caml, you can find the original tutorial here on the LLVM website.

The goal here is not to write the most efficient implementation, but to show how we can leverage features in Swift to keep the code easy to understand.

A compiler can typically be thought of as a series of modular steps including lexical analysis, parsing, semantic analysis, optimisation and code generation.

In the lexical analysis phase, we simply try to break up the input (source code) into small units called lexemes. Each of these units carries a specific meaning, which lets us categorise them into groups of tokens.

Consider the following code snippet:

```
# Compute the x'th fibonacci number.
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
```

If you've seen any sort of programming language before, you'll immediately recognise keywords such as `def`, `if`, `then`, and `else`. In addition, you might also notice that `fib` and `x` are identifiers. Even though they are not the same string, we see that they are both names for things (namely, functions and variables).
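One way to picture the keyword/identifier distinction: the lexer first matches a run of letters, then checks whether that run is a reserved word. A small sketch (the `classify` helper and the keyword set here are illustrative, not part of the lexer we build below):

```swift
// Classify a matched run of letters as a keyword or an identifier.
// The keyword set is a sketch; extend it as the language grows.
let keywords: Set<String> = ["def", "if", "then", "else"]

func classify(word: String) -> String {
    return keywords.contains(word) ? "keyword" : "identifier"
}

classify(word: "def")   // "keyword"
classify(word: "fib")   // "identifier"
```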

Here's a simpler snippet:

```
def foo(x, y)
  x + y * 2 + (4 + 5) / 3

foo(3, 4)
```

Our lexer should be able to scan through code and return a list of tokens that describe it. At this stage, it's not necessary to interpret the code in any way. We just want to identify different parts of the source and label them.

Let's jump in and try to determine what tokens we need to describe the language:

```swift
enum Token {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}
```

Because each group of lexemes, or token, has a clear purpose, we can represent them with an `enum` in Swift. Not all tokens are equal, however. Certain tokens, like `Identifier` and `Number`, also carry additional information as associated values.
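Because the associated values travel with each case, pattern matching later recovers them. A quick sketch (the `describe` helper is just for illustration; the enum is redeclared so the snippet stands alone):

```swift
enum Token {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}

// Pattern matching extracts the associated value where one exists.
func describe(token: Token) -> String {
    switch token {
    case .Define:               return "define"
    case .Identifier(let name): return "identifier(\(name))"
    case .Number(let value):    return "number(\(value))"
    case .ParensOpen:           return "("
    case .ParensClose:          return ")"
    case .Comma:                return ","
    case .Other(let s):         return "other(\(s))"
    }
}

describe(token: .Identifier("foo"))  // "identifier(foo)"
describe(token: .Number(2.0))        // "number(2.0)"
```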

Next, we'll try to write a program to turn the above snippet into the following output:

```swift
[
    .Define,
    .Identifier("foo"),
    .ParensOpen,
    .Identifier("x"),
    .Comma,
    .Identifier("y"),
    .ParensClose,
    .Identifier("x"),
    .Other("+"),
    .Identifier("y"),
    .Other("*"),
    .Number(2.0),
    .Other("+"),
    .ParensOpen,
    .Number(4.0),
    .Other("+"),
    .Number(5.0),
    .ParensClose,
    .Other("/"),
    .Number(3.0),
    .Identifier("foo"),
    .ParensOpen,
    .Number(3.0),
    .Comma,
    .Number(4.0),
    .ParensClose,
]
```

One way we can try to do this is via regular expressions. Each token can be recognised by the characters it is made of. For instance, the `Define` token is always the characters `def` in that order. `Identifier`s, however, follow a rule: any sequence of alphanumeric characters that does not begin with a digit.
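To see the identifier rule in action, we can test the pattern directly against `NSRegularExpression` (this standalone sketch uses the current Foundation method names; `firstMatch` is an illustrative helper, not part of the lexer below):

```swift
import Foundation

// The identifier pattern: a letter first, then letters or digits.
// Anchoring with ^ ensures we only match at the start of the input.
let identifier = try! NSRegularExpression(pattern: "^[a-zA-Z][a-zA-Z0-9]*")

func firstMatch(in input: String) -> String? {
    let range = identifier.rangeOfFirstMatch(
        in: input, options: [],
        range: NSRange(location: 0, length: input.utf16.count))
    guard range.location != NSNotFound else { return nil }
    return (input as NSString).substring(with: range)
}

firstMatch(in: "fib(40)")  // "fib"
firstMatch(in: "42fib")    // nil — identifiers cannot start with a digit
```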

For each token, we'll write a regular expression capable of matching its corresponding lexeme. Next, we need to generate the `enum` that matches the token. We can put this all together rather concisely as an array of tuples.

The first parameter in the tuple represents the regular expression we want to match at the beginning of the context and the second parameter is a closure that will generate the relevant token enumeration.

```swift
typealias TokenGenerator = (String) -> Token?
let tokenList: [(String, TokenGenerator)] = [
    ("[ \t\n]", { _ in nil }),
    ("[a-zA-Z][a-zA-Z0-9]*", { $0 == "def" ? .Define : .Identifier($0) }),
    ("[0-9.]+", { (r: String) in .Number((r as NSString).floatValue) }),
    ("\\(", { _ in .ParensOpen }),
    ("\\)", { _ in .ParensClose }),
    (",", { _ in .Comma }),
]
```

The first rule captures whitespace, which can be spaces, `\t` tabs, or `\n` line breaks. Its generator returns `nil`, so whitespace is matched and consumed without producing a token.
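We can sanity-check the generators in isolation before wiring up the lexer. A small sketch mirroring two rules from the table (the `Token` enum is redeclared, trimmed, and marked `Equatable` here only so the snippet stands alone):

```swift
enum Token: Equatable {
    case Define
    case Identifier(String)
}

// Mirrors the word rule from tokenList: "def" is a keyword,
// anything else that matches the pattern is an identifier.
let word: (String) -> Token? = { $0 == "def" ? Token.Define : Token.Identifier($0) }

// Mirrors the whitespace rule: the lexeme is matched and discarded.
let whitespace: (String) -> Token? = { _ in nil }

word("def")      // .Define
word("fib")      // .Identifier("fib")
whitespace(" ")  // nil
```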

Finally, we are ready to write our lexer. We'll try to match against any of the rules in our `tokenList`, failing which we will just assign the character to `.Other`.

Here's the method of interest:

```swift
func tokenize(input: String) -> [Token] {
    var tokens = [Token]()
    var content = input

    while (content.characters.count > 0) {
        var matched = false

        for (pattern, generator) in tokenList {
            if let m = content.match(pattern) {
                if let t = generator(m) {
                    tokens.append(t)
                }

                // Consume the matched lexeme before continuing.
                content = content.substringFromIndex(content.startIndex.advancedBy(m.characters.count))
                matched = true
                break
            }
        }

        if !matched {
            // No rule matched: emit a single character as .Other.
            let index = content.startIndex.advancedBy(1)
            tokens.append(.Other(content.substringToIndex(index)))
            content = content.substringFromIndex(index)
        }
    }
    return tokens
}
```
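Putting the pieces together, here is a condensed, self-contained sketch of the whole lexer that you can drop into a playground. It uses current Foundation and `String` API names rather than the Swift 2-era ones above, and `matchPrefix` stands in for the `match` extension described later; the logic is otherwise the same:

```swift
import Foundation

enum Token: Equatable {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}

typealias TokenGenerator = (String) -> Token?
let tokenList: [(String, TokenGenerator)] = [
    ("[ \t\n]", { _ in nil }),
    ("[a-zA-Z][a-zA-Z0-9]*", { $0 == "def" ? Token.Define : Token.Identifier($0) }),
    ("[0-9.]+", { (r: String) in Token.Number((r as NSString).floatValue) }),
    ("\\(", { _ in Token.ParensOpen }),
    ("\\)", { _ in Token.ParensClose }),
    (",", { _ in Token.Comma }),
]

// Match `pattern` at the start of `input`; return the matched prefix.
func matchPrefix(_ pattern: String, in input: String) -> String? {
    let regex = try! NSRegularExpression(pattern: "^\(pattern)")
    let range = regex.rangeOfFirstMatch(
        in: input, options: [],
        range: NSRange(location: 0, length: input.utf16.count))
    guard range.location != NSNotFound else { return nil }
    return (input as NSString).substring(with: range)
}

func tokenize(_ input: String) -> [Token] {
    var tokens = [Token]()
    var content = input

    while !content.isEmpty {
        var matched = false

        for (pattern, generator) in tokenList {
            if let m = matchPrefix(pattern, in: content) {
                if let t = generator(m) {
                    tokens.append(t)
                }
                content = String(content.dropFirst(m.count))
                matched = true
                break
            }
        }

        if !matched {
            tokens.append(.Other(String(content.first!)))
            content = String(content.dropFirst())
        }
    }
    return tokens
}

tokenize("foo(3, 4)")
// [.Identifier("foo"), .ParensOpen, .Number(3.0), .Comma, .Number(4.0), .ParensClose]
```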

That's it for now! If you want to take a look at more code, this is all on GitHub. In the next post, we'll look at parsing and how we can build the AST.

To keep things simple we used a `String` extension to provide regex functionality via `func match(String) -> String?`. This is simply syntactic sugar over the `NSRegularExpression` API in `Foundation`:

```swift
var expressions = [String: NSRegularExpression]()

public extension String {
    public func match(regex: String) -> String? {
        let expression: NSRegularExpression
        if let exists = expressions[regex] {
            expression = exists
        } else {
            expression = try! NSRegularExpression(pattern: "^\(regex)", options: [])
            expressions[regex] = expression
        }

        let range = expression.rangeOfFirstMatchInString(self, options: [], range: NSMakeRange(0, self.utf16.count))
        if range.location != NSNotFound {
            return (self as NSString).substringWithRange(range)
        }
        return nil
    }
}
```

Matthew Cheok
Mobile Development & Interaction Design