
In this series, we'll attempt to write parts of a compiler for a simplified language, Kaleidoscope. If you're interested in following along in C++ or Objective Caml, you can find the original tutorial here on the LLVM website.

The goal here is not to write the most efficient implementation, but to show how we can leverage features in Swift to keep the code easy to understand.

A compiler can typically be thought of as a series of modular steps including lexical analysis, parsing, semantic analysis, optimisation and code generation.

In the lexical analysis phase, we simply try to break up the input (source code) into small units called lexemes. Each of these units carries a specific meaning, which lets us categorise them into groups of tokens.

Consider the following code snippet:

```
# Compute the x'th fibonacci number.
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
```

If you've seen any sort of programming language before, you'll immediately recognise keywords such as `def`, `if`, `then`, and `else`. In addition, you might also notice that `fib` and `x` are identifiers. Even though they are not the same string, we see that they are both names for things (namely, functions and variables).
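One way to picture the keyword/identifier distinction: the lexer first matches a run of letters, then checks whether that run is a reserved word. A small sketch (the `classify` helper and the keyword set here are illustrative, not part of the lexer we build below):

```swift
// Classify a matched run of letters as a keyword or an identifier.
// The keyword set is a sketch; extend it as the language grows.
let keywords: Set<String> = ["def", "if", "then", "else"]

func classify(word: String) -> String {
    return keywords.contains(word) ? "keyword" : "identifier"
}

classify(word: "def")   // "keyword"
classify(word: "fib")   // "identifier"
```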

Here's a simpler snippet:

```
def foo(x, y)
  x + y * 2 + (4 + 5) / 3

foo(3, 4)
```

Our lexer should be able to scan through code and return a list of tokens that describe it. At this stage, it's not necessary to interpret the code in any way. We just want to identify different parts of the source and label them.

Let's jump in and try to determine what tokens we need to describe the language:

```swift
enum Token {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}
```

Because each group of lexemes, or token, has a clear purpose, we can represent them with an `enum` in Swift. Not all tokens are equal, however. Certain tokens, like `Identifier` and `Number`, also carry additional information as associated values.
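Because the associated values travel with each case, pattern matching later recovers them. A quick sketch (the `describe` helper is just for illustration; the enum is redeclared so the snippet stands alone):

```swift
enum Token {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}

// Pattern matching extracts the associated value where one exists.
func describe(token: Token) -> String {
    switch token {
    case .Define:               return "define"
    case .Identifier(let name): return "identifier(\(name))"
    case .Number(let value):    return "number(\(value))"
    case .ParensOpen:           return "("
    case .ParensClose:          return ")"
    case .Comma:                return ","
    case .Other(let s):         return "other(\(s))"
    }
}

describe(token: .Identifier("foo"))  // "identifier(foo)"
describe(token: .Number(2.0))        // "number(2.0)"
```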

Next, we'll try to write a program to turn the above snippet into the following output:

```swift
[
    .Define,
    .Identifier("foo"),
    .ParensOpen,
    .Identifier("x"),
    .Comma,
    .Identifier("y"),
    .ParensClose,
    .Identifier("x"),
    .Other("+"),
    .Identifier("y"),
    .Other("*"),
    .Number(2.0),
    .Other("+"),
    .ParensOpen,
    .Number(4.0),
    .Other("+"),
    .Number(5.0),
    .ParensClose,
    .Other("/"),
    .Number(3.0),
    .Identifier("foo"),
    .ParensOpen,
    .Number(3.0),
    .Comma,
    .Number(4.0),
    .ParensClose,
]
```

One way we can try to do this is via regular expressions. Each token can be recognised by the characters it is made of. For instance, the `Define` token is always the characters `def` in that order. `Identifier`s, however, follow a rule: any sequence of alphanumeric characters that does not begin with a digit.
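To see the identifier rule in action, we can test the pattern directly against `NSRegularExpression` (this standalone sketch uses the current Foundation method names; `firstMatch` is an illustrative helper, not part of the lexer below):

```swift
import Foundation

// The identifier pattern: a letter first, then letters or digits.
// Anchoring with ^ ensures we only match at the start of the input.
let identifier = try! NSRegularExpression(pattern: "^[a-zA-Z][a-zA-Z0-9]*")

func firstMatch(in input: String) -> String? {
    let range = identifier.rangeOfFirstMatch(
        in: input, options: [],
        range: NSRange(location: 0, length: input.utf16.count))
    guard range.location != NSNotFound else { return nil }
    return (input as NSString).substring(with: range)
}

firstMatch(in: "fib(40)")  // "fib"
firstMatch(in: "42fib")    // nil — identifiers cannot start with a digit
```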

For each token, we'll write a regular expression capable of matching its corresponding lexeme. Next, we need to generate the `enum` that matches the token. We can put this all together rather concisely as an array of tuples.

The first parameter in the tuple represents the regular expression we want to match at the beginning of the context and the second parameter is a closure that will generate the relevant token enumeration.

```swift
typealias TokenGenerator = (String) -> Token?
let tokenList: [(String, TokenGenerator)] = [
    ("[ \t\n]", { _ in nil }),
    ("[a-zA-Z][a-zA-Z0-9]*", { $0 == "def" ? .Define : .Identifier($0) }),
    ("[0-9.]+", { (r: String) in .Number((r as NSString).floatValue) }),
    ("\\(", { _ in .ParensOpen }),
    ("\\)", { _ in .ParensClose }),
    (",", { _ in .Comma }),
]
```

The first rule captures whitespace, which can be spaces, `\t` tabs, or `\n` line breaks. Its generator returns `nil`, so whitespace is matched and consumed without producing a token.
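We can sanity-check the generators in isolation before wiring up the lexer. A small sketch mirroring two rules from the table (the `Token` enum is redeclared, trimmed, and marked `Equatable` here only so the snippet stands alone):

```swift
enum Token: Equatable {
    case Define
    case Identifier(String)
}

// Mirrors the word rule from tokenList: "def" is a keyword,
// anything else that matches the pattern is an identifier.
let word: (String) -> Token? = { $0 == "def" ? Token.Define : Token.Identifier($0) }

// Mirrors the whitespace rule: the lexeme is matched and discarded.
let whitespace: (String) -> Token? = { _ in nil }

word("def")      // .Define
word("fib")      // .Identifier("fib")
whitespace(" ")  // nil
```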

Finally, we are ready to write our lexer. We'll try to match against any of the rules in our `tokenList`, failing which we will just assign the character to `.Other`.

Here's the method of interest:

```swift
func tokenize(input: String) -> [Token] {
    var tokens = [Token]()
    var content = input

    while (content.characters.count > 0) {
        var matched = false

        for (pattern, generator) in tokenList {
            if let m = content.match(pattern) {
                if let t = generator(m) {
                    tokens.append(t)
                }

                // Consume the matched lexeme before continuing.
                content = content.substringFromIndex(content.startIndex.advancedBy(m.characters.count))
                matched = true
                break
            }
        }

        if !matched {
            // No rule matched: emit a single character as .Other.
            let index = content.startIndex.advancedBy(1)
            tokens.append(.Other(content.substringToIndex(index)))
            content = content.substringFromIndex(index)
        }
    }
    return tokens
}
```
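Putting the pieces together, here is a condensed, self-contained sketch of the whole lexer that you can drop into a playground. It uses current Foundation and `String` API names rather than the Swift 2-era ones above, and `matchPrefix` stands in for the `match` extension described later; the logic is otherwise the same:

```swift
import Foundation

enum Token: Equatable {
    case Define
    case Identifier(String)
    case Number(Float)
    case ParensOpen
    case ParensClose
    case Comma
    case Other(String)
}

typealias TokenGenerator = (String) -> Token?
let tokenList: [(String, TokenGenerator)] = [
    ("[ \t\n]", { _ in nil }),
    ("[a-zA-Z][a-zA-Z0-9]*", { $0 == "def" ? Token.Define : Token.Identifier($0) }),
    ("[0-9.]+", { (r: String) in Token.Number((r as NSString).floatValue) }),
    ("\\(", { _ in Token.ParensOpen }),
    ("\\)", { _ in Token.ParensClose }),
    (",", { _ in Token.Comma }),
]

// Match `pattern` at the start of `input`; return the matched prefix.
func matchPrefix(_ pattern: String, in input: String) -> String? {
    let regex = try! NSRegularExpression(pattern: "^\(pattern)")
    let range = regex.rangeOfFirstMatch(
        in: input, options: [],
        range: NSRange(location: 0, length: input.utf16.count))
    guard range.location != NSNotFound else { return nil }
    return (input as NSString).substring(with: range)
}

func tokenize(_ input: String) -> [Token] {
    var tokens = [Token]()
    var content = input

    while !content.isEmpty {
        var matched = false

        for (pattern, generator) in tokenList {
            if let m = matchPrefix(pattern, in: content) {
                if let t = generator(m) {
                    tokens.append(t)
                }
                content = String(content.dropFirst(m.count))
                matched = true
                break
            }
        }

        if !matched {
            tokens.append(.Other(String(content.first!)))
            content = String(content.dropFirst())
        }
    }
    return tokens
}

tokenize("foo(3, 4)")
// [.Identifier("foo"), .ParensOpen, .Number(3.0), .Comma, .Number(4.0), .ParensClose]
```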

That's it for now! If you want to take a look at more code, this is all on GitHub. In the next post, we'll look at parsing and how we can build the AST.

To keep things simple we used a `String` extension to provide regex functionality via `func match(String) -> String?`. This is simply syntactic sugar over the `NSRegularExpression` API in `Foundation`:

```swift
var expressions = [String: NSRegularExpression]()

public extension String {
    public func match(regex: String) -> String? {
        let expression: NSRegularExpression
        if let exists = expressions[regex] {
            expression = exists
        } else {
            expression = try! NSRegularExpression(pattern: "^\(regex)", options: [])
            expressions[regex] = expression
        }

        let range = expression.rangeOfFirstMatchInString(self, options: [], range: NSMakeRange(0, self.utf16.count))
        if range.location != NSNotFound {
            return (self as NSString).substringWithRange(range)
        }
        return nil
    }
}
```

Matthew Cheok
Mobile Development & Interaction Design