Blog

A Lexer in Swift

This is a multi-post series covering the definition, example code and basic error handling/recovery of a Lexer, Parser and AST written in Swift (although concepts can be applied to any language).
 
These posts are targeted at the beginner–as such I will only cover the basics for getting started. This post will focus entirely on the Lexer.
 
I'll also include a Swift Playground at the bottom of each post if you want to try it out for yourself. *Requires Xcode 9 (Swift 4).

Why build a Lexer?

I wanted to write a Lexer/Parser to improve my skillset. I wanted to better understand the definitions of the various components, the responsibilities of each and more importantly why I'd want to write one in the first place.

I had trouble finding simple examples that just went over the basics, so that's what I've tried to do here. Hopefully you'll find this useful as well.

What I'll cover

This post is for beginners, I'll discuss the various components of a Lexer as well as its responsibilities. I'll also briefly discuss performance and error handling.

As I mentioned above this article is targeted at beginners, so I'm going  to demonstrate building a Lexer for parsing arithmetic operations.

Definitions

Code Point
A code point defines a single unicode scalar value. This is a complex subject so for the purposes of this article, you can consider a code point equivalent to a character.

Throughout this article I will use the terms Character and Code Point interchangeably.

Token
A token is essentially a component of our grammar. A token generally represents 1 or more code points.

Lets see how the equation will be tokenised.

tokens.jpg

The responsibility of a Lexer is to generate a tokenized representation of the source string. 

Characters vs Code Points

Typically when we talk about the components of a String, we think of it being made up of Characters. However there is another level of detail here. Characters can be made up of 1 or more Code Points which are essentially UnicodeScalar values.

For the purposes of this article however these two types are equal – as such I will use the term Characters however all example code will work on UnicodeScalar's. 

Tokens

A Lexer is responsible for breaking a string into smaller components, called Token's. Lets look at a simple example:

5 + 23 * 3 = 74

Lets start by defining the components of this expression. For the purposes of this article, we'll only consider Integer values and 4 simple arithmetic operators only.

  1. Digits (0-9)
  2. Operator (+, -, *, /)
  3. Equality (=)
  4. Space

An arithmetic equation contains 2 expressions, left and right separated by an equality character. For the purposes of our Lexer, this isn't really relevant however. We simply care about the 'components' that make up our equation. 

Omissions
To keep things simple I've omitted a lot of details for truly parsing any function, i.e. parentheses, negative/floating-point/exponential numbers, etc.. I'll leave those as an exercise for the reader.
 
Whitespace
If you plan to do any form of Linting you'll require whitespace tokens, however if you plan on writing a compiler to validate the equation then whitespace is obviously not important.

Tokenizer

Now that we've defined our possible tokens, lets define our Lexer class that will consume them.

public final class Lexer {
    func tokenize(source: String) -> [TokenType] {
        var tokens: [TokenType] = []
        let chars = source.unicodeScalars 

        while let token = chars.consumeToken() {
            tokens.append(token)
        }

        return tokens
    }
}

Notice the helper function: chars.consumeToken()
Lets define that now.

extension UnicodeScalarView {
    mutating func consumeToken() -> Token? {
        return consumeSpace()
            ?? consumeDigit()
            ?? consumeOperator()
            ?? consumeEquals()
    }
}

Remember each consumer is responsible for returning either a single Token or nil. We are considered done when no more tokens are returned. i.e all consumers return nil.

Space Consumer

For the purposes of this post I'll demonstrate the first consumer, however you can checkout the Playground for the full implementation.

The following represents a simple white-space consumer.

extension UnicodeScalarView {
    mutating func consumeSpace() -> Token? {
        return consumeCharacters(where: {
            $0 == UnicodeScalar(" ")
        }).map { .space($0.characters.count) }
    }
}

Basically this function will attempt to consume a space character. If its successful, it will return a token with the number of spaces found. Otherwise it will return nil and the next consumer will be run.

Result

So lets run our equation through our Tokeniser:

let source = "5 + 23 * 3 = 74"
let tokenizer = Tokenizer()
let tokens = tokenizer.parse(source)
dump(tokens)

Which will output the following:

▿ .number: "5"
▿ .space 1
▿ .operator: "+"
▿ .space: 1
▿ .number: "23"
▿ .space: 1
▿ .operator: "*"
▿ .space: 1
▿ .number: "3"
▿ .space: 1
▿ .operator: "="
▿ .space 1
▿ .number: "74"

So there we have it, a simple Lexer for tokenizing an equation from a string.

Error Handling

There are several types of failures that can occur.

Parse Error

The first is a parse error, where we couldn't reach the end of the file but came across a character we couldn't parse. In the tokeniser above we could simply fail and return a generic error, perhaps with the last position where the failure occurred.

In my own tokeniser I found that I could simply recover. The best approach I found was to change my loop to iterate while !eof (end-of-file). When I came across a character that couldn't be parsed, I would simply consume it into an .error(Range<String.Index>) token and attempt to continue.

Unexpected Token

Using our example equation, an unexpected token would be something like sequential operators with no digits in-between.

4 + / 5 = 9

This could also be considered a parse error, however the difference here is that out Tokeniser won't fail because we can parse that into tokens. This is where our Parser comes into play, but we'll cover that in the next post.

Performance

When I initially wrote the Lexer I kept it simple (similar to the example code in this post) and made all of my tokens an enum with an associated String value.

We're working with UnicodeScalarView – which is basically just a cursor into the string, so walking through the characters has almost zero-cost. However every time I parsed a character, I'd make a copy into a String and store it in a token. 

A better approach is to consider your token data and store it more appropriately. The following would be much more performant:

enum Token {
    case digit(Int)
    case space(Int) // number of spaces
    case addition
    case subtraction
    case multiplication
    case division
    case equals
}

Our simple equation Lexer probably wouldn't see much benefit from this considering the small amount of characters we would be parsing, even in extreme cases. However when you're parsing thousands of lines or more, these kinds of refactors can be hugely beneficial.

For reference, my Tokenizer went from 40ms to 16ms for a 1000 line file by making the simple change suggested above.

Conclusion

Writing a lexer can be a daunting and time-consuming task however there's a lot to be gained by doing so. 

  • Improve understanding of Lexers/Parsers and in-part Compilers
  • Improve your understanding of Unicode constructs
  • Improve your performance tuning skills – you'll need it!

If you have any questions or just want to chat, you can find me on Twitter @shaps.