Ruby Magic

Brewing our own Template Lexer in Ruby

Benedikt Deicke

Put on your scuba diving suit and pack your stencils, we’re diving into templates today!

Most software that renders web pages or generates emails uses templating to embed variable data into text documents. The main structure of the document is often set up in a static template with placeholders for the data. The variable data, like user names or web page contents, replace the placeholders while rendering the page.

For our dive into templating, we’ll implement a subset of Mustache, a templating language that’s available in many programming languages. In this episode, we’ll investigate different ways of templating. We’ll start out looking at string concatenation, and end up writing our own lexer to allow for more complex templates.

Using Native String Interpolation

Let’s start with a minimal example. Our application needs a welcome message that happens to include a project name. The quickest way to do this is by using Ruby’s built-in string interpolation feature.

name = "Ruby Magic"
template = "Welcome to #{name}"
# => "Welcome to Ruby Magic"

Great! That was doable. However, what if we want to reuse the template on multiple occasions, or allow our users to update the template?

Ruby evaluates the interpolation immediately, when the string literal is created. We can’t reuse the template (unless we redefine it, in a loop, for instance), and we can’t store the Welcome to #{name} template in a database and populate it later without resorting to the potentially dangerous eval method.

Luckily, Ruby has a different way of interpolating strings: Kernel#sprintf or String#%. These methods allow us to get an interpolated string without changing the template itself. This way, we can reuse the same template multiple times. It also doesn’t allow execution of arbitrary Ruby code. Let’s use it.

name = "Ruby Magic"
template = "Welcome to %{name}"

sprintf(template, name: name)
# => "Welcome to Ruby Magic"

template % { name: name }
# => "Welcome to Ruby Magic"
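
To make the reuse concrete, here is a minimal sketch that fills one template for several projects. The render_welcome helper, the WELCOME_TEMPLATE constant and the second project name are made up for illustration. It also shows what happens when a value is missing: Ruby raises a KeyError instead of leaving a gap.

```ruby
# Hypothetical helper, not part of any library: fills the same
# template with a different name on every call.
WELCOME_TEMPLATE = "Welcome to %{name}"

def render_welcome(name)
  WELCOME_TEMPLATE % { name: name }
end

puts render_welcome("Ruby Magic")
puts render_welcome("AppSignal")

# The template string itself is left untouched between calls.
puts WELCOME_TEMPLATE

# A placeholder without a matching key raises KeyError.
begin
  WELCOME_TEMPLATE % { title: "Ruby Magic" }
rescue KeyError => error
  puts "Missing value: #{error.message}"
end
```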

The Regexp Approach to Templating

While the above solution works, it’s not foolproof, and it exposes more functionality than we usually want. Let’s look at an example:

name = "Ruby Magic"
template = "Welcome to %d"

sprintf(template, name: name)
# => TypeError (can't convert Hash into Integer)

Both Kernel#sprintf and String#% allow special syntax to handle different types of data. Not all of them are compatible with the data we pass. In this example, the template expects to format a number but gets passed a Hash, producing a TypeError.
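
As a rough illustration of how much formatting syntax a template author gets access to, here are a few of the directives that sprintf (and its alias format) understands. The values are arbitrary examples chosen for this sketch.

```ruby
# %{name} inserts the value as-is; %<name>... applies a format to it.
puts format("%{name}", name: "Ruby Magic")   # plain substitution
puts format("%<price>.2f", price: 3.14159)   # float rounded to two decimals
puts format("%<count>05d", count: 42)        # integer, zero-padded to width 5
puts format("%<name>-12s|", name: "Ruby")    # string, left-aligned in 12 columns

# A directive that doesn't fit the data raises instead of rendering.
begin
  format("Welcome to %d", name: "Ruby Magic")
rescue TypeError => error
  puts "TypeError: #{error.message}"
end
```

Every one of these directives is available to whoever writes the template, which is more than a simple placeholder syntax needs.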

But we have more power tools in our shed: we can implement our own interpolation using regular expressions. Using regular expressions allows us to define a custom syntax, like a Mustache/Handlebars-inspired style.

name = "Ruby Magic"
template = "Welcome to {{name}}"
assigns = { "name" => name }

template.gsub(/{{(\w+)}}/) { assigns[$1] }
# => "Welcome to Ruby Magic"

We use String#gsub to replace all placeholders (words in double curly braces) with their value in the assigns hash. If there is no corresponding value, this method removes the placeholder without inserting anything.
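
The silent removal of unknown placeholders is easy to miss, so here is a small sketch that contrasts it with a stricter variant. The render_lenient and render_strict helpers and the PLACEHOLDER constant are hypothetical names for this example, and the braces are escaped in the pattern to keep it unambiguous.

```ruby
# Hypothetical helpers for illustration; the names are not from any library.
PLACEHOLDER = /\{\{(\w+)\}\}/

# Unknown placeholders disappear, because nil is converted to "".
def render_lenient(template, assigns)
  template.gsub(PLACEHOLDER) { assigns[$1] }
end

# Unknown placeholders raise a KeyError instead of vanishing silently.
def render_strict(template, assigns)
  template.gsub(PLACEHOLDER) { assigns.fetch($1) }
end

template = "Welcome to {{name}}, {{user}}!"
assigns = { "name" => "Ruby Magic" }

puts render_lenient(template, assigns)

begin
  render_strict(template, assigns)
rescue KeyError => error
  puts "Missing value: #{error.message}"
end
```

Which behavior is preferable depends on the use case: dropping the placeholder keeps rendering going, while raising makes a typo in the template visible early.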

Replacing placeholders in a string like this is a viable solution for a string with a couple of placeholders. However, once things get a bit more complicated, we quickly run into problems.

Let’s say we need to have conditionals in the template. The result should be different based on the value of a variable.

Welcome to {{name}}!

{{#if subscribed}}
  Thank you for subscribing to our mailing list.
{{else}}
  Please sign up for our mailing list to be notified about new articles!
{{/if}}

Your friends at {{company_name}}

Regular expressions can’t smoothly handle this use case. If you try hard enough, you can probably still hack something together, but at this point, it’s better to build a proper templating language.

Building a Templating Language

Implementing a templating language is similar to implementing other programming languages. Just like a scripting language, a template language needs three components: a lexer, a parser, and an interpreter. We’ll look at these, one by one.

Lexer

The first task we need to tackle is called tokenization, or lexical analysis. The process is very similar to identifying word categories in natural languages.

Take an example sentence like “Ruby is a lovely language”. The sentence consists of five words of different categories. To identify their categories, you’d take a dictionary and look up every word, which would result in a list like this: Noun, Verb, Article, Adjective, Noun. Natural language processing calls these “Parts of Speech”. In formal languages, such as programming languages, they’re called tokens.
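
The dictionary lookup can be sketched in a few lines of Ruby. This is only a toy: the DICTIONARY constant and the tag_words helper are invented for this example and know nothing beyond the five words of the sentence.

```ruby
# A toy dictionary with just enough entries for the example sentence.
DICTIONARY = {
  "ruby" => :NOUN,
  "is" => :VERB,
  "a" => :ARTICLE,
  "lovely" => :ADJECTIVE,
  "language" => :NOUN
}.freeze

# Hypothetical helper: looks up every word and pairs it with its category.
def tag_words(sentence)
  sentence.downcase.scan(/\w+/).map do |word|
    [DICTIONARY.fetch(word, :UNKNOWN), word]
  end
end

tag_words("Ruby is a lovely language").each { |pair| p pair }
```

The result has the same shape as the tokens we build later on: a category first, followed by the text it was derived from.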

A lexer works by reading the template and matching the stream of text with a set of regular expressions for each category in a given order. The first one that matches defines the category of the token and attaches relevant data to it.

With this little bit of theory out of the way, let’s implement a lexer for our template language. To make things a little bit easier, we use StringScanner by requiring strscan from Ruby’s standard library. (By the way, we’ve got an excellent intro to StringScanner in one of our previous editions.) As a first step, let’s build a minimal version that identifies everything as CONTENT.

We do this by creating a new StringScanner instance and letting it do its job using an until loop that only stops when the scanner reaches the end of the string.

For now, we just let it match every character (.*) across multiple lines (the m modifier) and return one CONTENT token for all of it. We represent a token as an array with the token name as the first element and any data as the second element. Our very basic lexer looks something like this:

require 'strscan'

module Magicbars
  class Lexer
    def self.tokenize(code)
      new.tokenize(code)
    end

    def tokenize(code)
      scanner = StringScanner.new(code)
      tokens = []

      until scanner.eos?
        tokens << [:CONTENT, scanner.scan(/.*/m)]
      end

      tokens
    end
  end
end

When running this code with Welcome to {{name}} we get back a list of precisely one CONTENT token with all of the code attached to it.

Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to {{name}}"]]
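
If StringScanner is new to you, the following sketch shows the behavior the lexer relies on: scan only matches at the current position, returns nil when it fails, and moves the position forward when it succeeds. It also shows why a lazy pattern is risky in an until scanner.eos? loop: a zero-length match counts as a success but doesn’t advance the scanner, so the loop would never reach the end of the string.

```ruby
require "strscan"

scanner = StringScanner.new("Welcome to {{name}}")

word = scanner.scan(/\w+/)   # matches at the current position and advances
miss = scanner.scan(/\d+/)   # no digits here: returns nil, position unchanged

position_before = scanner.pos
empty = scanner.scan(/.*?/m) # lazy match succeeds with "" and does not advance
position_after = scanner.pos

rest = scanner.scan(/.*/m)   # greedy match consumes everything that is left

p word, miss, empty, rest
p position_before == position_after
p scanner.eos?
```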

Next, let’s detect the expression. To do so, we modify the code inside the loop, so it matches {{ and }} as OPEN_EXPRESSION and CLOSE.

We do this by adding a conditional that checks for the different cases.

until scanner.eos?
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
  elsif scanner.scan(/.*/m)
    tokens << [:CONTENT, scanner.matched]
  end
end

There’s no added value in attaching the curly braces to the OPEN_EXPRESSION and CLOSE tokens, so we drop them. As the scan calls are now part of the condition, we use scanner.matched to attach the result of the last match to the CONTENT token.

Unfortunately, when rerunning the lexer, we still get only one CONTENT token like before. We still have to modify the last expression to match everything up to the open expression. We do this by using scan_until with a positive lookahead anchor for double curly braces that stops the scanner right before them. Our code inside the loop now looks like this:

until scanner.eos?
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
  elsif scanner.scan_until(/.*?(?={{|}})/m)
    tokens << [:CONTENT, scanner.matched]
  end
end

Running the lexer again now results in four tokens:

Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:CONTENT, "name"], [:CLOSE]]
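
To see what the lookahead buys us, here is a small comparison of scan_until with and without it. The variable names are just for this sketch, and the braces are escaped in the patterns to keep them unambiguous.

```ruby
require "strscan"

# Without a lookahead, scan_until consumes the braces as well.
consuming = StringScanner.new("Welcome to {{name}}")
p consuming.scan_until(/\{\{/)   # text up to and including the braces
p consuming.matched              # only the part the pattern itself matched
p consuming.pos                  # the scanner is now past the braces

# With a lookahead, the scanner stops right before the braces.
peeking = StringScanner.new("Welcome to {{name}}")
p peeking.scan_until(/.*?(?=\{\{)/m)
p peeking.check(/\{\{/)          # the braces are still there to be scanned

# When the pattern is nowhere to be found, scan_until returns nil.
p StringScanner.new("no expressions here").scan_until(/.*?(?=\{\{)/m)
```

The last case matters for any template that ends in plain text: nothing matches, the scanner doesn’t move, and the loop needs another way to consume the remainder.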

Our lexer looks pretty close to the result we want. However, name isn’t regular content; it’s an identifier! Strings between double curly braces should be treated differently than strings outside.

A State Machine

To do this, we turn the lexer into a state machine with two distinct states. It starts in the default state. When it hits an OPEN_EXPRESSION token, it moves to the expression state and stays there until it comes across a CLOSE token, which makes it transition back to the default state.

[Figure: state machine with a default state and an expression state]

We implement the state machine by adding a few methods that use an array to manage the current state.

def stack
  @stack ||= []
end

def state
  stack.last || :default
end

def push_state(state)
  stack.push(state)
end

def pop_state
  stack.pop
end

The state method returns either the current state or, when the stack is empty, :default. push_state moves the lexer into a new state by adding it to the stack. pop_state moves the lexer back to the previous state.
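
To watch the transitions in isolation, the helpers can be dropped into a throwaway class. StateDemo is a hypothetical stand-in for the lexer that contains nothing but the state handling.

```ruby
# Hypothetical stand-in for the lexer, containing only the state helpers.
class StateDemo
  def stack
    @stack ||= []
  end

  def state
    stack.last || :default
  end

  def push_state(state)
    stack.push(state)
  end

  def pop_state
    stack.pop
  end
end

demo = StateDemo.new
p demo.state                  # nothing pushed yet, so the fallback applies
demo.push_state(:expression)  # as if we had just seen the opening braces
p demo.state
demo.pop_state                # as if we had just seen the closing braces
p demo.state
```

For two states a single variable would do as well. A stack presumably leaves room for states that nest inside each other later on, since popping always returns to whatever state was active before.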

Next, we split up the conditional within the loop and wrap it in a conditional that checks the current state. While in the default state, we handle both OPEN_EXPRESSION and CONTENT tokens. This also means that the regular expression for CONTENT doesn’t need the }} lookahead anymore, so we drop it. In the expression state, we handle the CLOSE token and add a new regular expression for IDENTIFIER. Of course, we also implement the state transitions by adding a push_state call to OPEN_EXPRESSION and a pop_state call to CLOSE.

if state == :default
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
    push_state :expression
  elsif scanner.scan_until(/.*?(?={{)/m)
    tokens << [:CONTENT, scanner.matched]
  end
elsif state == :expression
  if scanner.scan(/}}/)
    tokens << [:CLOSE]
    pop_state
  elsif scanner.scan(/[\w\-]+/)
    tokens << [:IDENTIFIER, scanner.matched]
  end
end

With these changes in place, the lexer now properly tokenizes our example.

Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:IDENTIFIER, "name"], [:CLOSE]]

Making It Harder for Ourselves

Let’s move on to a more advanced example. This one uses multiple expressions, as well as a block.

Welcome to {{name}}!

{{#if subscribed}}
  Thank you for subscribing to our mailing list.
{{else}}
  Please sign up for our mailing list to be notified about new articles!
{{/if}}

Your friends at {{company_name}}

It’s no surprise that our lexer fails to tokenize this example. To make it work, we have to add the missing tokens and make it handle the content after the last expression. The code inside the loop looks something like this:

if state == :default
  if scanner.scan(/{{#/)
    tokens << [:OPEN_BLOCK]
    push_state :expression
  elsif scanner.scan(/{{\//)
    tokens << [:OPEN_END_BLOCK]
    push_state :expression
  elsif scanner.scan(/{{else/)
    tokens << [:OPEN_INVERSE]
    push_state :expression
  elsif scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
    push_state :expression
  elsif scanner.scan_until(/.*?(?={{)/m)
    tokens << [:CONTENT, scanner.matched]
  else
    tokens << [:CONTENT, scanner.rest]
    scanner.terminate
  end
elsif state == :expression
  if scanner.scan(/\s+/)
    # Ignore whitespace
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
    pop_state
  elsif scanner.scan(/[\w\-]+/)
    tokens << [:IDENTIFIER, scanner.matched]
  else
    scanner.terminate
  end
end

Please keep in mind that the order of the conditions matters. The first regular expression that matches determines the token. Thus, more specific expressions have to come before more generic ones. The prime example of this is the collection of specialized open tokens for blocks, which all have to be checked before the generic {{.
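
The effect of the ordering can be shown with two rules and a small helper. The first_match method and the two rule lists are hypothetical and exist only for this sketch; the lexer itself uses the if/elsif chain.

```ruby
require "strscan"

# Hypothetical helper: returns the name of the first rule whose pattern
# matches at the start of the input, or nil if none does.
def first_match(rules, input)
  scanner = StringScanner.new(input)
  name, _pattern = rules.find { |_name, pattern| scanner.check(pattern) }
  name
end

# The specific rule for blocks comes before the generic one.
SPECIFIC_FIRST = [
  [:OPEN_BLOCK, /\{\{#/],
  [:OPEN_EXPRESSION, /\{\{/]
].freeze

# The same rules in the opposite order.
GENERIC_FIRST = SPECIFIC_FIRST.reverse.freeze

p first_match(SPECIFIC_FIRST, "{{#if subscribed}}")
p first_match(GENERIC_FIRST, "{{#if subscribed}}")
```

With the generic rule first, the block opener is mistaken for a plain expression, and the # would be left behind for the expression state to stumble over.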

Using the final version of the lexer, the example now tokenizes into this:

[
  [:CONTENT, "Welcome to "],
  [:OPEN_EXPRESSION],
  [:IDENTIFIER, "name"],
  [:CLOSE],
  [:CONTENT, "!\n\n"],
  [:OPEN_BLOCK],
  [:IDENTIFIER, "if"],
  [:IDENTIFIER, "subscribed"],
  [:CLOSE],
  [:CONTENT, "\n  Thank you for subscribing to our mailing list.\n"],
  [:OPEN_INVERSE],
  [:CLOSE],
  [:CONTENT, "\n  Please sign up for our mailing list to be notified about new articles!\n"],
  [:OPEN_END_BLOCK],
  [:IDENTIFIER, "if"],
  [:CLOSE],
  [:CONTENT, "\n\nYour friends at "],
  [:OPEN_EXPRESSION],
  [:IDENTIFIER, "company_name"],
  [:CLOSE],
  [:CONTENT, "\n"]
]
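
Since the lexer was built up in fragments, here is one way the pieces might fit together in a single file. It is assembled from the snippets in this article, with two small liberties taken for this sketch: the curly braces are escaped in the patterns, and the state helpers are marked private.

```ruby
require "strscan"

module Magicbars
  class Lexer
    def self.tokenize(code)
      new.tokenize(code)
    end

    def tokenize(code)
      scanner = StringScanner.new(code)
      tokens = []

      until scanner.eos?
        if state == :default
          # Specific openers first, the generic one last.
          if scanner.scan(/\{\{#/)
            tokens << [:OPEN_BLOCK]
            push_state :expression
          elsif scanner.scan(/\{\{\//)
            tokens << [:OPEN_END_BLOCK]
            push_state :expression
          elsif scanner.scan(/\{\{else/)
            tokens << [:OPEN_INVERSE]
            push_state :expression
          elsif scanner.scan(/\{\{/)
            tokens << [:OPEN_EXPRESSION]
            push_state :expression
          elsif scanner.scan_until(/.*?(?=\{\{)/m)
            tokens << [:CONTENT, scanner.matched]
          else
            # No further expressions: the rest is plain content.
            tokens << [:CONTENT, scanner.rest]
            scanner.terminate
          end
        elsif state == :expression
          if scanner.scan(/\s+/)
            # Ignore whitespace
          elsif scanner.scan(/\}\}/)
            tokens << [:CLOSE]
            pop_state
          elsif scanner.scan(/[\w\-]+/)
            tokens << [:IDENTIFIER, scanner.matched]
          else
            # Unexpected character inside an expression: stop scanning.
            scanner.terminate
          end
        end
      end

      tokens
    end

    private

    def stack
      @stack ||= []
    end

    def state
      stack.last || :default
    end

    def push_state(state)
      stack.push(state)
    end

    def pop_state
      stack.pop
    end
  end
end

Magicbars::Lexer.tokenize("Welcome to {{name}}!").each { |token| p token }
```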

Now that we’re finished, we’ve identified seven different types of tokens:

| Token | Example |
| --- | --- |
| OPEN_BLOCK | {{# |
| OPEN_END_BLOCK | {{/ |
| OPEN_INVERSE | {{else |
| OPEN_EXPRESSION | {{ |
| CONTENT | Anything outside of expressions (normal HTML or text) |
| CLOSE | }} |
| IDENTIFIER | Identifiers consist of letters, digits, _, and - |

The next step is to implement a parser that tries to figure out the structure of the token stream and translates it into an abstract syntax tree, but that’s for another time.

The Road Ahead

We started our journey towards our own templating language by looking at different ways to implement a basic templating system using string interpolation. When we hit the limits of the first approaches, we started implementing a proper templating system.

For now, we implemented a lexer that analyses the template and figures out the different types of tokens. In an upcoming edition of Ruby Magic, we’ll continue the journey by implementing a parser as well as an interpreter to generate an interpolated string.

Guest writer Benedikt Deicke is a software engineer and co-founder of Userlist.io. On the side, he’s writing a book about building SaaS applications in Ruby on Rails. You can reach out to Benedikt via Twitter.
