Brewing our own Template Lexer in Ruby

Benedikt Deicke

Put on your scuba diving suit and pack your stencils, we're diving into templates today!

Most software that renders web pages or generates emails uses templating to embed variable data into text documents. The main structure of the document is often set up in a static template with placeholders for the data. The variable data, like user names or web page contents, replace the placeholders while rendering the page.

For our dive into templating, we'll implement a subset of Mustache, a templating language that's available in many programming languages. In this episode, we'll investigate different ways of templating. We'll start out looking at string concatenation, and end up writing our own lexer to allow for more complex templates.

Using Native String Interpolation

Let's start with a minimal example. Our application needs a welcome message that happens to include a project name. The quickest way to do this is by using Ruby's built-in string interpolation feature.

Ruby
name = "Ruby Magic" template = "Welcome to #{name}" # => Welcome to Ruby Magic

Great! That was doable. However, what if we want to reuse the template for multiple occasions, or allow our users to update the template?

The interpolation evaluates immediately. We can't reuse the template (unless we redefine it—in a loop, for instance) and we can't store the Welcome to #{name} template in a database and populate it later without using the potentially dangerous eval function.
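To see why eval is off the table: it can defer interpolation, but it will happily execute anything embedded in the template. A sketch of what not to do:

Ruby
# Single quotes keep #{name} from being evaluated right away.
template = 'Welcome to #{name}'
name = "Ruby Magic"

eval('"' + template + '"') # => "Welcome to Ruby Magic"

# But a template like 'Welcome to #{system("rm -rf /")}' would execute
# arbitrary code, so never do this with user-supplied templates.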

Luckily, Ruby has a different way of interpolating strings: Kernel#sprintf or String#%. These methods allow us to get an interpolated string without changing the template itself. This way, we can reuse the same template multiple times. It also doesn't allow execution of arbitrary Ruby code. Let's use it.

Ruby
name = "Ruby Magic" template = "Welcome to %{name}" sprintf(template, name: name) # => "Welcome to Ruby Magic" template % { name: name } # => "Welcome to Ruby Magic"

The Regexp Approach to Templating

While the above solution works, it's not fool-proof, and it exposes more functionality than we usually want to. Let's look at an example:

Ruby
name = "Ruby Magic" template = "Welcome to %d" sprintf(template, name: name) # => TypeError (can't convert Hash into Integer)

Both Kernel#sprintf and String#% allow special syntax to handle different types of data. Not all of them are compatible with the data we pass. In this example, the template expects to format a number but gets passed a Hash, producing a TypeError.
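For comparison, the %d directive works fine when it receives the integer it expects; the problem is that a template author can reach for any of sprintf's directives:

Ruby
sprintf("You have %d new messages", 3) # => "You have 3 new messages"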

But we have more power tools in our shed: we can implement our own interpolation using regular expressions. Using regular expressions allows us to define a custom syntax, like a Mustache/Handlebars inspired style.

Ruby
name = "Ruby Magic" template = "Welcome to {{name}}" assigns = { "name" => name } template.gsub(/{{(\w+)}}/) { assigns[$1] } # => Welcome to Ruby Magic

We use String#gsub to replace all placeholders (words in double curly braces) with their value in the assigns hash. If there is no corresponding value, this method removes the placeholder without inserting anything.
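Here's that edge case in action, along with a variation that keeps unknown placeholders intact by falling back to the full match via Hash#fetch:

Ruby
assigns = { "name" => "Ruby Magic" }
template = "Welcome to {{name}}, from {{company}}"

# The block returns nil for "company", which gsub turns into an empty string.
template.gsub(/{{(\w+)}}/) { assigns[$1] }
# => "Welcome to Ruby Magic, from "

# Alternative: fall back to the full match to keep unknown placeholders visible.
template.gsub(/{{(\w+)}}/) { |match| assigns.fetch($1, match) }
# => "Welcome to Ruby Magic, from {{company}}"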

Replacing placeholders in a string like this is a viable solution for a string with a couple of placeholders. However, once things get a bit more complicated, we quickly run into problems.

Let's say we need to have conditionals in the template. The result should be different based on the value of a variable.

handlebars
Welcome to {{name}}!

{{#if subscribed}}
  Thank you for subscribing to our mailing list.
{{else}}
  Please sign up for our mailing list to be notified about new articles!
{{/if}}

Your friends at {{company_name}}

Regular expressions can't smoothly handle this use case. If you try hard enough, you can probably still hack something together, but at this point, it's better to build a proper templating language.

Building a Templating Language

Implementing a templating language is similar to implementing other programming languages. Just like a scripting language, a template language needs three components: A lexer, a parser, and an interpreter. We'll look at these, one by one.

Lexer

The first task we need to tackle is called tokenization, or lexical analysis. The process is very similar to identifying word categories in natural languages.

Take an example like Ruby is a lovely language. The sentence consists of five words of different categories. To identify what category they are, you'd take a dictionary and look up every word's category, which would result in a list like this: Noun, Verb, Article, Adjective, Noun. Natural language processing calls these "Parts of Speech". In formal languages, like programming languages, they're called tokens.

A lexer works by reading the template and matching the stream of text with a set of regular expressions for each category in a given order. The first one that matches defines the category of the token and attaches relevant data to it.

With this little bit of theory out of the way, let's implement a lexer for our template language. To make things a little bit easier, we use StringScanner by requiring strscan from Ruby's standard library. (By the way, we've got an excellent intro to StringScanner in one of our previous editions.) As a first step, let's build a minimal version that identifies everything as CONTENT.
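If you haven't used StringScanner before, here's a quick illustration of the calls we'll rely on. A scanner wraps a string and keeps a position in it; scan advances the position when the pattern matches right at the current position, while scan_until searches ahead for the pattern:

Ruby
require 'strscan'

scanner = StringScanner.new("Welcome to {{name}}")
scanner.scan(/\w+/)      # => "Welcome" (matches at the current position)
scanner.scan(/\d+/)      # => nil (no match, position stays put)
scanner.scan_until(/{{/) # => " to {{" (everything up to and including the match)
scanner.matched          # => "{{" (the most recent successful match)
scanner.rest             # => "name}}" (whatever is left)
scanner.eos?             # => false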

We do this by creating a new StringScanner instance and letting it do its job using an until loop that only stops when the scanner reaches the end of the string.

For now, we just let it match every character (.*) across multiple lines (the m modifier) and return one CONTENT token for all of it. We represent a token as an array with the token name as the first element and any data as the second element. Our very basic lexer looks something like this:

Ruby
require 'strscan'

module Magicbars
  class Lexer
    def self.tokenize(code)
      new.tokenize(code)
    end

    def tokenize(code)
      scanner = StringScanner.new(code)
      tokens = []

      until scanner.eos?
        tokens << [:CONTENT, scanner.scan(/.*/m)]
      end

      tokens
    end
  end
end

When running this code with Welcome to {{name}} we get back a list of precisely one CONTENT token with all of the code attached to it.

Ruby
Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to {{name}}"]]

Next, let's detect the expression. To do so, we modify the code inside the loop, so it matches {{ and }} as OPEN_EXPRESSION and CLOSE.

We do this by adding a conditional that checks for the different cases.

Ruby
until scanner.eos?
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
  elsif scanner.scan(/.*/m)
    tokens << [:CONTENT, scanner.matched]
  end
end

There's no added value in attaching the curly braces to the OPEN_EXPRESSION and CLOSE tokens, so we drop them. As the scan calls are now part of the condition, we use scanner.matched to attach the result of the last match to the CONTENT token.

Unfortunately, when rerunning the lexer, we still get only one CONTENT token like before. We still have to modify the last expression to match everything up to the open expression. We do this by using scan_until with a positive lookahead anchor for double curly braces that stops the scanner right before them. Our code inside the loop now looks like this:

Ruby
until scanner.eos?
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
  elsif scanner.scan_until(/.*?(?={{|}})/m)
    tokens << [:CONTENT, scanner.matched]
  end
end
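To see what the lookahead does in isolation: the (?={{|}}) assertion makes the match stop right before the next pair of braces without consuming them, so they're still there for the next iteration of the loop.

Ruby
scanner = StringScanner.new("Welcome to {{name}}")
scanner.scan_until(/.*?(?={{|}})/m) # => "Welcome to "
scanner.rest                        # => "{{name}}" (the braces are untouched)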

Running the lexer again now results in four tokens:

Ruby
Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:CONTENT, "name"], [:CLOSE]]

Our lexer looks pretty close to the result we want. However, name isn't regular content; it's an identifier! Strings between double curly braces should be treated differently than strings outside.

A State Machine

To do this, we turn the lexer into a state machine with two distinct states. It starts in the default state. When it hits an OPEN_EXPRESSION token, it moves to the expression state and stays there until it comes across a CLOSE token, which makes it transition back to the default state.


We implement the state machine by adding a few methods that use an array to manage the current state.

Ruby
def stack
  @stack ||= []
end

def state
  stack.last || :default
end

def push_state(state)
  stack.push(state)
end

def pop_state
  stack.pop
end

The state method will either return the current state or default. push_state moves the lexer into a new state by adding it to the stack. pop_state moves the lexer back to the previous state.
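Assuming these helper methods are public, the transitions behave like a small stack machine:

Ruby
lexer = Magicbars::Lexer.new
lexer.state                   # => :default
lexer.push_state(:expression)
lexer.state                   # => :expression
lexer.pop_state
lexer.state                   # => :default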

Next, we split up the conditional within the loop and wrap it in a conditional that checks for the current state. While in the default state, we handle both OPEN_EXPRESSION and CONTENT tokens. This also means that the regular expression for CONTENT doesn't need the }} lookahead anymore, so we drop it. In the expression state, we handle the CLOSE token and add a new regular expression for IDENTIFIER. Of course, we also implement the state transitions by adding a push_state call to OPEN_EXPRESSION and a pop_state call to CLOSE.

Ruby
if state == :default
  if scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
    push_state :expression
  elsif scanner.scan_until(/.*?(?={{)/m)
    tokens << [:CONTENT, scanner.matched]
  end
elsif state == :expression
  if scanner.scan(/}}/)
    tokens << [:CLOSE]
    pop_state
  elsif scanner.scan(/[\w\-]+/)
    tokens << [:IDENTIFIER, scanner.matched]
  end
end

With these changes in place, the lexer now properly tokenizes our example.

Ruby
Magicbars::Lexer.tokenize("Welcome to {{name}}")
# => [[:CONTENT, "Welcome to "], [:OPEN_EXPRESSION], [:IDENTIFIER, "name"], [:CLOSE]]

Making it harder for ourselves

Let's move on to a more advanced example. This one uses multiple expressions, as well as a block.

handlebars
Welcome to {{name}}!

{{#if subscribed}}
  Thank you for subscribing to our mailing list.
{{else}}
  Please sign up for our mailing list to be notified about new articles!
{{/if}}

Your friends at {{company_name}}

It's no surprise that our lexer fails to properly tokenize this example. To make it work, we have to add the missing tokens and make it handle the content after the last expression. The code inside the loop ends up looking something like this:

Ruby
if state == :default
  if scanner.scan(/{{#/)
    tokens << [:OPEN_BLOCK]
    push_state :expression
  elsif scanner.scan(/{{\//)
    tokens << [:OPEN_END_BLOCK]
    push_state :expression
  elsif scanner.scan(/{{else/)
    tokens << [:OPEN_INVERSE]
    push_state :expression
  elsif scanner.scan(/{{/)
    tokens << [:OPEN_EXPRESSION]
    push_state :expression
  elsif scanner.scan_until(/.*?(?={{)/m)
    tokens << [:CONTENT, scanner.matched]
  else
    tokens << [:CONTENT, scanner.rest]
    scanner.terminate
  end
elsif state == :expression
  if scanner.scan(/\s+/)
    # Ignore whitespace
  elsif scanner.scan(/}}/)
    tokens << [:CLOSE]
    pop_state
  elsif scanner.scan(/[\w\-]+/)
    tokens << [:IDENTIFIER, scanner.matched]
  else
    scanner.terminate
  end
end

Please keep in mind that the order of the conditions matters: the first regular expression that matches wins. Thus, more specific expressions have to come before more generic ones. The prime example of this is the collection of specialized open tokens for blocks.
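To see why, consider what would happen if the generic /{{/ rule came before /{{#/: the lexer would emit OPEN_EXPRESSION for the first two characters of {{#if and then hit the stray # in the expression state, where nothing matches and the else branch discards the rest of the template.

Ruby
Magicbars::Lexer.tokenize("{{#if subscribed}}")
# With the rules in the right order:
# => [[:OPEN_BLOCK], [:IDENTIFIER, "if"], [:IDENTIFIER, "subscribed"], [:CLOSE]]
# With /{{/ checked before /{{#/, we'd get [[:OPEN_EXPRESSION]] and lose the rest.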

Using the final version of the lexer, the example now tokenizes into this:

Ruby
[
  [:CONTENT, "Welcome to "],
  [:OPEN_EXPRESSION],
  [:IDENTIFIER, "name"],
  [:CLOSE],
  [:CONTENT, "!\n\n"],
  [:OPEN_BLOCK],
  [:IDENTIFIER, "if"],
  [:IDENTIFIER, "subscribed"],
  [:CLOSE],
  [:CONTENT, "\n Thank you for subscribing to our mailing list.\n"],
  [:OPEN_INVERSE],
  [:CLOSE],
  [:CONTENT, "\n Please sign up for our mailing list to be notified about new articles!\n"],
  [:OPEN_END_BLOCK],
  [:IDENTIFIER, "if"],
  [:CLOSE],
  [:CONTENT, "\n\nYour friends at "],
  [:OPEN_EXPRESSION],
  [:IDENTIFIER, "company_name"],
  [:CLOSE],
  [:CONTENT, "\n"]
]

Now that we're finished, we've identified seven different types of tokens:

Token            Example
OPEN_BLOCK       {{#
OPEN_END_BLOCK   {{/
OPEN_INVERSE     {{else
OPEN_EXPRESSION  {{
CONTENT          Anything outside of expressions (normal HTML or text)
CLOSE            }}
IDENTIFIER       Word characters, numbers, _, and -
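For reference, here's the whole thing in one piece. The article builds the lexer up incrementally, so this is simply the snippets above assembled into the class skeleton from the first version:

Ruby
require 'strscan'

module Magicbars
  class Lexer
    def self.tokenize(code)
      new.tokenize(code)
    end

    def tokenize(code)
      scanner = StringScanner.new(code)
      tokens = []

      until scanner.eos?
        if state == :default
          if scanner.scan(/{{#/)
            tokens << [:OPEN_BLOCK]
            push_state :expression
          elsif scanner.scan(/{{\//)
            tokens << [:OPEN_END_BLOCK]
            push_state :expression
          elsif scanner.scan(/{{else/)
            tokens << [:OPEN_INVERSE]
            push_state :expression
          elsif scanner.scan(/{{/)
            tokens << [:OPEN_EXPRESSION]
            push_state :expression
          elsif scanner.scan_until(/.*?(?={{)/m)
            tokens << [:CONTENT, scanner.matched]
          else
            tokens << [:CONTENT, scanner.rest]
            scanner.terminate
          end
        elsif state == :expression
          if scanner.scan(/\s+/)
            # Ignore whitespace between identifiers
          elsif scanner.scan(/}}/)
            tokens << [:CLOSE]
            pop_state
          elsif scanner.scan(/[\w\-]+/)
            tokens << [:IDENTIFIER, scanner.matched]
          else
            scanner.terminate
          end
        end
      end

      tokens
    end

    def stack
      @stack ||= []
    end

    def state
      stack.last || :default
    end

    def push_state(state)
      stack.push(state)
    end

    def pop_state
      stack.pop
    end
  end
end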

The next step is to implement a parser that tries to figure out the structure of the token stream and translates it into an abstract syntax tree, but that's for another time.

The Road Ahead

We started our journey towards our own templating language by looking at different ways to implement a basic templating system using string interpolation. When we hit the limits of the first approaches, we started implementing a proper templating system.

For now, we implemented a lexer that analyses the template and figures out the different types of tokens. In an upcoming edition of Ruby Magic, we'll continue the journey by implementing a parser as well as an interpreter to generate an interpolated string.


Guest author Benedikt Deicke is a software engineer and CTO of Userlist. On the side, he’s writing a book about building SaaS applications in Ruby on Rails.
