An Introduction to Ruby Parsing with Prism

You might have heard about Prism, the new Ruby parser. Perhaps you've heard it's faster, more reliable, and more powerful than what we had before. Or maybe you never took a compilers class and aren't sure about what this actually means.

I'm here to tell you all about it, and how it's changing our lives as Ruby developers. Today, I want to take you from square one to writing your first transpiler.

Interpreters 101

Before we begin our journey, let's start with the basics of how an interpreter works, so we're all on the same page. Interpreting a programming language usually involves three main steps:

Tokenizing input (a.k.a. lexing): Breaking the input text into a list of meaningful tokens. For example, an expression such as 0 + 1 if admin? could be turned into something like this:

Ruby

tokens = [
  { type: :integer,    literal: "0",      value: 0,   line: 1 },
  { type: :operator,   literal: "+",      value: nil, line: 1 },
  { type: :integer,    literal: "1",      value: 1,   line: 1 },
  { type: :keyword,    literal: "if",     value: nil, line: 1 },
  { type: :identifier, literal: "admin?", value: nil, line: 1 }
  # ...
]

Parsing: Analyzing the tokens to understand the program structure (what to do and in which order) and building a data representation that holds that information (known as an Abstract Syntax Tree). For example:

Ruby

 ast = {
   node_type: :binary,
   operation: "+",
   left: {
     node_type: :number,
     value: 0
   },
   right: {
     node_type: :number,
     value: 1
   },
 }

Evaluating: Executing the parsed input and producing an output. This is where your code actually runs.
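To make these three steps concrete, here is a minimal sketch of a toy interpreter that only understands integers joined by + and -. The method names (tokenize, parse, evaluate) and the hash shapes are my own illustration, modeled on the examples above. This is not how Ruby or Prism implement any of it.

```ruby
# Step 1: lexing. Split the text into integer and operator tokens.
def tokenize(src)
  src.scan(/\d+|[+\-]/).map do |literal|
    if literal.match?(/\A\d+\z/)
      { type: :integer, literal: literal, value: literal.to_i }
    else
      { type: :operator, literal: literal, value: nil }
    end
  end
end

# Step 2: parsing. Fold the tokens into a left-associative tree.
def parse(tokens)
  tokens = tokens.dup
  ast = { node_type: :number, value: tokens.shift[:value] }

  until tokens.empty?
    operator = tokens.shift
    right = tokens.shift
    ast = {
      node_type: :binary,
      operation: operator[:literal],
      left: ast,
      right: { node_type: :number, value: right[:value] }
    }
  end

  ast
end

# Step 3: evaluating. Walk the tree and compute the result.
def evaluate(node)
  case node[:node_type]
  when :number
    node[:value]
  when :binary
    left = evaluate(node[:left])
    right = evaluate(node[:right])
    node[:operation] == "+" ? left + right : left - right
  end
end

puts evaluate(parse(tokenize("0 + 1"))) # prints 1
```

A real interpreter also has to deal with precedence, errors, and far more token types, which is exactly the work a parser like Prism takes off our hands.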

For a deeper dive into this topic, I recommend the Crafting Interpreters book or my RailsConf talk on the subject.

Now let's dive into what Prism can do and how it helps with parsing.

Why Is Prism Useful for Ruby Parsing?

Ruby historically used a parser generated from parse.y, a Yacc grammar file. The catch? It was tightly coupled to CRuby, forcing other Ruby implementations (like JRuby and TruffleRuby) to create their own parsers from scratch.

That's why tools like RuboCop, code editors, and even other Ruby implementations often lagged behind or had incompatibilities with newer Ruby syntax. Developers building Ruby analysis tools had to write their own parsers too, spawning projects like whitequark/parser and ruby_parser.

Prism solves this by becoming the de facto parser for all Ruby tools and implementations. And it's working: it is now used in CRuby, JRuby, TruffleRuby, Rails, RuboCop, and more.

Okay, enough talk. Prism can lex and parse Ruby, which allows us to build fun things. How about we build a ✨ transpiler ✨ with it? Wait! Don't go away. This will be simple. I promise.

Your First Transpiler

The full code for the examples in this post is available in this repository.

First, we'll build a tool that converts our Ruby code into Emoruby. If you have never seen Emoruby, this is what it looks like:

Ruby

📋 ❤️
  🔜 👋
    👀 💬😃 🌏💬
  🔚
🔚
 
❤️▪️🐣▪️👋

Equivalent to this in Ruby:

Ruby

class Heart
  def wave
    puts "smiley earth_asia"
  end
end
 
Heart.new.wave

Ready? Ok, here we go. We'll need a Gemfile to install emoruby and prism:

Ruby

source 'https://rubygems.org'
 
gem "emoruby", git: "https://github.com/searls/emoruby", branch: "master"
gem "prism"

After bundle installing, let's create the entry point for our transpiler — the Rubyemo.ruby_to_emoji method:

Ruby

require 'emoruby'
require 'prism'
 
module Rubyemo
  extend self
 
  def ruby_to_emoji(src)
    tokenize(src)
  end
 
  private
 
  def tokenize(src)
    result = Prism.lex(src)
    raise "Invalid Ruby code" if result.errors.any?
 
    result.value.map(&:first)
  end
end

For now, it only tokenizes the input with Prism. The lex method returns a result object that holds both the tokens and any errors found along the way. If the source code contains invalid Ruby, we'll just raise an exception.

Then we get the value attribute, which is a list of pairs: each token together with the lexer state at that point. We only care about the tokens, so we grab them with map(&:first).
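If you're curious what comes out of the lexer, here is a small sketch you can paste into irb. token_summary is a hypothetical helper name of mine; Prism.lex, value, errors, and the type and value readers on tokens are the same calls used above.

```ruby
require "prism"

# Returns [type, text] for every token Prism finds in the source.
def token_summary(src)
  result = Prism.lex(src)

  # Each entry in result.value is a [token, lexer_state] pair,
  # so we destructure it and ignore the state.
  result.value.map { |token, _lex_state| [token.type, token.value] }
end

token_summary("1 + 2").each do |type, text|
  puts "#{type}: #{text.inspect}"
end
```

Notice that the list ends with an EOF token, which is why the transpiler needs to skip it later on.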

Emojify the Ruby Tokens

Now onto the fun part. How do we emojify our Ruby tokens? Emoruby has a very simple design, so we can basically replace tokens one by one with an emoji alternative:

Ruby

module Rubyemo
  extend self
 
  def ruby_to_emoji(src)
    tokenize(src)
      .then { emojify it }
      .join
  end
 
  private
 
  def emojify(tokens)
    tokens.filter_map do |token|
      next if token.type == :EOF
 
      token_to_emoji(token)
    end
  end
 
  # ...
end

To do that mapping, we'll use Emoruby's translation file and EmojiData to translate token values to emojis by name.

Ruby

module Rubyemo
  # ...
 
  TRANSLATIONS = Emoruby::ConvertsRubyToEmoji::TRANSLATIONS
 
  def token_to_emoji(token)
    case token
    in {type: :COMMENT, value:}
      "💭#{value[1..]}"
    in {type: :IGNORED_NEWLINE | :NEWLINE}
      "\n"
    else
      token.value.to_s.split.map do |part|
        TRANSLATIONS[part] || EmojiData.from_short_name(part)&.to_s || part
      end.join(" ")
    end
  end
end

If a token has no mapping, we leave it as-is so the code still runs. Pattern matching is pretty handy here to match a particular type and grab its value at the same time.
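To see the pattern matching in isolation, here is a sketch that runs without Prism or Emoruby. Token is a stand-in Struct (structs support the same hash patterns), and SAMPLE_TRANSLATIONS is a tiny made-up table based on the emojis shown earlier, not Emoruby's real translation file.

```ruby
# Stand-in for Prism::Token, which also exposes type and value.
Token = Struct.new(:type, :value)

# Hypothetical mini translation table, for illustration only.
SAMPLE_TRANSLATIONS = {
  "class" => "📋",
  "def" => "🔜",
  "end" => "🔚",
  "puts" => "👀"
}.freeze

def sample_token_to_emoji(token)
  case token
  in { type: :COMMENT, value: }
    # `value:` binds the token's value to a local variable named value
    "💭#{value[1..]}"
  in { type: :IGNORED_NEWLINE | :NEWLINE }
    "\n"
  else
    token.value.to_s.split.map { |part| SAMPLE_TRANSLATIONS[part] || part }.join(" ")
  end
end

puts sample_token_to_emoji(Token.new(:KEYWORD_DEF, "def"))
puts sample_token_to_emoji(Token.new(:COMMENT, "# hello"))
puts sample_token_to_emoji(Token.new(:IDENTIFIER, "unknown_name"))
```

The third call shows the fallback: a token with no entry in the table comes back unchanged.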

And... that's it! Let's see our little transpiler in action:

Ruby

emo = Rubyemo.ruby_to_emoji('puts "Hello, world!"')
puts emo # 👀💬Hello, world!💬
 
Emoruby.eval(emo) # Hello, world!

Fix Indentation and Spacing

You might not have noticed, but our code doesn't handle indentation or spacing. See how the space between puts and the opening quote got lost? We can do better than this, so let's handle that. Luckily, the Prism::Token instances have information about their location, including the start/end line and column.

Let's change emojify to this:

Ruby

def emojify(tokens)
  previous_line, previous_column = 1, 0
 
  tokens.filter_map do |token|
    next if token.type == :EOF
 
    emoji = token_to_emoji(token)
    indentation, previous_line, previous_column = indentation_for(
      token,
      previous_line,
      previous_column
    )
 
    indentation + emoji
  end
 
  # ...
end

And implement indentation_for:

Ruby

def indentation_for(token, previous_line, previous_column)
  if token.location.start_line != previous_line
    previous_line = token.location.start_line
    previous_column = 0
  end
  indentation = " " * (token.location.start_column - previous_column)
  previous_column = token.location.end_column
 
  [indentation, previous_line, previous_column]
end
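A quick way to convince yourself that the column math works is to skip the emoji step and glue the original token text back together. The sketch below does that; rebuild_spacing is a hypothetical name, and the logic is the same as indentation_for, inlined so the snippet runs on its own.

```ruby
require "prism"

# Rebuilds the source from its tokens, using only token locations
# to decide how many spaces go in front of each one.
def rebuild_spacing(src)
  previous_line = 1
  previous_column = 0

  Prism.lex(src).value.map(&:first).filter_map do |token|
    next if token.type == :EOF

    location = token.location
    if location.start_line != previous_line
      # First token on a new line: measure from column 0
      previous_line = location.start_line
      previous_column = 0
    end

    gap = " " * (location.start_column - previous_column)
    previous_column = location.end_column

    gap + token.value
  end.join
end

puts rebuild_spacing('puts    "hi"')
```

If the spaces between puts and the string survive the round trip, the same gaps will survive once we swap in the emojis.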

Try emojifying this now:

Ruby

ruby = <<~RUBY
  class Heart
    public def jeans
      puts "purse"
    end
 
    protected def shirt
      puts "yellow_heart"
    end
 
    private def wave
      puts "smiley earth_asia"
    end
  end
 
  Heart.new.wave
RUBY
 
puts Rubyemo.ruby_to_emoji(ruby)
# 📋 ❤️
#   🔓 🔜 👖
#     👀 💬👛💬
#   🔚
#
#   🔒 🔜 👕
#     👀 💬💛💬
#   🔚
#
#   ⛔️ 🔜 👋
#     👀 💬😃 🌏💬
#   🔚
# 🔚
#
# ❤️▪️🐣▪️👋

Done — for real.

Onto Parsing with Prism for Ruby

Rubyemo helped us to learn Prism lexing, but we didn't need any parsing. Let's try another example. Let's say you learned about Ruby 3.2's Data class and you want to rewrite your old structs with it. Why spend 10 minutes doing it manually when you can write a script in one hour that does it?

Note: If you think this sounds like a RuboCop cop, you're 100% correct! You could turn this into a custom cop if you wanted.

Prism.parse gives us the AST, but we also need a way to walk it. For that, we'll use Prism's Visitor class. The visitor design pattern allows us to add new operations to objects (in this case, the Prism AST nodes) without changing their classes. In other words, for each node type it finds, the Visitor class will call a method, and we decide what to do in that situation.

We need to act when we find Struct.new, so that's a method call, which in Prism is identified by the CallNode type. Let's filter those:

Ruby

require "prism"
 
class StructToData < Prism::Visitor
  Fix = Data.define(:location, :replacement)
 
  attr_reader :fixes
 
  def initialize(src)
    super()
    @src = src
    @fixes = []
  end
 
  def visit_call_node(node)
    if struct_new?(node)
      # todo
    end
 
    super
  end
 
  private
 
  def struct_new?(node)
    node.name == :new &&
      node.receiver.is_a?(Prism::ConstantReadNode) &&
      node.receiver.name == :Struct
  end
end

Note that we need to call super on the visit methods, which makes Prism keep walking inner nodes (like the body of a class definition).
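Here is a small sketch of what goes wrong when super is left out. ShallowCalls, DeepCalls, and call_names are hypothetical names for this illustration; the Prism calls (Prism.parse, Prism::Visitor, visit) are the ones used in this post.

```ruby
require "prism"

# Records call names but never descends into child nodes.
class ShallowCalls < Prism::Visitor
  attr_reader :names

  def initialize
    super()
    @names = []
  end

  def visit_call_node(node)
    @names << node.name
    # no super here, so the walk stops at the first call it meets
  end
end

# Same thing, but super keeps the traversal going.
class DeepCalls < Prism::Visitor
  attr_reader :names

  def initialize
    super()
    @names = []
  end

  def visit_call_node(node)
    @names << node.name
    super
  end
end

def call_names(visitor_class, src)
  visitor = visitor_class.new
  visitor.visit(Prism.parse(src).value)
  visitor.names
end

p call_names(ShallowCalls, "outer(inner(1))") # prints [:outer]
p call_names(DeepCalls, "outer(inner(1))")    # prints [:outer, :inner]
```

Without super, a Struct.new nested inside another call's arguments or block would simply never be visited.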

Now we need to collect the struct arguments and build our fix object with the replacement code. We'll also skip named structs, as those don't map 1:1 to Data classes:

Ruby

# ...
def visit_call_node(node)
  if struct_new?(node) && !named_struct?(node)
    members = struct_members(node)
    replacement = build_replacement(members, node.block)
    @fixes << Fix.new(node.location, replacement)
  end
 
  super
end
 
private
 
# ...
 
# skips interpolated symbols for simplicity
def struct_members(node)
  (node.arguments&.arguments || [])
    .take_while { it.is_a?(Prism::SymbolNode) }
    .map(&:slice)
end
 
def named_struct?(node)
  (node.arguments&.arguments || [])
    .first
    .is_a?(Prism::StringNode)
end
 
def build_replacement(node_members, node_block)
  call = "Data.define(#{node_members.join(", ")})"
  if node_block
    call += " #{node_block.slice}"
  end
  call
end
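To see why take_while and slice are enough here, it helps to look at the argument nodes Prism produces for a typical call. This is a sketch with hypothetical helper names (first_call, argument_summary). It uses an explicit block parameter instead of it, so it also runs on Ruby versions before 3.4.

```ruby
require "prism"

# Returns the first top-level statement of the parsed source.
def first_call(src)
  Prism.parse(src).value.statements.body.first
end

# Lists each argument's node class (without the Prism:: prefix)
# next to the exact source text it covers.
def argument_summary(src)
  call = first_call(src)

  (call.arguments&.arguments || []).map do |arg|
    [arg.class.name.split("::").last, arg.slice]
  end
end

argument_summary("Struct.new(:x, :y, keyword_init: true)").each do |node_class, text|
  puts "#{node_class}: #{text}"
end
```

The two symbols come first and the keyword arguments arrive as a single node of a different class, so take_while stops right before keyword_init: true. For a named struct like Struct.new("Point", :x), the first argument is a StringNode, which is what named_struct? checks for.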

Make Fixes with a Method

Now let's write a method to apply the fixes. We'll process them in ascending order of start position, so offsets remain correct as we build the new source.

Ruby

class StructToData < Prism::Visitor
  #...
  def self.rewrite(source)
    ast = Prism.parse(source)
    return [source, []] unless ast.success?
 
    v = new(source)
    v.visit(ast.value)
    [v.apply_fixes, v.fixes]
  end
 
  # ...

  # Public on purpose: self.rewrite calls it on the visitor instance,
  # and a private method can't be called with an explicit receiver.
  def apply_fixes
    return @src if @fixes.empty?

    pos = 0
    out = +""
    @fixes.sort_by { it.location.start_offset }.each do |fix|
      out << @src.byteslice(pos...fix.location.start_offset)
      out << fix.replacement
      pos = fix.location.end_offset
    end
    out << @src.byteslice(pos..-1)

    out
  end
end

Note: We have to use byteslice because Prism offsets are in bytes. Character-based indexing, such as String#[]=, would cut at the wrong position whenever the source contains multibyte characters.
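The difference is easy to reproduce without Prism. In this sketch, the constants and helper names are made up for the demonstration, and the byte offset is computed by hand rather than taken from a real Prism location.

```ruby
# A line of Ruby with two multibyte characters before the struct.
SOURCE = %(label = "名前"; Point = Struct.new(:x))
PREFIX = %(label = "名前"; Point = )

# The same position in the string, measured two ways.
BYTE_OFFSET = PREFIX.bytesize # what a byte-based offset would be
CHAR_OFFSET = PREFIX.length   # what character indexing expects

def slice_by_bytes(src, offset, length) = src.byteslice(offset, length)
def slice_by_chars(src, offset, length) = src[offset, length]

puts BYTE_OFFSET - CHAR_OFFSET                   # prints 4
puts slice_by_bytes(SOURCE, BYTE_OFFSET, 6)      # prints Struct
puts slice_by_chars(SOURCE, BYTE_OFFSET, 6)      # prints ct.new
```

Each of the two kanji takes three bytes in UTF-8, so the byte offset runs four positions ahead of the character index, and character-based slicing lands in the middle of the word.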

This is enough for us to test our code. Let's see it in action:

Ruby

source = "Point = Struct.new(:x, :y, keyword_init: true)"
 
rewritten, _fixes = StructToData.rewrite(source)
 
rewritten == "Point = Data.define(:x, :y)" # => true

It works!

Mutation: The Source of All Evil

There's one big problem with our current approach: Data objects are immutable, while structs aren't, so we can't always convert them. We have to also skip structs that mutate internal state.
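Here is a short sketch of that difference, assuming Ruby 3.2 or newer. MutablePoint, FrozenPoint, and try_setter are hypothetical names for this example.

```ruby
MutablePoint = Struct.new(:x, :y)
FrozenPoint = Data.define(:x, :y)

# Tries to assign through the setter and reports what happened.
def try_setter(point)
  point.x = 10
  :mutated
rescue NoMethodError
  # Data classes don't define setters at all
  :no_setter
end

p try_setter(MutablePoint.new(1, 2))  # prints :mutated
p try_setter(FrozenPoint.new(1, 2))   # prints :no_setter
p FrozenPoint.new(1, 2).frozen?       # prints true

# The Data way to "change" a value is to build a modified copy.
p FrozenPoint.new(1, 2).with(x: 10).x # prints 10
```

So a struct whose block calls self.x = ... or writes to an instance variable would start raising errors after the conversion.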

We'll use an inner visitor to check if a struct body contains mutations:

Ruby

class MutationScanner < Prism::Visitor
  # Operators that end in "=" but don't assign anything
  COMPARISONS = %i[== != <= >= ===].freeze

  def initialize
    super
    @mutates = false
  end

  def mutates? = @mutates

  def visit_call_node(n)
    # self.x = ..., self[:k] = ...
    if n.receiver.is_a?(Prism::SelfNode) && setter_name?(n.name)
      @mutates = true
    end

    # adding writers via macros
    if n.name == :attr_writer || n.name == :attr_accessor
      @mutates = true
    end

    # define_method(:x=) { ... }
    if n.name == :define_method
      arg = n.arguments&.arguments&.first
      if (arg.is_a?(Prism::SymbolNode) || arg.is_a?(Prism::StringNode)) && setter_name?(arg.unescaped)
        @mutates = true
      end
    end

    super
  end

  # def x=(...) / def []=(...)
  def visit_def_node(n)
    @mutates = true if setter_name?(n.name)
    super
  end

  # True for names like x= or []=, but not for comparisons such as ==
  private def setter_name?(name)
    name.to_s.end_with?("=") && !COMPARISONS.include?(name.to_sym)
  end
end

This catches many cases, but Ruby offers many more ways to mutate an object's state: writing to instance variables directly, compound assignments such as +=, or memoization with ||=, to name a few. It would be a chore to define a visit method for each one manually.

Luckily, Prism names the nodes for these write operations consistently (they all contain "Write"), so we'll just flag any node that performs a write:

Ruby

class MutationScanner < Prism::Visitor
  # ... after def visit_def_node
 
  Prism
    .constants
    .filter_map do |const_name|
      next if const_name !~ /Write/ || const_name =~ /GlobalVariable|LocalVariable|Constant/
 
      Prism.const_get(const_name)
    end
    .each do |node_class|
      define_method("visit_#{node_class.type}") do |n|
        @mutates = true
        super(n)
      end
    end
end

So we get all "write" constants and define methods, but ignore writing to constants, local variables, and global variables (as those don't change a struct's internal state).
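To check which node classes this trick picks up, here is a stripped-down sketch that contains only the generated visit methods. MiniScanner and mutates_state? are hypothetical names. I also added a guard that keeps only Prism::Node subclasses, which is an extra precaution of mine and not part of the code above.

```ruby
require "prism"

class MiniScanner < Prism::Visitor
  # Every Prism node class with "Write" in its name, minus the ones
  # that write to globals, locals, or constants.
  WRITE_NODES = Prism.constants.filter_map do |name|
    next unless name.to_s.include?("Write")
    next if name.to_s.match?(/GlobalVariable|LocalVariable|Constant/)

    const = Prism.const_get(name)
    const if const.is_a?(Class) && const < Prism::Node
  end

  WRITE_NODES.each do |node_class|
    # super needs explicit arguments inside define_method
    define_method("visit_#{node_class.type}") do |node|
      @mutates = true
      super(node)
    end
  end

  def initialize
    super()
    @mutates = false
  end

  def mutates? = @mutates
end

def mutates_state?(src)
  scanner = MiniScanner.new
  scanner.visit(Prism.parse(src).value)
  scanner.mutates?
end

puts MiniScanner::WRITE_NODES.map(&:name).sort
p mutates_state?("def total\n  @total ||= x + y\nend")   # prints true
p mutates_state?("def total\n  sum = x + y\n  sum\nend") # prints false
```

The memoized version is flagged because @total ||= ... is an instance variable write, while assigning to the local variable sum is ignored.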

Let's wire up this scanner now:

Ruby

class StructToData < Prism::Visitor
  #...
  def visit_call_node(node)
    if struct_new?(node) && !named_struct?(node) && !mutates_instance_state?(node.block)
      # build fix
    end

    super
  end
 
  # ...
 
  def mutates_instance_state?(block_node)
    return false if block_node.nil?
 
    scanner = MutationScanner.new
    scanner.visit(block_node)
    scanner.mutates?
  end
end

That's it! Now our rewriter is ready to dance. Try it out with some of your structs.

Wrapping Up

Prism has already reshaped the Ruby landscape by making our tools faster, more portable, and more consistent. But its real impact will come from what you build with it.

Think bigger than just parsing: a Ruby-to-JS transpiler, a test runner that knows exactly which test to run from a file and line number, or even something that turns your code into pixel art. The parser is no longer the bottleneck — your imagination is.

Go make something amazing!