An Introduction to Ruby Parsing with Prism

Matheus Richard

You might have heard about Prism, the new Ruby parser. Perhaps you've heard it's faster, more reliable, and more powerful than what we had before. Or maybe you never took a compilers class and aren't sure about what this actually means.

I'm here to tell you all about it, and how it's changing our lives as Ruby developers. Today, I want to take you from square one to writing your first transpiler.

Interpreters 101

Before we begin our journey, let's start with the basics of how an interpreter works, so we're all on the same page. Interpreting a programming language usually involves three main steps:

  1. Tokenizing input (a.k.a. lexing): Breaking the input text into a list of meaningful tokens. That's like converting your code into something like this:
Ruby
tokens = [
  { type: :integer, literal: "0", value: 0, line: 1 },
  { type: :operator, literal: "+", value: nil, line: 1 },
  { type: :integer, literal: "1", value: 1, line: 1 },
  { type: :keyword, literal: "if", value: nil, line: 1 },
  { type: :identifier, literal: "admin?", value: nil, line: 1 }
  # ...
]
  2. Parsing: Analyzing the tokens to understand the program structure (what to do and in which order) and building a data representation that holds that information (known as an Abstract Syntax Tree). For example:
Ruby
ast = {
  node_type: :binary,
  operation: "+",
  left: { node_type: :number, value: 0 },
  right: { node_type: :number, value: 1 },
}
  3. Evaluating: Executing the parsed input and producing an output. This is where your code actually runs.
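
To make the three steps concrete, here is a minimal sketch of a toy interpreter for expressions like 0 + 1. It has nothing to do with how Prism or CRuby work internally; the TinyInterpreter module and its method names are made up for illustration, and it only understands non-negative integers with + and -.

```ruby
require "strscan"

# Hypothetical toy interpreter, for illustration only.
module TinyInterpreter
  module_function

  # Step 1: lexing. Turns "0 + 1" into a flat list of tokens.
  def tokenize(source)
    scanner = StringScanner.new(source)
    tokens = []
    until scanner.eos?
      if scanner.skip(/\s+/)
        next
      elsif (digits = scanner.scan(/\d+/))
        tokens << { type: :integer, literal: digits, value: digits.to_i }
      elsif (op = scanner.scan(/[+\-]/))
        tokens << { type: :operator, literal: op, value: nil }
      else
        raise ArgumentError, "unexpected character: #{scanner.getch.inspect}"
      end
    end
    tokens
  end

  # Step 2: parsing. Builds a left-associative tree of :binary nodes.
  def parse(tokens)
    tokens = tokens.dup
    left = parse_number(tokens)
    while tokens.first&.fetch(:type) == :operator
      operation = tokens.shift[:literal]
      right = parse_number(tokens)
      left = { node_type: :binary, operation: operation, left: left, right: right }
    end
    raise ArgumentError, "unexpected token: #{tokens.first[:literal]}" unless tokens.empty?
    left
  end

  def parse_number(tokens)
    token = tokens.shift
    raise ArgumentError, "expected a number" unless token && token[:type] == :integer
    { node_type: :number, value: token[:value] }
  end

  # Step 3: evaluating. Walks the tree and computes the result.
  def evaluate(node)
    case node[:node_type]
    when :number then node[:value]
    when :binary
      evaluate(node[:left]).public_send(node[:operation], evaluate(node[:right]))
    end
  end
end

ast = TinyInterpreter.parse(TinyInterpreter.tokenize("0 + 1"))
puts TinyInterpreter.evaluate(ast) # prints 1
```

Each step only consumes the output of the previous one, which is why a parser like Prism can be shared between very different tools.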

For a deeper dive into this topic, I recommend the Crafting Interpreters book or my RailsConf talk on the subject.

Now let's dive into what Prism can do and how it helps with parsing.

Why Is Prism Useful for Ruby Parsing?

Ruby's parser was historically generated from a grammar file called parse.y, using a Yacc-style parser generator. The catch? It was made specifically for CRuby, forcing other Ruby implementations (like JRuby and TruffleRuby) to create their own parsers from scratch.

That's why tools like RuboCop, code editors, and even other Ruby implementations often lagged behind or had incompatibilities with newer Ruby syntax. Developers building Ruby analysis tools had to write their own parsers too, spawning projects like whitequark/parser and ruby_parser.

Prism solves this by becoming the de facto parser for all Ruby tools and implementations. And it's working: it is now used in CRuby, JRuby, TruffleRuby, Rails, RuboCop, and more.

Okay, enough talk. Prism can lex and parse Ruby, which allows us to build fun things. How about we build a ✨ transpiler ✨ with it? Wait! Don't go away. This will be simple. I promise.

Your First Transpiler

The full code for the examples in this post is available in this repository.

First, we'll build a tool that converts our Ruby code into Emoruby. If you have never seen Emoruby, this is what it looks like:

Ruby
📋 ❤️
  🔜 👋
    👀 💬😃 🌏💬
  🔚
🔚

❤️▪️🐣▪️👋

Equivalent to this in Ruby:

Ruby
class Heart
  def wave
    puts "smiley earth_asia"
  end
end

Heart.new.wave

Ready? Ok, here we go. We'll need a Gemfile to install emoruby and prism:

Ruby
source 'https://rubygems.org'

gem "emoruby", git: "https://github.com/searls/emoruby", branch: "master"
gem "prism"

After running bundle install, let's create the entry point for our transpiler, the Rubyemo.ruby_to_emoji method:

Ruby
require 'emoruby'
require 'prism'

module Rubyemo
  extend self

  def ruby_to_emoji(src)
    tokenize(src)
  end

  private

  def tokenize(src)
    result = Prism.lex(src)
    raise "Invalid Ruby code" if result.errors.any?

    result.value.map(&:first)
  end
end

For now, it only tokenizes the input with Prism. The lex method returns a result object that holds the tokens along with any errors found. If the source code contains invalid Ruby, we'll just raise an exception.

Then we get the value attribute, which is a list of pairs: each token together with the lexer state at that point. We only care about the tokens, so we grab them with map(&:first).
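
If you want to poke at the lexer output yourself, here is a small sketch. The lex_types helper is a name I made up for this example; it only relies on Prism.lex, the result's errors and value, and each token's type.

```ruby
require "prism"

# Hypothetical helper: returns the token types Prism produces for a snippet.
def lex_types(src)
  result = Prism.lex(src)
  raise "Invalid Ruby code" if result.errors.any?

  # Each entry in result.value is a [token, lexer_state] pair.
  result.value.map { |token, _lex_state| token.type }
end

p lex_types('puts "hi"')

# Invalid code still produces a result object; the problems land in #errors.
p Prism.lex("1 +").errors.map(&:message)
```

The last token is always :EOF, which is why the transpiler skips it later on.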

Emojify the Ruby Tokens

Now onto the fun part. How do we emojify our Ruby tokens? Emoruby has a very simple design, so we can basically replace tokens one by one with an emoji alternative:

Ruby
module Rubyemo
  extend self

  def ruby_to_emoji(src)
    tokenize(src)
      .then { emojify it }
      .join
  end

  private

  def emojify(tokens)
    tokens.filter_map do |token|
      next if token.type == :EOF

      token_to_emoji(token)
    end
  end

  # ...
end

To do that mapping, we'll use Emoruby's translation file and EmojiData to translate token values to emojis by name.

Ruby
module Rubyemo
  # ...

  TRANSLATIONS = Emoruby::ConvertsRubyToEmoji::TRANSLATIONS

  def token_to_emoji(token)
    case token
    in {type: :COMMENT, value:}
      # Drop the leading "#" and mark the comment with a thought balloon
      "💭#{value[1..]}"
    in {type: :IGNORED_NEWLINE | :NEWLINE}
      "\n"
    else
      token.value.to_s.split.map do |part|
        TRANSLATIONS[part] || EmojiData.from_short_name(part)&.to_s || part
      end.join(" ")
    end
  end
end

If a token has no mapping, we leave it as-is so the code still runs. Pattern matching is pretty handy here to match a particular type and grab its value at the same time.
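
Here is a dependency-free sketch of the same idea, in case you want to try the pattern matching without installing Emoruby. TINY_TRANSLATIONS is a tiny hand-written table standing in for Emoruby's translation file, and the tiny_ method names are made up for this example.

```ruby
require "prism"

# Hypothetical stand-in for Emoruby's translation file.
TINY_TRANSLATIONS = {
  "puts" => "👀",
  "\"" => "💬",
  "def" => "🔜",
  "end" => "🔚"
}.freeze

def tiny_token_to_emoji(token)
  # Prism::Token supports hash patterns, so we can match on the type
  # and bind the value in a single step.
  case token
  in { type: :COMMENT, value: }
    "💭#{value[1..]}"
  in { type: :IGNORED_NEWLINE | :NEWLINE }
    "\n"
  in { value: }
    # Unknown tokens are kept as-is so the code still runs.
    TINY_TRANSLATIONS.fetch(value, value)
  end
end

def tiny_emojify(src)
  Prism.lex(src).value
    .map { |token, _lex_state| token }
    .reject { |token| token.type == :EOF }
    .map { |token| tiny_token_to_emoji(token) }
    .join
end

puts tiny_emojify('puts "hi"') # prints 👀💬hi💬
```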

And... that's it! Let's see our little transpiler in action:

Ruby
emo = Rubyemo.ruby_to_emoji('puts "Hello, world!"')
puts emo
# 👀💬Hello, world!💬

Emoruby.eval(emo)
# Hello, world!

Fix Indentation and Spacing

You might not have noticed, but our code doesn't handle indentation or spacing. See how the space between puts and the opening quote got lost? We can do better than this, so let's handle that. Luckily, the Prism::Token instances have information about their location, including the start/end line and column.

Let's change emojify to this:

Ruby
def emojify(tokens)
  previous_line, previous_column = 1, 0

  tokens.filter_map do |token|
    next if token.type == :EOF

    emoji = token_to_emoji(token)
    indentation, previous_line, previous_column = indentation_for(
      token, previous_line, previous_column
    )

    indentation + emoji
  end
  # ...
end

And implement indentation_for:

Ruby
def indentation_for(token, previous_line, previous_column)
  if token.location.start_line != previous_line
    previous_line = token.location.start_line
    previous_column = 0
  end

  indentation = " " * (token.location.start_column - previous_column)
  previous_column = token.location.end_column

  [indentation, previous_line, previous_column]
end
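
To see the location data that indentation_for relies on, here is a small sketch that lists each token with its line and column range. The token_columns helper is a name invented for this example; the location methods are the same ones used above.

```ruby
require "prism"

# Hypothetical helper: [value, start_line, start_column, end_column] per token.
def token_columns(src)
  Prism.lex(src).value
    .map { |token, _lex_state| token }
    .reject { |token| token.type == :EOF }
    .map do |token|
      location = token.location
      [token.value, location.start_line, location.start_column, location.end_column]
    end
end

# Note the two spaces between puts and the opening quote.
token_columns('puts  "hi"').each { |row| p row }
# ["puts", 1, 0, 4]
# ["\"", 1, 6, 7]
# ["hi", 1, 7, 9]
# ["\"", 1, 9, 10]
```

The gap between the end column of puts (4) and the start column of the quote (6) is exactly the whitespace we need to put back.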

Try emojifying this now:

Ruby
ruby = <<~RUBY
  class Heart
    public def jeans
      puts "purse"
    end

    protected def shirt
      puts "yellow_heart"
    end

    private def wave
      puts "smiley earth_asia"
    end
  end

  Heart.new.wave
RUBY

puts Rubyemo.ruby_to_emoji(ruby)
# 📋 ❤️
#   🔓 🔜 👖
#     👀 💬👛💬
#   🔚
#
#   🔒 🔜 👕
#     👀 💬💛💬
#   🔚
#
#   ⛔️ 🔜 👋
#     👀 💬😃 🌏💬
#   🔚
# 🔚
#
# ❤️▪️🐣▪️👋

Done, for real.

Onto Parsing with Prism for Ruby

Rubyemo helped us to learn Prism lexing, but we didn't need any parsing. Let's try another example. Let's say you learned about Ruby 3.2's Data class and you want to rewrite your old structs with it. Why spend 10 minutes doing it manually when you can write a script in one hour that does it?

Note: If you think this sounds like a RuboCop cop, you're 100% correct! You could turn this into a custom cop if you wanted.

Prism's parse method gives us the AST, but to walk it, we'll use Prism's Visitor class. The visitor design pattern allows us to add new operations to objects (in this case, the Prism AST nodes) without changing their classes. In other words, for each node type it finds, the Visitor class will call a method, and we decide what to do in that situation.
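
As a warm-up, here is a minimal visitor sketch that just records the name of every method call it sees. CallNameCollector and call_names are names made up for this example; the only Prism pieces involved are Prism.parse, Prism::Visitor, and the visit_call_node hook.

```ruby
require "prism"

# Hypothetical visitor: collects the name of every method call in the tree.
class CallNameCollector < Prism::Visitor
  attr_reader :names

  def initialize
    super
    @names = []
  end

  # Called once for every CallNode the visitor reaches.
  def visit_call_node(node)
    @names << node.name
    super # keep walking into the receiver, arguments, and block
  end
end

def call_names(src)
  collector = CallNameCollector.new
  collector.visit(Prism.parse(src).value)
  collector.names
end

p call_names("foo(bar(1))\nbaz.qux").sort # prints [:bar, :baz, :foo, :qux]
```

Node types without a matching visit method are simply walked through, so we only write code for the nodes we care about.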

We need to act when we find Struct.new, so that's a method call, which in Prism is identified by the CallNode type. Let's filter those:

Ruby
require "prism"

class StructToData < Prism::Visitor
  Fix = Data.define(:location, :replacement)

  attr_reader :fixes

  def initialize(src)
    super()
    @src = src
    @fixes = []
  end

  def visit_call_node(node)
    if struct_new?(node)
      # todo
    end

    super
  end

  private

  def struct_new?(node)
    node.name == :new &&
      node.receiver.is_a?(Prism::ConstantReadNode) &&
      node.receiver.name == :Struct
  end
end

Note that we need to call super on the visit methods, which makes Prism keep walking inner nodes (like the body of a class definition).

Now we need to collect the struct arguments and build our fix object with the replacement code. We'll also skip named structs, as those don't map 1:1 to Data classes:

Ruby
# ...
def visit_call_node(node)
  if struct_new?(node) && !named_struct?(node)
    members = struct_members(node)
    replacement = build_replacement(members, node.block)

    @fixes << Fix.new(node.location, replacement)
  end

  super
end

private

# ...

# skips interpolated symbols for simplicity
def struct_members(node)
  (node.arguments&.arguments || [])
    .take_while { it.is_a?(Prism::SymbolNode) }
    .map(&:slice)
end

def named_struct?(node)
  (node.arguments&.arguments || [])
    .first
    .is_a?(Prism::StringNode)
end

def build_replacement(node_members, node_block)
  call = "Data.define(#{node_members.join(", ")})"

  if node_block
    call += " #{node_block.slice}"
  end

  call
end
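
To see why take_while and the StringNode check are enough, it helps to look at how Prism represents the arguments of a Struct.new call. This is a small exploratory sketch; argument_summary is a helper name invented for this example.

```ruby
require "prism"

# Hypothetical helper: [node class, source text] for each argument of the
# first statement in src, which is assumed to be a method call.
def argument_summary(src)
  call = Prism.parse(src).value.statements.body.first
  (call.arguments&.arguments || []).map do |arg|
    [arg.class.name.split("::").last, arg.slice]
  end
end

argument_summary('Struct.new("Point", :x, keyword_init: true)').each { |row| p row }
# ["StringNode", "\"Point\""]
# ["SymbolNode", ":x"]
# ["KeywordHashNode", "keyword_init: true"]
```

A leading StringNode marks a named struct, and the trailing KeywordHashNode is where take_while stops collecting members.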

Make Fixes with a Method

Now let's write a method to apply the fixes. We'll process them in ascending order of start position, so offsets remain correct as we build the new source.

Ruby
class StructToData < Prism::Visitor
  #...
  def self.rewrite(source)
    ast = Prism.parse(source)
    return [source, []] unless ast.success?

    v = new(source)
    v.visit(ast.value)
    [v.apply_fixes, v.fixes]
  end

  # Kept public (above the private section): self.rewrite calls it
  # on an instance with an explicit receiver.
  def apply_fixes
    return @src if @fixes.empty?

    pos = 0
    out = +""

    @fixes.sort_by { it.location.start_offset }.each do |fix|
      # Copy the untouched source before this fix, then the replacement
      out << @src.byteslice(pos...fix.location.start_offset)
      out << fix.replacement
      pos = fix.location.end_offset
    end

    out << @src.byteslice(pos..-1)
    out
  end

  # ...
end

Note: We have to use byteslice because Prism offsets are in bytes. Something like String#[] indexes by character, so it would cut at the wrong place as soon as the source contains multibyte characters.
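
Here is a plain Ruby sketch of the same splice logic, without Prism, so you can see the byte versus character difference in isolation. apply_replacements and byte_range_of are hypothetical helpers; the byte ranges are computed by hand here, standing in for the offsets Prism would give us.

```ruby
# Hypothetical helper: replacements is a list of
# [start_byte_offset, end_byte_offset, replacement_text] triples.
def apply_replacements(source, replacements)
  pos = 0
  out = +""

  replacements.sort_by { |start_offset, _end_offset, _text| start_offset }
    .each do |start_offset, end_offset, text|
      out << source.byteslice(pos...start_offset)
      out << text
      pos = end_offset
    end

  out << source.byteslice(pos..-1)
  out
end

# Hypothetical helper: byte range of the first occurrence of snippet.
def byte_range_of(source, snippet)
  start_offset = source.b.index(snippet.b)
  [start_offset, start_offset + snippet.bytesize]
end

source = "# café\nA = Struct.new(:x)\nB = Struct.new(:y)"
fixes = [
  [*byte_range_of(source, "Struct.new(:y)"), "Data.define(:y)"],
  [*byte_range_of(source, "Struct.new(:x)"), "Data.define(:x)"]
]

puts apply_replacements(source, fixes)
# # café
# A = Data.define(:x)
# B = Data.define(:y)

start_offset, = byte_range_of(source, "Struct.new(:x)")
p source.byteslice(start_offset, 6) # prints "Struct"
p source[start_offset, 6]           # prints "truct.", off by one because of the é
```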

This is enough for us to test our code. Let's see it in action:

Ruby
source = "Point = Struct.new(:x, :y, keyword_init: true)"

rewritten, _fixes = StructToData.rewrite(source)

rewritten == "Point = Data.define(:x, :y)" # => true

It works!

Mutation: The Source of All Evil

There's one big problem with our current approach: Data objects are immutable, while structs aren't, so we can't always convert them. We have to also skip structs that mutate internal state.
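
A quick sketch of the difference, assuming Ruby 3.2 or newer (StructPoint and DataPoint are throwaway names for this example):

```ruby
StructPoint = Struct.new(:x, :y)
DataPoint = Data.define(:x, :y)

struct_point = StructPoint.new(1, 2)
struct_point.x = 10                # structs come with writers
p struct_point.x                   # prints 10

data_point = DataPoint.new(x: 1, y: 2)
p data_point.respond_to?(:x=)      # prints false, no writer is defined
p data_point.frozen?               # prints true

# "Changing" a Data object means building a new one.
moved = data_point.with(x: 10)
p [data_point.x, moved.x]          # prints [1, 10]
```

Any struct whose methods rely on those writers would break after a blind conversion.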

We'll use an inner visitor to check if a struct body contains mutations:

Ruby
class MutationScanner < Prism::Visitor
  def initialize
    super
    @mutates = false
  end

  def mutates? = @mutates

  def visit_call_node(n)
    # self.x = ..., self[:k] = ...
    if n.receiver.is_a?(Prism::SelfNode) && n.name.to_s.end_with?("=")
      @mutates = true
    end

    # adding writers via macros
    if n.name == :attr_writer || n.name == :attr_accessor
      @mutates = true
    end

    # define_method(:x=) { ... }
    if n.name == :define_method
      arg = n.arguments&.arguments&.first
      if (arg.is_a?(Prism::SymbolNode) || arg.is_a?(Prism::StringNode)) && arg.unescaped.end_with?("=")
        @mutates = true
      end
    end

    super
  end

  # def x=(...) / def []=(...)
  def visit_def_node(n)
    if n.name.to_s.end_with?("=")
      @mutates = true
    end

    super
  end
end

This catches many cases, but there are many more ways to mutate a struct's state in Ruby: writing to ivars directly, using attribute writers, or memoization, to name a few. It would be a chore to define a visit method for each one manually.
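
To get a feel for that chore, here is a sketch of the manual approach covering just two of those node types: a plain ivar write (@x = 1) and a memoizing write (@x ||= 1). IvarWriteScanner and writes_ivar? are names made up for this example.

```ruby
require "prism"

# Hypothetical scanner: one hand-written visit method per kind of write.
class IvarWriteScanner < Prism::Visitor
  def initialize
    super
    @found = false
  end

  def found? = @found

  # @x = 1
  def visit_instance_variable_write_node(node)
    @found = true
    super
  end

  # @x ||= 1
  def visit_instance_variable_or_write_node(node)
    @found = true
    super
  end
end

def writes_ivar?(src)
  scanner = IvarWriteScanner.new
  scanner.visit(Prism.parse(src).value)
  scanner.found?
end

p writes_ivar?("Struct.new(:x) do\n  def reset\n    @x = 0\n  end\nend")      # prints true
p writes_ivar?("Struct.new(:x) do\n  def cached\n    @y ||= 1\n  end\nend")   # prints true
p writes_ivar?("Struct.new(:x) do\n  def double\n    x * 2\n  end\nend")      # prints false
```

This still misses @x += 1, @x &&= 1, class variables, and more, each of which has its own node type.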

Luckily, Prism consistently names the nodes for these mutation methods, so we'll just flag any nodes that perform a write:

Ruby
class MutationScanner < Prism::Visitor
  # ... after def visit_def_node

  Prism
    .constants
    .filter_map do |const_name|
      next if const_name !~ /Write/ || const_name =~ /GlobalVariable|LocalVariable|Constant/

      Prism.const_get(const_name)
    end
    .each do |node_class|
      # e.g. visit_instance_variable_write_node
      define_method("visit_#{node_class.type}") do |n|
        @mutates = true
        super(n)
      end
    end
end

So we get all "write" constants and define methods, but ignore writing to constants, local variables, and global variables (as those don't change a struct's internal state).
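
If you're curious which node types that filter actually picks up, this sketch lists them. The regular expressions are the ones from the scanner above; WRITE_NODE_TYPES is a name made up for this example, and the extra class check is a defensive assumption of mine in case a matching constant is not a node class.

```ruby
require "prism"

# Hypothetical constant: node types the generated visit methods would cover.
WRITE_NODE_TYPES = Prism.constants.filter_map do |const_name|
  next if const_name !~ /Write/ || const_name =~ /GlobalVariable|LocalVariable|Constant/

  node_class = Prism.const_get(const_name)
  # Defensive assumption: only keep actual node classes.
  next unless node_class.is_a?(Class) && node_class < Prism::Node

  node_class.type
end.sort

puts WRITE_NODE_TYPES
```

The exact list depends on your Prism version, but it should include entries such as instance_variable_write_node and instance_variable_or_write_node, and none of the local, global, or constant writes.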

Let's wire up this scanner now:

Ruby
class StructToData < Prism::Visitor
  #...
  def visit_call_node(node)
    if struct_new?(node) && !named_struct?(node) && !mutates_instance_state?(node.block)
      # build fix
    end

    super
  end

  # ...

  def mutates_instance_state?(block_node)
    return false if block_node.nil?

    scanner = MutationScanner.new
    scanner.visit(block_node)
    scanner.mutates?
  end
end

That's it! Now our rewriter is ready to dance. Try it out with some of your structs.

Wrapping Up

Prism has already reshaped the Ruby landscape by making our tools faster, more portable, and more consistent. But its real impact will come from what you build with it.

Think bigger than just parsing: a Ruby-to-JS transpiler, a test runner that knows exactly which test to run from a file and line number, or even something that turns your code into pixel art. The parser is no longer the bottleneck; your imagination is.

Go make something amazing!

Guest author Matheus is a Brazilian software developer and a member of the Rails Issues team. He is passionate about open source and enjoys creating games and designing interpreters to explore new ideas and push his creativity.