An Introduction to Solid Queue for Ruby on Rails

Hans-Jörg Schnedlitz

One of the most exciting additions to Rails 8 is undoubtedly Solid Queue, a new library for processing background jobs.

You might not think it's that big of a deal. After all, there are plenty of other queuing systems out there. If you work with Rails, you'll likely know about Sidekiq and Resque — both are exceptionally performant and reliable. There are also GoodJob and the venerable DelayedJob. With all those options available, do we really need another queuing system?

Let's find out together. In this two-part series, we'll dig deep into Solid Queue's internals, discover what makes it unique, and learn more about why it was created in the first place.

Why Solid Queue for Ruby on Rails?

Since Rails 7, the team at 37Signals has been on a quest to reduce the operational overhead needed to launch a new Rails application. As part of this, they made SQLite the new default database for Rails apps, even in production. Furthermore, they started an effort to eliminate additional infrastructure dependencies to take full advantage of this new default.

37Signals had used Resque until then, and Resque requires Redis to function. So does Sidekiq, for that matter. To get rid of Redis, they had to create a queuing system that relies only on your database — and that queuing system turned out to be Solid Queue.

So that's its main selling point: No additional dependencies; just use your database. Very nice! However, as with any queuing system — and especially one that is the new Rails default — Solid Queue needs to satisfy some stringent requirements.

It must provide all the features Rails developers are used to from other background job systems. As the Rails default, it must support all databases that Rails works with. Obviously, it needs to satisfy standard safety requirements — as in, it must never, ever lose jobs. Last but not least, it must be fast enough to be a viable option for large production systems.

That's quite a tall order! So, how does Solid Queue address all those requirements?

Solid Queue From The Top

There are many details to consider, but let's start with a high-level architectural overview. You need to be aware of two significant components: Jobs and Workers.

Job is an ActiveRecord model, and the part the user interacts with. Note that this isn't necessarily true for other ActiveJob backends; it's just how Solid Queue implements background jobs. When you need to create a new background job, this is the class you ultimately inherit from. Job also defines methods that enable you to enqueue work, such as Job.perform_later.

Ruby
# app/jobs/my_job.rb
class MyJob < ApplicationJob
  queue_as :default

  def perform
    # Do something later
  end
end

Workers, as the name suggests, are the elements that perform the actual work. They are generally not created directly by the programmer but spawned automatically based on how you configure your application. For example, to have your application spawn two workers, one listening to all queues and the other to two specific ones, you'd use the following configuration file:

YAML
# config/queue.yml
production:
  workers:
    - queues: "*"
    - queues: [default, critical]

Workers are spawned as processes, running in the background, waiting for jobs to be assigned to them. As you may have guessed, your database is the missing link between jobs and workers. Whenever Solid Queue does anything, one database table or another is involved. Solid Queue does a lot of things, so a lot of tables are needed.

Ruby
# lib/generators/solid_queue/install/templates/db/queue_schema.rb
ActiveRecord::Schema[7.1].define(version: 1) do
  create_table "solid_queue_jobs", force: :cascade do |t|
    # ...
  end

  create_table "solid_queue_ready_executions", force: :cascade do |t|
    # ...
  end

  create_table "solid_queue_scheduled_executions", force: :cascade do |t|
    # ...
  end

  create_table "solid_queue_claimed_executions", force: :cascade do |t|
    # ...
  end

  create_table "solid_queue_blocked_executions", force: :cascade do |t|
    # ...
  end

  create_table "solid_queue_failed_executions", force: :cascade do |t|
    # ...
  end

  # Lots more tables below...
end

The Life and Death of a Solid Queue Job

To understand what all those tables do and how they relate to the various features of Solid Queue, let's look at the life cycle of a job. When a user enqueues a job to be executed later — let's say MyJob — a record is created in the solid_queue_jobs table. The record contains all the data required to execute the job — arguments, its name, the queue it is put in, and so forth. If the job is enqueued to run as soon as possible (rather than scheduled to run at some later point in time), an additional record is written to solid_queue_ready_executions.

For example, running MyJob.perform_later results in the following SQL:

SQL
INSERT INTO "solid_queue_jobs"
  ("queue_name", "class_name", "arguments", "priority", "active_job_id",
   "scheduled_at", "finished_at", "concurrency_key", "created_at", "updated_at")
VALUES
  ('default', 'MyJob', '{"job_class": "MyJob", ...}', 0, '...',
   '2024-12-01 14:00:00', NULL, NULL, '2024-12-01 14:00:00', '2024-12-01 14:00:00')
RETURNING "id";

INSERT INTO "solid_queue_ready_executions"
  ("job_id", "queue_name", "priority", "created_at")
VALUES
  (1, 'default', 0, '2024-12-01 14:00:00')
RETURNING "id";

Your workers poll this table for new records. A worker process that finds a new record will first claim it by writing another record to the solid_queue_claimed_executions table — we'll learn why that is necessary later. Only then will the worker actually execute the job. Below is some heavily edited code to illustrate what is happening (much more is happening in the actual code). If you are curious about the nitty-gritty details, I highly recommend you check out the original source code.

Ruby
class Worker
  def run
    loop do
      break if shutting_down?

      unless poll > 0
        # The polling interval is configurable and defaults to 0.1 seconds.
        sleep(polling_interval)
      end
    end
  end

  def poll
    # Claim jobs, then execute the claimed jobs.
    claim_executions.then do |executions|
      executions.each do |execution|
        # Actually execute the job
      end
    end
  end

  def claim_executions
    # Query the ready executions table and claim a job for execution.
    with_polling_volume do
      SolidQueue::ReadyExecution.claim
    end
  end
end

Once a worker finishes a job, it removes the corresponding records from the solid_queue_jobs, solid_queue_ready_executions, and solid_queue_claimed_executions tables. That's all there is to it — just polling some tables, creating and removing records. Not so tricky, right? It wouldn't be, were it not for some critical non-functional requirements.
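The whole enqueue, claim, and finish cycle can be sketched as a toy, in-memory model in plain Ruby. Every name below is illustrative; this is not Solid Queue's actual code, just the shape of the record bookkeeping described above:

Ruby
# A toy, in-memory model of the job life cycle. The "tables" are plain
# hashes and arrays whose names mirror the real Solid Queue tables.
class ToyQueue
  attr_reader :jobs, :ready_executions, :claimed_executions

  def initialize
    @jobs = {}                # id => job attributes
    @ready_executions = []    # job ids that can run right now
    @claimed_executions = {}  # job id => claiming process id
    @next_id = 0
  end

  # Enqueue: create a job record plus a ready execution.
  def perform_later(class_name)
    id = (@next_id += 1)
    @jobs[id] = { class_name: class_name, queue_name: "default" }
    @ready_executions << id
    id
  end

  # Claim: remove the execution from ready and record who claimed it.
  def claim(process_id)
    id = @ready_executions.shift
    @claimed_executions[id] = process_id if id
    id
  end

  # Finish: delete every record that belongs to the job.
  def finish(id)
    @claimed_executions.delete(id)
    @jobs.delete(id)
  end
end

queue = ToyQueue.new
id = queue.perform_later("MyJob")
queue.claim("worker-1")
queue.finish(id)

After finish, no trace of the job remains in any of the three "tables", which mirrors the cleanup step above.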

On Performance

To achieve production-ready performance, Solid Queue uses ingenious database design. You may have wondered why workers poll solid_queue_ready_executions rather than solid_queue_jobs. The additional table seems redundant at first glance.

Consider that solid_queue_jobs may contain thousands or millions of records, and querying that pile of data takes time. In comparison, solid_queue_ready_executions is tiny, as it only contains records for jobs that must be executed right now! That leads to some serious speedup.

The introduction of additional tables also simplifies queries. Workers only use two different queries for polling. They either poll all queues or specific ones. That, in turn, allows for some nice covering indices.

SQL
SELECT job_id
FROM solid_queue_ready_executions
WHERE queue_name = 'default'
ORDER BY priority ASC, job_id ASC
LIMIT 4
FOR UPDATE SKIP LOCKED;
Ruby
# Indices for polling solid_queue_ready_executions
create_table "solid_queue_ready_executions", force: :cascade do |t|
  t.index [ "priority", "job_id" ], name: "index_solid_queue_poll_all"
  t.index [ "queue_name", "priority", "job_id" ], name: "index_solid_queue_poll_by_queue"
end

All that still wouldn't be enough to achieve truly outstanding performance. Traditionally, queuing systems that rely on polling tables have had a significant problem. One worker would block all others while querying and updating the polling table.

Let's take a look at why. Consider the following query:

SQL
SELECT id
FROM jobs
WHERE queue = 'default' AND claimed = 0
ORDER BY priority, id
LIMIT 2
FOR UPDATE;

The FOR UPDATE clause locks the rows selected by the query. This is necessary to avoid nasty race conditions, such as multiple workers grabbing the same job. But it also means that any other worker running the same locking query has to wait until those locks are released. The polling table becomes a bottleneck that hinders rapid job execution.

Luckily, modern databases (PostgreSQL >= 9.5, MySQL >= 8.0) solve this problem. With SKIP LOCKED, a query simply skips over any rows that are already locked instead of waiting for them. The rest of the table remains unlocked and free to be polled concurrently.
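The difference between blocking and skipping can be illustrated with a small Ruby sketch. This is a toy model, not database code: each "row" carries its own lock, and a SKIP LOCKED-style poll uses try_lock to pass over rows it cannot lock instead of waiting on them.

Ruby
# A toy model of FOR UPDATE SKIP LOCKED. Polling claims up to `limit`
# rows, skipping (rather than waiting on) any row whose lock is
# already held.
Row = Struct.new(:job_id, :lock)

def poll_skip_locked(rows, limit)
  claimed = []
  rows.each do |row|
    break if claimed.size == limit
    # try_lock returns false immediately if the lock is taken: skip.
    claimed << row if row.lock.try_lock
  end
  claimed
end

rows = (1..4).map { |id| Row.new(id, Mutex.new) }

first  = poll_skip_locked(rows, 2) # locks rows 1 and 2
second = poll_skip_locked(rows, 2) # skips 1 and 2, locks 3 and 4

Neither poll ever waits on a lock held by the other, which is exactly why SKIP LOCKED removes the polling bottleneck.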

SQLite does not support SKIP LOCKED, so worker processes must queue up. In most cases, this shouldn't be an issue: SQLite writes are fast, since the database is just a local file and there is no network round-trip. Even so, this is a limitation that you should be aware of.

Whether you're using SQLite or another database, AppSignal provides Solid Queue performance monitoring out of the box! We'll talk more about this in part two of this series.

Safety First

We've spent some time discussing solid_queue_ready_executions, but another table is instrumental for ensuring that Solid Queue functions reliably. A key requirement of any queuing system is that any job being enqueued is executed at least once. In other words, jobs must never be lost — we already alluded to this in the introduction.

Without additional safety measures, this could quickly happen. Imagine that a worker starts working on a job and, in doing so, updates the corresponding job record to claim it. Of course, this is necessary to avoid multiple workers running a job simultaneously.

Imagine that suddenly, this worker process dies without finishing execution. Your machine might crash, and the OS may kill the worker for consuming too much memory — accidents happen, you know. The job it claimed will remain stuck forever because no other workers can grab it. Thus, it will never be executed, and your users will be sad and angry. The end.

That is, unless we add additional safety measures. Solid Queue solves this problem by introducing yet more tables — solid_queue_claimed_executions and solid_queue_processes.

Ruby
ActiveRecord::Schema[7.1].define(version: 1) do
  create_table "solid_queue_claimed_executions", force: :cascade do |t|
    t.bigint "job_id", null: false
    t.bigint "process_id"
    # ...
  end

  create_table "solid_queue_processes", force: :cascade do |t|
    t.datetime "last_heartbeat_at", null: false
    t.integer "pid", null: false
    # ...
  end

  # ...
end

We've already mentioned solid_queue_claimed_executions. Let's look at what happens when a worker claims a job. The worker removes the corresponding record from solid_queue_ready_executions and creates a record in solid_queue_claimed_executions. This record contains the job_id of the job being claimed and the id of the worker process that makes the claim.

So, what is the solid_queue_processes table good for? Every worker process creates a record in this table and periodically updates it by setting last_heartbeat_at. Of course, that alone wouldn't solve our problem.

We need another process to keep track of running processes: the so-called supervisor. This process runs in the background and periodically checks solid_queue_processes. A record with a last_heartbeat_at older than a threshold — which defaults to 5 minutes — indicates that the corresponding worker has met a tragic fate.

If such a record is found, the supervisor jumps into action. First, it removes the record from solid_queue_processes. Then, it marks any jobs previously claimed by the now-deceased worker as up-for-grabs. Thus, other workers can claim them, avoiding the stuck-job situation.
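The supervisor's pruning step can be sketched in a few lines of plain Ruby. Again, this is a toy model under assumed names, not the real implementation: processes with a stale heartbeat are removed, and the jobs they claimed are moved back to the ready list.

Ruby
# A toy model of the supervisor's pruning step. `processes` maps a
# process id to its last heartbeat; `claimed` maps a job id to the
# process that claimed it; `ready` holds job ids free to be picked up.
THRESHOLD = 5 * 60 # five minutes, the default alive threshold

def prune_dead_processes!(processes, claimed, ready, now: Time.now)
  # Find processes whose heartbeat is older than the threshold.
  dead = processes.select { |_pid, beat| now - beat > THRESHOLD }.keys

  dead.each do |pid|
    processes.delete(pid) # remove the stale process record
    # Release the dead process's claimed jobs so others can grab them.
    claimed.select { |_job, owner| owner == pid }.each_key do |job_id|
      claimed.delete(job_id)
      ready << job_id
    end
  end
end

now = Time.now
processes = { "worker-1" => now - 600, "worker-2" => now } # worker-1 is stale
claimed   = { 1 => "worker-1", 2 => "worker-2" }
ready     = []

prune_dead_processes!(processes, claimed, ready, now: now)

After pruning, worker-1 is gone from the registry and its job is ready to be claimed again, while worker-2 and its job are untouched.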

More to Discover in Solid Queue

In this post, we covered a fair bit of Solid Queue's internals. We looked at its high-level architecture and how its most essential feature — enqueuing and executing a job — works under the hood. We also learned about the critical role FOR UPDATE SKIP LOCKED plays in performance. Finally, we learned how the supervisor process helps avoid stuck jobs.

But there is more to discover. Solid Queue offers many more features we haven't touched on, such as scheduling recurring and sequential jobs. Stay tuned as we continue our deep dive in part two of this series.

Our guest author Hans is a Rails engineer from Vienna, Austria. He spends most of his time coding or reading about coding, and sometimes even writes about it on his blog! When he's not sitting in front of a screen, you'll probably find him outside, climbing some mountain.
