If you've used Ruby before, background job queues are a tool you might reflexively reach for when building applications in Elixir. While such solutions exist, you may find that they aren't used that much in Elixir, which might leave you wondering why. I mean, don't Elixir applications perform asynchronous jobs? Yes! But it's thought about a bit differently.
Today we'll look at the different solutions Elixir has for performing background work, and when you would use which. Along the way, we'll learn how Elixir's native constructs make background work a beautiful matter of spawning a process.
Some Background: Background Job Queues
Almost all production Rails applications use a background job framework such as Sidekiq. This is used to do a few things:
- Defer expensive work so that users aren't kept waiting, e.g. when setting up an account with data fetched from an external source
- Carry out ongoing background work that wasn't initiated by a user, e.g. fetching the current value of a cryptocurrency
- Speed up response times by moving nonessential work (e.g. sending a welcome email) out of the user request cycle
If we were to do this in Elixir, how would we handle it in a way that made use of Elixir's strengths?
Elixir has several tools for performing background work. The main ones are Supervisors, GenServers, and Tasks. Which one you choose depends on your application and use case. Let's dive into when you would use each one!
Deferring Work
In the first example, let's say we have a system that connects to two external data sources to complete a user's account setup: we use a user-supplied address to find the order fulfillment center closest to their location, and then we send the user-supplied name and birthday to a separate CRM. We then store fulfillment_center_id and crm_id keys on the user in our main database so that we can fetch those external records at some later time.
The key aspects of this operation are that 1) the two pieces of work can be done independently of each other and 2) we want the work to retry in case it fails. We also aren't concerned about the return values for these tasks—they write their side effects to the database outside the initial thread of operations. For this, we would use supervised Tasks.
We define a Task.Supervisor that starts when our application launches. Here, we name it YourApp.AccountSetupSupervisor, which gives it a nice semantic meaning for our use case. Your system can have many Task.Supervisors for managing different types of tasks.
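As a rough sketch, starting that supervisor from the application's supervision tree might look like the following. The top-level name YourApp.Supervisor and the :one_for_one strategy are assumptions for illustration, not something the setup above requires.

```elixir
defmodule YourApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # A named Task.Supervisor dedicated to account-setup tasks.
      {Task.Supervisor, name: YourApp.AccountSetupSupervisor}
    ]

    # YourApp.Supervisor is an assumed name for the top-level supervisor.
    Supervisor.start_link(children, strategy: :one_for_one, name: YourApp.Supervisor)
  end
end
```

Once the application has booted, any process can hand tasks to the supervisor by referring to its registered name.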
Here's our task supervisor with the key function set_up_user_account/1. It's called by either a controller or an account context to complete the account setup once the basic user data has been saved to the database. The function spawns two tasks, with each task executing the specified function for connecting to our CRM and our Fulfillment Service.
The restart: :transient option tells the supervisor to restart the task if it exits abnormally, for example, when a connection fails and the process crashes. By default, the supervisor will try restarting the process up to 3 times in 5 seconds before giving up. If the operation is successful, the process exits normally and everything goes on as usual.
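A minimal sketch of set_up_user_account/1 along those lines, assuming a Task.Supervisor registered as YourApp.AccountSetupSupervisor is already running. The tasks are started in module-function-arguments form via Task.Supervisor.start_child/5; returning :ok is a choice made for this sketch.

```elixir
defmodule YourApp.AccountSetupSupervisor do
  alias YourApp.{CRM, Fulfillment}

  # Spawns two independent, supervised tasks and returns right away.
  # __MODULE__ doubles as the registered name of the running Task.Supervisor.
  def set_up_user_account(user) do
    # :transient restarts a task only when it exits abnormally.
    opts = [restart: :transient]

    {:ok, _pid} =
      Task.Supervisor.start_child(__MODULE__, CRM, :create_user, [user], opts)

    {:ok, _pid} =
      Task.Supervisor.start_child(__MODULE__, Fulfillment, :set_nearest_location, [user], opts)

    :ok
  end
end
```

Because neither call waits for its task to finish, the caller gets control back as soon as both processes have been spawned.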
This code expects us to have a module YourApp.CRM with a function create_user/1 and a module YourApp.Fulfillment with a function set_nearest_location/1. Here's what one of those modules might look like:
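The sketch below shows one possible shape for YourApp.CRM. The helpers request_crm_record/1 and store_crm_id/2 are hypothetical stand-ins for a real HTTP call and a real database update; they exist only so the example runs on its own.

```elixir
defmodule YourApp.CRM do
  # Called by the supervised task. Raising on failure crashes the task,
  # which is what gives a :transient restart policy something to retry.
  def create_user(user) do
    case request_crm_record(user) do
      {:ok, crm_id} -> store_crm_id(user, crm_id)
      {:error, reason} -> raise "CRM sync failed: #{inspect(reason)}"
    end
  end

  # Hypothetical stand-in for an HTTP request to the CRM.
  defp request_crm_record(%{name: name, birthday: _birthday}) when is_binary(name) do
    {:ok, "crm-" <> Integer.to_string(:erlang.phash2(name))}
  end

  defp request_crm_record(_user), do: {:error, :invalid_user}

  # Hypothetical stand-in for a database write; here it only updates the map.
  defp store_crm_id(user, crm_id), do: Map.put(user, :crm_id, crm_id)
end
```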
YourApp.Fulfillment would look similar. Structuring our code this way makes for clean interfaces, single responsibilities, and reusable components.
Now the user won't experience any lags while the system is hard at work. The controller immediately returns to the user, who can see a helpful screen while the system continues doing its thing in the background. Sweet as stroopwafels!
Ongoing Background Work
Okay, what about ongoing background work which happens regularly and is not initiated by a user? I would use an ordinary GenServer for this. Two things that make GenServers great are that you don't have to serialize their arguments like you do with Sidekiq, and thanks to the BEAM's fair scheduling, the resource-intensive processes will not be a huge drain on your user-facing responsiveness. Plus, you can easily monitor what your GenServer is doing with the help of the observer by dropping some useful statistics into its state.
Let's dive into the cryptocurrency account value example. Let's say our frontend is wired up to update the DOM in real-time via Phoenix LiveView. We need to ping our node every second for the latest data, then use that data to refresh what the user sees. For such recurring tasks, I use a technique where I start up a GenServer during application startup. The GenServer requests the BEAM to send it a work message after 1000ms and then goes to sleep. After the interval elapses, the BEAM sends the message to the GenServer, which wakes up, does the work, schedules another message to be sent in 1000ms, and goes back to sleep.
The key piece is that sleeping processes don't affect system performance, so you can have many of these schedulers all sleeping at the same time, perhaps one for each user.
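A minimal sketch of this self-scheduling pattern, built on Process.send_after/3. The module name YourApp.PriceTicker, the :interval and :fetch options, and the placeholder fetch_latest_value/0 are all assumptions for illustration; a real version would query your node and push the result to LiveView.

```elixir
defmodule YourApp.PriceTicker do
  use GenServer

  @default_interval 1_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts) do
    state = %{
      interval: Keyword.get(opts, :interval, @default_interval),
      fetch: Keyword.get(opts, :fetch, &fetch_latest_value/0),
      # Statistics kept in state so they can be inspected at runtime.
      latest: nil,
      runs: 0
    }

    schedule_work(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:work, state) do
    # Do the work, then ask for the next :work message and go back to sleep.
    latest = state.fetch.()
    schedule_work(state.interval)
    {:noreply, %{state | latest: latest, runs: state.runs + 1}}
  end

  defp schedule_work(interval), do: Process.send_after(self(), :work, interval)

  # Hypothetical placeholder for the real call to an external data source.
  defp fetch_latest_value, do: :rand.uniform(50_000)
end
```

Between messages the process does nothing at all, which is why keeping many of these around is cheap.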
Here, our GenServer handles both the scheduling and the work itself, which can produce some drift in our timing. This may be fine for your use case, but, if the work must happen on timed intervals, have the scheduling GenServer create a new process for doing the work, leaving it only responsible for setting the timer.
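One way to split the two responsibilities, as a sketch: the GenServer only re-arms its timer and hands the work to a separate process started with Task.start/1. The module name YourApp.Scheduler, the :tick message, and the :interval and :work options are assumptions made for this example.

```elixir
defmodule YourApp.Scheduler do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      interval: Keyword.fetch!(opts, :interval),
      # A zero-arity function holding the actual work.
      work: Keyword.fetch!(opts, :work)
    }

    Process.send_after(self(), :tick, state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    # Re-arm the timer first so slow work cannot delay the next tick.
    Process.send_after(self(), :tick, state.interval)

    # Run the work in its own unlinked process; a crash there
    # does not take the scheduler down with it.
    {:ok, _pid} = Task.start(state.work)
    {:noreply, state}
  end
end
```

The trade-off is that slow work can now overlap with the next run, so the work function has to tolerate running concurrently with itself.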
Note that if you spin up a GenServer for each user, you can end up hammering the external resource with requests for the latest data. Remember, GenServers are all working concurrently! Scheduling and concurrency are easy in Elixir—perhaps too easy. As the programmer, you must still understand how your system works and whether concurrency is right for you.
Moving Nonessential Work
Now onto the third scenario where we want to speed up response times to the user by pulling out nonessential work. The key word is "nonessential", which includes operations that are one-offs or fire-and-forget. If a welcome email doesn't get sent to the user or a single data point is dropped, that's okay. You can simply wrap your code in a Task.start block and that's it!
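For instance, a fire-and-forget welcome email could be sketched like this. YourApp.Accounts, register/1, and send_welcome_email/1 are hypothetical names, and the email itself is replaced by a line printed to the console.

```elixir
defmodule YourApp.Accounts do
  def register(params) do
    # Stand-in for saving the user to the database.
    user = Map.put(params, :id, System.unique_integer([:positive]))

    # Fire and forget: the task is neither linked nor awaited,
    # so a failure here never reaches the caller.
    {:ok, _pid} = Task.start(fn -> send_welcome_email(user) end)

    {:ok, user}
  end

  # Hypothetical placeholder for a real mailer call.
  defp send_welcome_email(user) do
    IO.puts("Welcome, #{user.name}!")
  end
end
```

register/1 returns as soon as the task process has been spawned, whether or not the email ever goes out.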
The disadvantage of this approach is that it doesn't give you the option to retry in case of failure. For critical work, start a supervisor and have it spawn a task that can be restarted.
Other Considerations and RabbitMQ
Don't take this to mean that you should never use a background job framework. There are situations where it makes sense to use one. For example, if your system reboots in the middle of a job, the work might be lost. In such cases, you need to have in place an independent system that survives application restarts. A message broker like RabbitMQ would be the way to go. But that's a topic for another day. We could (and might) write a whole blog post about it.
Also, be thoughtful when it comes to retrying jobs. Does it really need to be retried? Often, when failures rack up in our Sidekiq dashboard, we just clear them out. Important retries are usually done manually, or at least under close supervision by an engineer, instead of being blindly requeued.
Conclusion
A background job system has many moving pieces and thus many points of failure. It may be worth it, but we should not build one into our Elixir application until we have thoughtfully examined our use cases. Erlang's process model gives us a diverse toolkit for solving these sorts of problems and we should not make the mistake of writing Elixir code as though it's Ruby. It's not. The more we learn about processes, the better and more idiomatic our architectural decisions will be. We hope this was a fun step in that learning path.