
# Best Practices for Background Jobs in Elixir

Miguel Palhas

Erlang and Elixir are ready for asynchronous work right out of the box. Generally speaking, background job systems aren't needed as much as in other ecosystems, but they still have their place for particular use cases.

This post goes through a few best practices I often try to think of in advance when writing background jobs, so that I don't hit some of the pain points that have hurt me multiple times in the past.

If you've ever deployed a new task, only to find out that a bug caused it to go rogue and misbehave (e.g. sending way too many emails, way too quickly), you'll recognize the pain points I'm talking about.

## Flavours

Elixir already gives you the ability to schedule asynchronous work pretty easily. Something as simple as this already covers a lot:

```elixir
Task.async(fn ->
  # some heavy lifting
end)
```
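Task.async/1 returns a %Task{} struct; if you need the result back in the caller, pair it with Task.await/2. A small stand-alone illustration (the sum is just a stand-in for real work):

```elixir
# Kick off the work in a separate, linked process...
task = Task.async(fn -> Enum.sum(1..100) end)

# ...do other things, then block until the result arrives
result = Task.await(task)
# result == 5050
```

For true fire-and-forget work where you never await, Task.start/1 (or a supervised Task.Supervisor.start_child/2) is a better fit, since Task.async/1 links the task to the caller and expects an eventual await.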

You might need something a bit more powerful, either just for convenience (having some tooling & monitoring around that task), or because you need something like periodic jobs. Again, all this can be achieved with something like a GenServer:

```elixir
defmodule PeriodicJob do
  use GenServer

  @period 60_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  @impl true
  def init(:ok) do
    Process.send_after(self(), :poll, @period)

    {:ok, :state}
  end

  @impl true
  def handle_info(:poll, state) do
    # some heavy lifting

    Process.send_after(self(), :poll, @period)

    {:noreply, state}
  end
end
```

You can also use a job scheduling library such as Quantum. If you come from Ruby land and are used to libraries such as Sidekiq, you might be more familiar with something like this:

```elixir
#
# lib/my_app/scheduler.ex
#
defmodule MyApp.Scheduler do
  use Quantum.Scheduler, otp_app: :my_app
end

#
# config/config.exs
#
config :my_app, MyApp.Scheduler,
  jobs: [
    first: [
      # every hour
      schedule: "0 * * * *",
      task: {MyApp.ExampleJob, :run, []}
    ],
    second: [
      # every minute
      schedule: "* * * * *",
      task: {MyApp.AnotherExampleJob, :run, []}
    ]
  ]
```

Some may argue that since Erlang/OTP already provides the infrastructure for creating these processes, packages such as Quantum are unnecessary. However, the structure they provide can end up being more intuitive, especially if you're not that familiar with OTP yet, as is often the case for someone coming from Ruby or similar communities.

## How to Structure Background Jobs

Let's now get into a few tips that will help you keep your jobs ready to deal with potential future problems!

Most of them are preventive measures, precisely because these are background processes: they don't respond to an HTTP request and run without any intervention, so debugging can be hard if you don't take some precautions.

Let's consider a small example that sends confirmation emails to users that haven't received it yet:

```elixir
defmodule MyApp.ExampleJob do
  import Ecto.Query

  def run do
    get_users()
    |> Enum.each(fn user ->
      # send a single email to the user
    end)
  end

  defp get_users do
    MyApp.User
    |> where(confirmation_email_sent: false)
    |> MyApp.Repo.all()
  end
end
```

### 1. Put in a Kill Switch

This is one of those mistakes I'll never make again since it has hurt me so many times.

Let's say you've created a background job, tested, deployed, and configured it to run periodically and send some emails.

It hits production, and you soon notice that something's wrong. The same 100 people are being spammed with emails every minute. You messed up the get_next_batch/1 function, and it always returns the same batch of users. It's a developer's horror story. You need to fix it (or kill it) quickly, but all that time waiting for a new release to get online is physically painful.

So, avoid that:

```elixir
defmodule MyApp.ExampleJob do
  def run do
    if enabled?() do
      # ...
    end
  end

  defp enabled? do
    # check a Redis flag, or a database record, or anything really
  end
end
```

You can plug in some persistent system that allows you to quickly toggle the job on/off. A good suggestion would be to use a feature flag package, such as FunWithFlags.
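If pulling in a package feels like overkill at first, even a toggle stored in the application environment works as a starting point. Note that this sketch (the module and key names are made up for illustration) is per-node and resets on restart, unlike a Redis- or database-backed flag:

```elixir
defmodule MyApp.KillSwitch do
  # Hypothetical in-memory kill switch; per-node, resets on restart.
  @app :my_app
  @key :enabled_jobs

  # Jobs default to disabled until explicitly switched on.
  def enabled?(job) do
    @app
    |> Application.get_env(@key, %{})
    |> Map.get(job, false)
  end

  def set(job, value) when is_boolean(value) do
    jobs = Application.get_env(@app, @key, %{})
    Application.put_env(@app, @key, Map.put(jobs, job, value))
  end
end
```

From a remote console, you can then flip MyApp.KillSwitch.set(:example_job, false) the moment a job misbehaves, without waiting for a deploy.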

### 2. Always Batch Your Jobs by Default

It's easy to miss this one on a first draft, when you're just trying to get something online quickly. But batching matters if you're working on a very resource-intensive job, or simply if your list of records to process grows too quickly.

Doing User |> where(confirmation_email_sent: false) |> Repo.all() can be dangerous if there's potential for that to yield too many results. You may end up consuming too many resources for something that could be done in smaller batches, keeping your system a lot more stable:

```elixir
defmodule MyApp.ExampleJob do
  import Ecto.Query

  @batch_size 100

  defp get_users do
    MyApp.User
    |> where(confirmation_email_sent: false)
    |> limit(^@batch_size)
    |> MyApp.Repo.all()
  end
end
```

Whatever job queue mechanism you plug this worker into, it will end up being called frequently, so there's no rush: processing smaller batches one at a time still gets through the whole list.
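When the records are already in memory, Enum.chunk_every/2 gives you the same effect without extra queries. A small stand-alone illustration (the integer ids stand in for real user records):

```elixir
# 250 pretend user ids, processed 100 at a time
ids = Enum.to_list(1..250)

ids
|> Enum.chunk_every(100)
|> Enum.each(fn batch ->
  # stand-in for real work, e.g. sending this batch of emails
  length(batch)
end)
```

Enum.chunk_every(ids, 100) yields three batches here (100, 100, and 50 ids), so each pass stays bounded no matter how long the list grows.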

### 3. Avoid Overlaps

This is kind of related to the previous point, but it's a concern that goes beyond performance.

If you program a job to run every minute, and a single execution has the potential to last longer than that, you end up risking cascading performance problems, or even worse, race conditions, where the first and second executions are both trying to process the same set of data, and conflict with each other in the process.

This is obviously dependent on what your exact business logic is, but as a general rule, it's best to be defensive here.

If you use a GenServer approach like the one showcased above, this is solved automatically: instead of scheduling jobs every minute, you use Process.send_after(self(), :poll, delay) to schedule the next run only after the current one has finished, avoiding overlap.

When using Quantum, there's also an overlap: false option that you can add to a job to prevent this automatically:

```elixir
config :my_app, MyApp.Scheduler,
  jobs: [
    first: [
      # every hour
      schedule: "0 * * * *",
      task: {MyApp.ExampleJob, :run, []},
      overlap: false
    ]
  ]
```

This, by the way, might already be reason enough to consider using a package rather than just plain Elixir.
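That said, if you're scheduling with plain Elixir and want to stay dependency-free, a single-node guard built on :ets is enough to skip a run while another is in flight. This is a hypothetical sketch (module and table names are made up), not a distributed lock; for multi-node setups you'd want a shared store such as Redis or Postgres advisory locks:

```elixir
defmodule MyApp.LocalLock do
  # Hypothetical single-node lock; not safe across multiple nodes.
  @table :job_locks

  def run(name, fun) do
    ensure_table()

    # insert_new/2 is atomic: only one caller can claim the key
    case :ets.insert_new(@table, {name, self()}) do
      true ->
        try do
          {:ok, fun.()}
        after
          # always release, even if the job crashes
          :ets.delete(@table, name)
        end

      false ->
        # another run holds the lock; skip instead of overlapping
        :already_running
    end
  end

  defp ensure_table do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :public, :set])
    end
  end
end
```

Wrapping the job body in MyApp.LocalLock.run(:example_job, fn -> ... end) then returns :already_running instead of starting a second, overlapping pass.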

### 4. Plug in a Manual Mode

If your job is processing a batch of records, it's useful to plug in some public functions that allow you to manually process specific records. This can serve two purposes:

- Better ability to debug the job
- Ability to do a few manual runs before enabling the global job (by toggling the feature flag discussed above)

A sample structure could look like this:

```elixir
defmodule MyApp.ExampleJob do
  import MyApp.Lock

  alias MyApp.User

  def run do
    lock("example_job", fn ->
      get_users()
      |> Enum.each(&process_user/1)
    end)
  end

  def run_manually(users) when is_list(users) do
    lock("example_job", fn ->
      users
      |> Enum.each(&process_user/1)
    end)
  end

  def run_manually(user), do: run_manually([user])

  def process_user(%User{} = user) do
    # process a single user
  end
end
```

In this case, we're creating a public run_manually/1 function that receives either a single user or a batch of them and performs the same logic as the automatic job would.

One important detail here is to again avoid a race condition, which in this case, is being done with a custom Lock module that uses the redis_mutex package to prevent potential issues:

```elixir
defmodule MyApp.Lock do
  use RedisMutex
  require Logger

  def lock(lock_name, fun) do
    with_lock(lock_name, 60_000) do
      fun.()
    end
  rescue
    _e in RedisMutex.Error ->
      Logger.debug("#{lock_name}: another process already running")
  end
end
```

The lock, which is invoked both on manual runs as well as the regular background job, ensures that you won't cause any unintentional conflicts if you try to do a manual run at the same time the job is doing the same processing. It also happens to solve the overlap problem discussed previously in this post.

## Conclusion

All of these tips come from problems I bumped into in the past, usually production bugs or user complaints. I hope some of them help you avoid the same mistakes. Let me know if you have any further thoughts! 👋

P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!