Elixir v1.14 shipped earlier this month with a bunch of new goodies.
In this post, we'll explore Elixir's new PartitionSupervisor. We'll take a look at some code that suffers from the exact bottleneck that partition supervisors are designed to solve. Then, we'll fix that bottleneck. Along the way, you'll learn how partition supervisors work under the hood to prevent process bottlenecks.
Let's get started!
The Problem: Dynamic Supervisor Bottlenecks in Your Elixir App
You may be familiar with using dynamic supervisors to start child processes on-demand in your application. Telling the dynamic supervisor to start its children can become a bottleneck, however.
With just one dynamic supervisor, concurrent processes each have to wait their turn to ask it to start children. If starting a child takes a long time, or your app handles a very high volume of "start a child" requests, you've got a bottleneck. The single dynamic supervisor can only initialize one child process at a time.
Let's take a look at a simple example that artificially replicates such a bottleneck.
The Slow Worker and Its Dynamic Supervisor
In my sample app here, I have a worker, SlowGreeter.Worker. The worker module is a GenServer that initializes by sleeping for five seconds and then printing out a greeting:
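A minimal sketch of such a worker might look like the following. The greeting text and the choice to keep the name as the server's state are assumptions made for illustration:

```elixir
defmodule SlowGreeter.Worker do
  use GenServer

  # Starts the worker; `name` is whoever we want to greet.
  def start_link(name) do
    GenServer.start_link(__MODULE__, name)
  end

  @impl true
  def init(name) do
    # Simulate slow initialization: block for five seconds, then greet.
    # Whoever started this process is blocked until init/1 returns.
    Process.sleep(5_000)
    IO.puts("Hello, #{name}!")
    {:ok, name}
  end
end
```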
Disclaimer: This is meant to replicate a GenServer with some time-consuming work to do on initialization. Note that a "slow-starting" GenServer is something of a code smell. You should avoid doing time-consuming work in the init/1 function and leverage handle_continue/2 to prevent long-running tasks from blocking GenServer startup.
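For comparison, here is a rough sketch of the handle_continue/2 approach. The module name SlowGreeter.DeferredWorker is hypothetical and is not part of the sample app. The idea is that init/1 returns right away and the slow work runs afterwards inside the new process:

```elixir
defmodule SlowGreeter.DeferredWorker do
  use GenServer

  def start_link(name) do
    GenServer.start_link(__MODULE__, name)
  end

  @impl true
  def init(name) do
    # Return immediately so the caller (for example, a supervisor) is not
    # blocked, and ask for handle_continue/2 to run right after init/1.
    {:ok, name, {:continue, :greet}}
  end

  @impl true
  def handle_continue(:greet, name) do
    # The slow work now happens after start_link/1 has already returned.
    Process.sleep(5_000)
    IO.puts("Hello, #{name}!")
    {:noreply, name}
  end
end
```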
But sometimes doing such work when the GenServer starts up is unavoidable. And regardless, this setup helps us replicate a bottleneck scenario that can happen:
1. When your dynamic supervisor starts up a slow-to-initialize worker, or
2. When your app handles many concurrent requests telling the same dynamic supervisor to start up its children.
Okay, with that disclaimer out of the way, let's proceed with illustrating our bottleneck. We have our slow-to-initialize worker, and we also have a SlowGreeter.DynamicSupervisor module that starts it up:
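One way to write this module is sketched below. It assumes the SlowGreeter.Worker described above exposes start_link/1 taking a name, and that the supervisor is registered under its own module name:

```elixir
defmodule SlowGreeter.DynamicSupervisor do
  use DynamicSupervisor

  def start_link(init_arg) do
    # Register a single, named dynamic supervisor for the whole app.
    DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  # Asks the one named dynamic supervisor to start a greeter worker.
  # The call only returns once the worker's init/1 has finished.
  def start_greeter(name) do
    DynamicSupervisor.start_child(__MODULE__, {SlowGreeter.Worker, name})
  end

  @impl true
  def init(_init_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end
```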
The SlowGreeter.DynamicSupervisor starts when the application starts up. Then, we can tell it to start the greeter worker like this:
Now we're ready to demonstrate our bottleneck.
Demo: The Bottleneck
Let's spawn five processes, each of which will make a call to SlowGreeter.DynamicSupervisor.start_greeter/1. Since each of our spawned processes calls the same dynamic supervisor, we'll see that each process must wait its turn.
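A small helper for this demo could look like the sketch below. The SlowGreeter.Demo module and the list of names are hypothetical; the helper takes the "start a greeter" function as an argument so it can be pointed at any supervisor:

```elixir
defmodule SlowGreeter.Demo do
  # Hypothetical names to greet; any five values would do.
  @names ["Ada", "Grace", "Joe", "Robert", "Mike"]

  # Spawns one process per name. Each spawned process calls `start_fun`
  # with its name. Returns the list of spawned PIDs without waiting.
  def spawn_greeters(start_fun) when is_function(start_fun, 1) do
    for name <- @names do
      spawn(fn -> start_fun.(name) end)
    end
  end
end

# In an IEx session with the application running, you could then call:
# SlowGreeter.Demo.spawn_greeters(&SlowGreeter.DynamicSupervisor.start_greeter/1)
```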
When the SlowGreeter.DynamicSupervisor starts up its child SlowGreeter.Worker, it must wait for the worker to initialize. The worker initialization is slow: we told it to sleep for five seconds and then print out a greeting. We'll see that each spawned process's call to the dynamic supervisor is only processed when the preceding one finishes. As a result, the greetings print out at five-second intervals, as you can see in this video:
This approach is bottlenecked by the dynamic supervisor itself. It can only process one "start up child" request at a time, and must wait until the initialization of a child worker is complete before moving on to the next one.
We can remove this bottleneck with the help of Elixir 1.14's new PartitionSupervisor. Let's implement it now. Then, we'll take a deeper dive into how it works.
The Fix: Supervise a Fleet of Dynamic Supervisors with a Partition Supervisor
Instead of initializing just one dynamic supervisor when our application starts up, we'll use the new PartitionSupervisor to spin up a set of dynamic supervisors.
Then, instead of making direct calls to the dynamic supervisor to start its child worker, we will go through the partition supervisor, which will route the request to one of the many dynamic supervisors it's overseeing. Let's take a look at the code.
Using PartitionSupervisor in the Dynamic Supervisor Module
We'll implement a new module, SlowGreeter.DynamicSupervisorWithPartition, which uses the DynamicSupervisor behaviour. This time, however, the start_greeter/1 function will call on the dynamic supervisor to start its child via the PartitionSupervisor.
Here's where the magic happens:
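The module could be sketched roughly as follows. Two assumptions are baked in: the partition supervisor is registered under the name SlowGreeter.DynamicSupervisorWithPartition, and the individual dynamic supervisors are started without a :name of their own, since several of them run at once:

```elixir
defmodule SlowGreeter.DynamicSupervisorWithPartition do
  use DynamicSupervisor

  def start_link(init_arg) do
    # No :name here. The partition supervisor starts one of these per
    # partition, so they cannot all register under the same name.
    DynamicSupervisor.start_link(__MODULE__, init_arg)
  end

  def start_greeter(name) do
    DynamicSupervisor.start_child(
      # Route through the partition supervisor (assumed to be registered
      # as __MODULE__), using the caller's PID as the routing key.
      {:via, PartitionSupervisor, {__MODULE__, self()}},
      {SlowGreeter.Worker, name}
    )
  end

  @impl true
  def init(_init_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end
```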
Instead of calling on the SlowGreeter.DynamicSupervisorWithPartition directly, we call it through the partition supervisor using the {:via, PartitionSupervisor, {dynamic_supervisor_name, routing_key}} tuple.
This will route the "start child request" to one of the running SlowGreeter.DynamicSupervisorWithPartition processes supervised by the partition supervisor that starts up with our application. Let's configure our app to start such a supervisor now:
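A sketch of what that configuration might look like is shown below. The SlowGreeter.Application module name, the top-level supervisor name, and the choice to register the partition supervisor under SlowGreeter.DynamicSupervisorWithPartition are assumptions:

```elixir
defmodule SlowGreeter.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # :child_spec is the process to start on every partition.
      # :name is how we later address the partition supervisor in the
      # {:via, PartitionSupervisor, {name, key}} tuple.
      {PartitionSupervisor,
       child_spec: SlowGreeter.DynamicSupervisorWithPartition,
       name: SlowGreeter.DynamicSupervisorWithPartition}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: SlowGreeter.Supervisor)
  end
end
```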
With this, when the app starts, a partition supervisor will start up and begin supervising a set of SlowGreeter.DynamicSupervisorWithPartition dynamic supervisors. The default behavior is to create a partition for each available scheduler (usually one per core of your machine). It will then place a dynamic supervisor on each partition.
With our code in place, let's see a demo of our bottleneck fix.
Demo: No More Bottleneck
Once again, we'll spawn five processes. Each process will call on the SlowGreeter.DynamicSupervisorWithPartition.start_greeter/1 function, like this:
This function goes through the partition supervisor to route a "start child worker" request to dynamic supervisors across partitions. This means we don't have to wait for a single dynamic supervisor to finish initializing a child worker before moving on to process the next request from a spawned process.
Instead, the partition supervisor will route these requests so that they're processed more or less concurrently by the dynamic supervisors spread across partitions. So, we'll see that all five greetings are printed out simultaneously after just one five-second sleep. You can see this behavior in the video below:
With that, we've fixed our bottleneck! Remember that our worker's "sleep for five seconds on init" behavior is artificial, and you'll want to avoid lengthy worker startups in general. But, this example serves to illustrate the kind of bottleneck that can occur in your application if you deal with lots of concurrent requests to the same dynamic supervisor.
Elixir 1.14's new partition supervisor neatly solves this problem by partitioning your dynamic supervisors and distributing requests to them, so that they can handle a greater load.
Where you don't need to share state between child processes supervised by dynamic supervisors, this approach can help your dynamic supervisors handle scale and avoid bottlenecks.
Now that we've seen partition supervisors in action, let's take a closer look at how they work.
PartitionSupervisor in Elixir 1.14: Under the Hood
Keep reading for a deeper dive into how partition supervisors establish partitions and route requests to dynamic supervisors.
Partitioning Dynamic Supervisors
As we saw earlier, you can tell your app to start a partition supervisor, overseeing some dynamic supervisors, in your application.ex's start/2 function like this:
Here, we start a partition supervisor that starts a SlowGreeter.DynamicSupervisorWithPartition for each online scheduler, which is usually one per core of our machine. This is the default start_link behavior.
PartitionSupervisor.start_link/1 supports options that include a :partitions key. The :partitions key should be set to a positive integer representing the number of partitions and defaults to System.schedulers_online(), which returns the number of schedulers currently online (typically one per core on your machine).
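To make the :partitions option concrete, here is a small sketch that asks for exactly four partitions instead of the default. It uses the built-in DynamicSupervisor as the child spec, and the name SlowGreeter.FourPartitions is made up for this example:

```elixir
children = [
  {PartitionSupervisor,
   child_spec: DynamicSupervisor,
   name: SlowGreeter.FourPartitions,
   partitions: 4}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Ask the partition supervisor how many partitions it is running.
PartitionSupervisor.partitions(SlowGreeter.FourPartitions)
# => 4
```

Leaving out :partitions would give you System.schedulers_online() partitions instead.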
Under the hood, a partition supervisor is just a regular supervisor that generates a child spec for each partition and then stores the partition info in ETS or a Registry. Later, when you tell a dynamic supervisor to start its child via the partition supervisor, it selects a partition based on the routing algorithm and routes the call to a dynamic supervisor in that partition. So, it effectively spins up a set of dynamic supervisors when your app comes up, and then starts child processes across those dynamic supervisors.
Let's take a closer look at the routing behavior now.
Routing Requests to Dynamic Supervisors
When you're ready to start a dynamic supervisor's child through a partition supervisor, you do so like this:
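The :via tuple specifies the name under which the partition supervisor is registered, along with a routing key. The partition supervisor uses the key to pick which of its dynamic supervisors receives the request.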
In our example app, we used self() as a routing key, which will return the PID of the current process, but you can also provide an integer here. If key is an integer, it is routed using rem(abs(key), num_partitions). Otherwise, the routing is performed by calling :erlang.phash2(key, num_partitions). So, when you use the same PID as a routing key, you can be assured that your call will be directed to the same partition every time.
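The routing rule is easy to try out on its own. The RoutingSketch module below is a hypothetical helper that mirrors the rule as described here; it is not the library's internal code:

```elixir
defmodule RoutingSketch do
  # Integer keys: remainder of the absolute value.
  def partition_for(key, partitions) when is_integer(key) do
    rem(abs(key), partitions)
  end

  # Any other term (PIDs, strings, tuples, ...): hash into 0..partitions-1.
  def partition_for(key, partitions) do
    :erlang.phash2(key, partitions)
  end
end

RoutingSketch.partition_for(10, 4)
# => 2
RoutingSketch.partition_for(-7, 4)
# => 3

# The same PID always maps to the same partition, somewhere in 0..3.
RoutingSketch.partition_for(self(), 4) == RoutingSketch.partition_for(self(), 4)
# => true
```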
Wrap Up
We've seen how you can use partition supervisors to avoid bottlenecks in your dynamic supervisor code. Provided the state of your child processes can be partitioned (i.e., you don't need to share state among them), you can reach for a partition supervisor to elegantly scale up your dynamic supervisors.
In fact, partition supervisors can be used to partition and scale any process in your Elixir application. If a process can be started up, and have its state partitioned, then you can use a partition supervisor. Start the partition supervisor with its child spec (just like we did for our dynamic supervisor) and dispatch messages to the partition supervisor's children using the same :via tuple we demonstrated.
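As a rough illustration of partitioning something other than a dynamic supervisor, the sketch below runs two Agent-based counters behind a partition supervisor. The name SlowGreeter.Counters is hypothetical, and integer routing keys are used so the target partition is easy to predict:

```elixir
{:ok, _sup} =
  PartitionSupervisor.start_link(
    child_spec: {Agent, fn -> 0 end},
    name: SlowGreeter.Counters,
    partitions: 2
  )

# Integer keys route with rem(abs(key), 2), so keys 0 and 2 both land on
# partition 0, while key 1 lands on partition 1.
Agent.update({:via, PartitionSupervisor, {SlowGreeter.Counters, 0}}, &(&1 + 1))
Agent.update({:via, PartitionSupervisor, {SlowGreeter.Counters, 2}}, &(&1 + 1))

Agent.get({:via, PartitionSupervisor, {SlowGreeter.Counters, 0}}, & &1)
# => 2
Agent.get({:via, PartitionSupervisor, {SlowGreeter.Counters, 1}}, & &1)
# => 0
```

Each partition keeps its own count, which is exactly why this approach only fits state that doesn't need to be shared across partitions.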
This is just one of the many exciting new features of Elixir v1.14. Read about another new feature, debugging with dbg()!