Using Supervisors to Organize Your Elixir Application

In the previous chapter of this series, we looked at hot code reloading in Elixir and why we should use GenServer to implement long-running processes.

But to organize a whole application, we need one more building block — supervisors. Let's take a look at supervisors in detail.

Defining an OTP Application

According to the documentation:

In OTP, application denotes a component implementing some specific functionality, that can be started and stopped as a unit, and that can be reused in other systems. This module interacts with application controller, a process started at every Erlang runtime system.

This module contains functions for controlling applications (for example, starting and stopping applications), and functions to access information about applications (for example, configuration parameters).

In other words, an application is a kind of package that contains reusable modules and has a name, a version, specific dependencies, etc.
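To make "controlling" and "accessing information" a bit more concrete, here is a small sketch that uses a few functions from Elixir's Application module. It works against :logger, an application that ships with Elixir. The :interval key stored under :our_new_app is a made-up configuration parameter used only for this illustration.

```elixir
# Start an application (and its dependencies) as a unit.
{:ok, _started} = Application.ensure_all_started(:logger)

# List the applications that are currently running.
running = for {app, _description, _vsn} <- Application.started_applications(), do: app
IO.inspect(:logger in running, label: "logger running?")

# Read metadata from the application's specification.
IO.inspect(Application.spec(:logger, :vsn), label: "logger version")

# Store and read a configuration parameter.
# :interval is a hypothetical key, not something the article defines.
Application.put_env(:our_new_app, :interval, 100)
interval = Application.get_env(:our_new_app, :interval)
IO.inspect(interval, label: "interval")
```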

Mix creates a new application when we run:

Shell

mix new our_new_app

But there is one crucial difference that distinguishes OTP applications from packages in other languages. OTP applications can be started and stopped and have their own running entities.

You can usually create such applications in Elixir with the command:

Shell

mix new our_new_app --sup
cd our_new_app

This creates an additional file for us: lib/our_new_app/application.ex. It implements the so-called application behavior. Its primary purpose is to implement the start/2 function, which should start a supervision tree.
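The --sup flag also wires this module into the project definition. A generated mix.exs typically looks roughly like the sketch below. The version numbers are assumptions and depend on your Elixir installation; the part that matters here is the :mod entry.

```elixir
# mix.exs, roughly as generated by `mix new our_new_app --sup`
# (the version requirement below is an assumption and will vary)
defmodule OurNewApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :our_new_app,
      version: "0.1.0",
      elixir: "~> 1.11",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  # :mod names the application callback module and the argument
  # that is handed to its start/2 as the second parameter.
  def application do
    [
      extra_applications: [:logger],
      mod: {OurNewApp.Application, []}
    ]
  end

  defp deps, do: []
end
```

Without the :mod entry, the application is still a valid OTP application, but it has no processes of its own to start.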

What Are Supervision Trees in Elixir?

So what is a supervision tree? I use the following analogy: a running OTP system is like a whole OS with its own lightweight processes. They start, work, and terminate. As in a real OS, we need a tool that helps us:

Start the system in the correct order
Handle abnormal situations when a process dies due to some errors
Stop the system correctly.

In a real OS, we have systemd (on some Linux distributions) or launchd on macOS. In OTP, there are supervisors and the Supervisor module.

We can organize our processes in the following way using supervisors:

Leaf processes in this scheme are generally GenServer or similar processes.

As with systemd, if a process fails, we can choose to do nothing. Alternatively, we can restart the process until it completes normally, either on its own or together with its sibling processes.
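Whether siblings are restarted too is decided by the supervisor's :strategy option. The following is a minimal sketch, not part of our application: RestartDemo, its Worker, and the :demo_a and :demo_b names are all invented for this illustration. It kills one worker and reports which of the two workers got a new pid.

```elixir
defmodule RestartDemo do
  # Hypothetical worker that does nothing except stay alive under a name.
  defmodule Worker do
    use GenServer

    def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)

    @impl true
    def init(name), do: {:ok, name}
  end

  # Kills :demo_a and returns {a_was_restarted, b_was_restarted}.
  def run(strategy) do
    children = [
      Supervisor.child_spec({Worker, :demo_a}, id: :demo_a),
      Supervisor.child_spec({Worker, :demo_b}, id: :demo_b)
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    old_a = Process.whereis(:demo_a)
    old_b = Process.whereis(:demo_b)

    Process.exit(old_a, :kill)
    # Give the supervisor a moment to react.
    Process.sleep(100)

    result = {Process.whereis(:demo_a) != old_a, Process.whereis(:demo_b) != old_b}
    Supervisor.stop(sup)
    result
  end
end

IO.inspect(RestartDemo.run(:one_for_one), label: "one_for_one")
IO.inspect(RestartDemo.run(:one_for_all), label: "one_for_all")
```

With :one_for_one only the killed worker should come back with a new pid, while :one_for_all should replace both. The "do nothing" option is controlled per child instead, through the :restart key of its child spec (:permanent, :transient, or :temporary).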

Earlier, in Erlang, it was tricky to build supervision trees, but Elixir helps us a lot with this.

It's also worth noting that there is some good documentation available about supervisors.

Connecting GenServers to a Supervision Tree

Let's again look at what mix created for us in lib/our_new_app/application.ex:

Elixir

  def start(_type, _args) do
    children = [
      # Starts a worker by calling: OurNewApp.Worker.start_link(arg)
      # {OurNewApp.Worker, arg}
    ]
 
    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: OurNewApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

It starts a supervisor and clearly shows how to run our worker as a child.

Let's do that. Our sample process will increment a given number periodically and report its current value on demand.

First, create a GenServer in lib/our_new_app/counter.ex:

Elixir

defmodule OurNewApp.Counter do
  use GenServer
  require Logger
 
  @interval 100
 
  def start_link(start_from, opts \\ []) do
    GenServer.start_link(__MODULE__, start_from, opts)
  end
 
  def get(pid) do
    GenServer.call(pid, :get)
  end
 
  def init(start_from) do
    st = %{
      current: start_from,
      timer: :erlang.start_timer(@interval, self(), :tick)
    }
 
    {:ok, st}
  end
 
  def handle_call(:get, _from, st) do
    {:reply, st.current, st}
  end
 
  def handle_info({:timeout, _timer_ref, :tick}, st) do
    new_timer = :erlang.start_timer(@interval, self(), :tick)
    :erlang.cancel_timer(st.timer)
 
    {:noreply, %{st | current: st.current + 1, timer: new_timer}}
  end
end

This server increments a given number every 100ms and can report its state via OurNewApp.Counter.get/1:

Shell

iex -S mix
...
iex(1)> {:ok, pid} = OurNewApp.Counter.start_link(10000)
{:ok, #PID<0.182.0>}
iex(2)> OurNewApp.Counter.get(pid)
10136
iex(3)> OurNewApp.Counter.get(pid)
10146

Now let's integrate our server as a child. Update the start/2 function in lib/our_new_app/application.ex to the following:

Elixir

  def start(_type, _args) do
    children = [
      {OurNewApp.Counter, 10000}
    ]
 
    opts = [strategy: :one_for_one, name: OurNewApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

We see that our process starts automatically:

Shell

iex -S mix
...
iex(1)> [{_, pid, _, _}] = Supervisor.which_children(OurNewApp.Supervisor)
[{OurNewApp.Counter, #PID<0.141.0>, :worker, [OurNewApp.Counter]}]
iex(2)> OurNewApp.Counter.get(pid)
10119
iex(3)> Process.exit(pid, :shutdown)
true
iex(4)> Supervisor.which_children(OurNewApp.Supervisor)
[{OurNewApp.Counter, #PID<0.146.0>, :worker, [OurNewApp.Counter]}]

We queried the supervisor's children with Supervisor.which_children/1. We also see that our counter process restarted after we stopped it.

Our process tree now looks like this:

Adding GenServers to Custom Supervisors

Now let's make a special supervisor for our counter processes. Later, we'll see why we may want to do that. Our supervision tree will look like this:

Supervision Tree with Counters and their own supervisor

First, we should make a callback module for our new special supervisor. Let's add lib/our_new_app/counter_sup.ex with the following content:

Elixir

defmodule OurNewApp.CounterSup do
  use Supervisor
 
  def start_link(start_numbers) do
    Supervisor.start_link(__MODULE__, start_numbers, name: __MODULE__)
  end
 
  @impl true
  def init(start_numbers) do
    children =
      for start_number <- start_numbers do
        # We can't just use `{OurNewApp.Counter, start_number}`
        # because we need different ids for children
 
        Supervisor.child_spec({OurNewApp.Counter, start_number}, id: start_number)
      end
 
    Supervisor.init(children, strategy: :one_for_one)
  end
end

We must also update children for the main application supervisor in lib/our_new_app/application.ex:

Elixir

  def start(_type, _args) do
    children = [
      {OurNewApp.CounterSup, [10000, 20000]}
    ]
 
    opts = [strategy: :one_for_one, name: OurNewApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

Let's see what we get:

Shell

iex -S mix
...
iex(1)> Supervisor.which_children(OurNewApp.Supervisor)
[{OurNewApp.CounterSup, #PID<0.161.0>, :supervisor, [OurNewApp.CounterSup]}]
iex(2)> Supervisor.which_children(OurNewApp.CounterSup)
[
  {20000, #PID<0.163.0>, :worker, [OurNewApp.Counter]},
  {10000, #PID<0.162.0>, :worker, [OurNewApp.Counter]}
]

That's just what we need: OurNewApp.Supervisor has OurNewApp.CounterSup as its child and OurNewApp.CounterSup has two OurNewApp.Counter children.

Many developers consider custom supervisors tricky and avoid using them. So let's do some simple exercises to get more acquainted with them.

First, we'll add a third counter to our counter supervisor at runtime:

Shell

 
iex(3)> new_child_spec = Supervisor.child_spec({OurNewApp.Counter, 30000}, id: 30000)
%{id: 30000, start: {OurNewApp.Counter, :start_link, [30000]}}
iex(4)> Supervisor.start_child(OurNewApp.CounterSup, new_child_spec)
{:ok, #PID<0.169.0>}
iex(5)> Supervisor.which_children(OurNewApp.CounterSup)
[
  {30000, #PID<0.169.0>, :worker, [OurNewApp.Counter]},
  {20000, #PID<0.163.0>, :worker, [OurNewApp.Counter]},
  {10000, #PID<0.162.0>, :worker, [OurNewApp.Counter]}
]

That was easy! With Supervisor.delete_child/2, Supervisor.restart_child/2, etc., we can easily manipulate the supervisor's children.
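Here is a self-contained sketch of those calls that you can run as a script. It uses plain Agent processes as stand-ins for our counters, and the ids :a, :b, and :c are arbitrary. Note two details: ids must be unique within a supervisor, and a child has to be terminated before its spec can be deleted.

```elixir
children = for id <- [:a, :b, :c], do: Supervisor.child_spec({Agent, fn -> 0 end}, id: id)
{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Starting a second child with an id that is already in use is rejected.
{:error, {:already_started, _pid}} =
  Supervisor.start_child(sup, Supervisor.child_spec({Agent, fn -> 0 end}, id: :a))

# A running child cannot be deleted; it must be terminated first.
{:error, :running} = Supervisor.delete_child(sup, :c)
:ok = Supervisor.terminate_child(sup, :c)

# The spec is still kept, so the child can be started again by id...
{:ok, _new_pid} = Supervisor.restart_child(sup, :c)

# ...or removed for good once it has been terminated again.
:ok = Supervisor.terminate_child(sup, :c)
:ok = Supervisor.delete_child(sup, :c)

ids = for {id, _pid, _type, _modules} <- Supervisor.which_children(sup), do: id
IO.inspect(Enum.sort(ids))
```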

Secondly, instead of adding one worker to the existing tree, let's try adding a subtree with its own children (without a special module for the subtree supervisor):

Shell

iex(6)> children_specs = for n <- [10000, 20000, 30000], do: Supervisor.child_spec({OurNewApp.Counter, n}, id: n)
[
  %{id: 10000, start: {OurNewApp.Counter, :start_link, [10000]}},
  %{id: 20000, start: {OurNewApp.Counter, :start_link, [20000]}},
  %{id: 30000, start: {OurNewApp.Counter, :start_link, [30000]}}
]
iex(7)> hand_crafted_sup_spec = %{
...(7)>     id: :hand_crafted_sup,
...(7)>     start: {Supervisor, :start_link, [children_specs, [strategy: :one_for_one]]},
...(7)>     type: :supervisor,
...(7)>     restart: :permanent,
...(7)>     shutdown: 5000
...(7)> }
...
iex(8)> Supervisor.start_child(OurNewApp.Supervisor, hand_crafted_sup_spec)
{:ok, #PID<0.204.0>}
iex(9)> Supervisor.which_children(OurNewApp.Supervisor)
[
  {:hand_crafted_sup, #PID<0.204.0>, :supervisor, [Supervisor]},
  {OurNewApp.CounterSup, #PID<0.161.0>, :supervisor, [OurNewApp.CounterSup]}
]

The following took place:

We constructed hand_crafted_sup_spec, whose start entry points at Supervisor.start_link
We told our main supervisor to start a child with this spec
The main supervisor called Supervisor.start_link with children_specs as its argument, which started a new supervisor
The new supervisor started the counters from children_specs.

We could do this in another way: tell our main supervisor to launch an empty child supervisor, then add counters one by one to this child supervisor.
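That alternative could look roughly like this. The sketch is a standalone script rather than a continuation of our iex session: main_sup stands in for OurNewApp.Supervisor, and Agent processes stand in for the counters.

```elixir
# Stand-in for the main application supervisor.
{:ok, main_sup} = Supervisor.start_link([], strategy: :one_for_one)

# A child supervisor that starts with no children at all.
empty_sup_spec = %{
  id: :hand_crafted_sup,
  start: {Supervisor, :start_link, [[], [strategy: :one_for_one]]},
  type: :supervisor
}

{:ok, child_sup} = Supervisor.start_child(main_sup, empty_sup_spec)

# Add the workers one by one (Agents here, counters in the real app).
for n <- [10000, 20000, 30000] do
  spec = Supervisor.child_spec({Agent, fn -> n end}, id: n)
  {:ok, _pid} = Supervisor.start_child(child_sup, spec)
end

IO.inspect(Supervisor.count_children(child_sup))
```

One difference worth keeping in mind: children added at runtime are not part of the spec the child supervisor was started with, so if that supervisor itself is restarted, it comes back empty.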

The process tree at the end of the experiment should look like this:

Supervision Tree with handcrafted supervisor

Examples of Custom Supervisor Usage

Let's see what happens if we terminate our app.

First, add some logging to lib/our_new_app/counter.ex:

Elixir

  def terminate(reason, st) do
    Logger.info("terminating with #{inspect(reason)}, counter is #{st.current}")
  end

Also enable the :trap_exit flag for our counters, so that we can handle process termination — see terminate callback documentation:

Elixir

  def init(start_from) do
    Process.flag(:trap_exit, true)
 
    st = %{
    ...
  end
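To see why the flag matters, here is a small standalone experiment. TrapDemo is a hypothetical module made up for this sketch: it starts a single GenServer under a supervisor, stops the supervisor, and checks whether terminate/2 ran.

```elixir
defmodule TrapDemo do
  use GenServer

  def start_link({trap?, report_to}) do
    GenServer.start_link(__MODULE__, {trap?, report_to})
  end

  @impl true
  def init({trap?, report_to}) do
    Process.flag(:trap_exit, trap?)
    {:ok, report_to}
  end

  # Tell the process that started the demo that terminate/2 was called.
  @impl true
  def terminate(reason, report_to), do: send(report_to, {:terminated, reason})

  def run(trap?) do
    children = [{__MODULE__, {trap?, self()}}]
    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)
    Supervisor.stop(sup)

    receive do
      {:terminated, reason} -> {:terminate_called, reason}
    after
      100 -> :terminate_not_called
    end
  end
end

IO.inspect(TrapDemo.run(true))
IO.inspect(TrapDemo.run(false))
```

With the flag set, the shutdown signal from the supervisor is turned into a message and the GenServer gets a chance to run terminate/2. Without it, the process simply dies when the signal arrives.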

Now, if we stop our application in the iex session, we see:

Shell

iex -S mix
...
iex(1)> Application.stop(:our_new_app)
19:35:43.544 [info]  terminating with :shutdown, counter is 20049
19:35:43.548 [info]  terminating with :shutdown, counter is 10050
19:35:43.548 [info]  Application our_new_app exited: :stopped
:ok

Imagine that we have to implement a graceful shutdown. Here, "graceful" means that each counter keeps counting until it reaches a number divisible by 10 (10, 20, 30, etc.) before it shuts down.

Of course, in our simple example, we may just send ticks to count to the nearest number divisible by 10 in terminate.

Instead, imagine that these events are external and emulate some metrics that we would prefer to aggregate consistently.

First, let's add the possibility of a graceful shutdown to the OurNewApp.Counter module:

Elixir

defmodule OurNewApp.Counter do
  use GenServer
  require Logger
 
  @interval 100
 
  def start_link(start_from) do
    GenServer.start_link(__MODULE__, start_from)
  end
 
  def get(pid) do
    GenServer.call(pid, :get)
  end
 
  def stop_gracefully(pid) do
    GenServer.call(pid, :stop_gracefully)
  end
 
  def init(start_from) do
    Process.flag(:trap_exit, true)
 
    st = %{
      current: start_from,
      timer: :erlang.start_timer(@interval, self(), :tick),
      terminator: nil
    }
 
    {:ok, st}
  end
 
  def handle_call(:get, _from, st) do
    {:reply, st.current, st}
  end
 
  def handle_call(:stop_gracefully, from, st) do
    if st.terminator do
      {:reply, :already_stopping, st}
    else
      {:noreply, %{st | terminator: from}}
    end
  end
 
  def handle_info({:timeout, _timer_ref, :tick}, st) do
    :erlang.cancel_timer(st.timer)
 
    new_current = st.current + 1
 
    if st.terminator && rem(new_current, 10) == 0 do
      # we are terminating
      GenServer.reply(st.terminator, :ok)
      {:stop, :normal, %{st | current: new_current, timer: nil}}
    else
      new_timer = :erlang.start_timer(@interval, self(), :tick)
      {:noreply, %{st | current: new_current, timer: new_timer}}
    end
  end
 
  def terminate(reason, st) do
    Logger.info("terminating with #{inspect(reason)}, counter is #{st.current}")
  end
end

Here we:

Add a terminator field to the state that keeps the address of the party that wants to stop the server
Set this field in the stop_gracefully handler
Continue counting until we get to a number divisible by 10
Respond to the terminating party and stop the server upon obtaining this number.
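The key trick in these steps is the deferred reply: handle_call returns :noreply, keeps the caller's from value, and answers later with GenServer.reply/2. Stripped of the counting logic, the pattern can be sketched like this. DeferredReply and its function names are invented for this illustration.

```elixir
defmodule DeferredReply do
  use GenServer

  def start_link(), do: GenServer.start_link(__MODULE__, nil)
  def wait(pid), do: GenServer.call(pid, :wait)
  def release(pid), do: GenServer.cast(pid, :release)

  @impl true
  def init(nil), do: {:ok, nil}

  # Do not reply yet: remember the caller and keep running.
  # (A second waiter would replace the first one here; the counter
  # above guards against that case with :already_stopping.)
  @impl true
  def handle_call(:wait, from, _waiting), do: {:noreply, from}

  # Nobody is waiting, so there is nothing to do.
  @impl true
  def handle_cast(:release, nil), do: {:noreply, nil}

  # Answer the caller that has been blocked in wait/1.
  def handle_cast(:release, from) do
    GenServer.reply(from, :released)
    {:noreply, nil}
  end
end

{:ok, pid} = DeferredReply.start_link()
waiter = Task.async(fn -> DeferredReply.wait(pid) end)
# Make sure the call has arrived before we release it.
Process.sleep(50)
DeferredReply.release(pid)
result = Task.await(waiter)
IO.inspect(result)
```

The caller stays blocked inside GenServer.call until the reply is sent, so the default call timeout of 5 seconds still applies to it.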

Let's see how that works for a single process:

Shell

iex -S mix
Erlang/OTP 23 [erts-11.0.4] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe]

iex(1)> {:ok, pid} = OurNewApp.Counter.start_link(10000)
{:ok, #PID<0.167.0>}
iex(2)> OurNewApp.Counter.stop_gracefully(pid)
:ok
iex(3)>
20:03:13.061 [info]  terminating with :normal, counter is 10120

Everything works fine. But what will stop all the counters gracefully when the application stops? According to the OTP docs, the application's prep_stop/1 callback (here, OurNewApp.Application.prep_stop/1) is called, if it is defined, before the application stops.

Let's add the desired functionality:

Elixir

  @impl true
  def prep_stop(st) do
    stop_tasks =
      for {_, pid, _, _} <- Supervisor.which_children(OurNewApp.CounterSup) do
        Task.async(fn ->
          :ok = OurNewApp.Counter.stop_gracefully(pid)
        end)
      end
 
    Task.await_many(stop_tasks)
 
    st
  end

We also set the :restart option to :transient in OurNewApp.CounterSup so that our counters do not restart after a graceful shutdown:

Elixir

Supervisor.child_spec({OurNewApp.Counter, start_number},
  id: start_number,
  restart: :transient
)

Try to stop the app:

Shell

iex -S mix
iex(1)> Application.stop(:our_new_app)
 
20:24:02.958 [info]  terminating with :normal, counter is 10260
 
20:24:02.958 [info]  terminating with :normal, counter is 20260
 
20:24:02.962 [info]  Application our_new_app exited: :stopped
:ok

Now we only stop at numbers divisible by 10.

Using a special supervisor for our counters makes it possible to "find" all the instances quickly and operate on them. This is especially important when an application goes through its start and stop stages.
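As a small standalone sketch of that idea, the script below asks a supervisor for its children and reads a value from each of them. Agent processes stand in for the counters, so Agent.get/2 plays the role of OurNewApp.Counter.get/1.

```elixir
children = for n <- [10000, 20000], do: Supervisor.child_spec({Agent, fn -> n end}, id: n)
{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

values =
  # The pid slot can also hold :undefined or :restarting,
  # so we only talk to children that are actually running.
  for {_id, pid, :worker, _modules} <- Supervisor.which_children(sup), is_pid(pid) do
    Agent.get(pid, fn value -> value end)
  end

IO.inspect(Enum.sort(values))
```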

Wrap-up

I hope you've managed to wrap your head around supervisors in Elixir, and have found this article helpful, alongside the previous article on hot code reloading.

There is another, even more complicated stage in an application life cycle: application code upgrades. We'll leave that for the third and final part of this series.

Until then, enjoy coding!
