Supervisors: Building fault-tolerant Elixir applications

We briefly touched on supervision when we talked about processes in the first edition of Elixir Alchemy. In this edition, we'll take it a step further by explaining how supervision works in Elixir, and we'll introduce you to building fault-tolerant applications.

A phrase you'll likely run into when reading up on fault tolerance in Elixir and Erlang is "Let it crash". Instead of preventing exceptions from happening, or catching them as soon as they occur, you're usually advised to avoid defensive programming. That might sound counter-intuitive: how do crashing processes help with building fault-tolerant applications? Supervisors are the answer.

Fault tolerance

Instead of taking down the whole system when one of its components fails, fault-tolerant applications recover from exceptions by restarting the affected parts while the rest of the system keeps running.

In Elixir, supervisors are tasked with restarting processes when they fail. Instead of trying to handle all possible exceptions within a process, the "Let it crash" philosophy shifts the burden of recovering from such failures to the process's supervisor.

The supervisor makes sure the process is restarted if needed, bringing it back to its initial state, ready to accept new messages.

Supervisors

To see how supervisors work, we'll use a GenServer with some state. We'll implement a cast to store a value, and a call to retrieve that value later.

When started, our GenServer sets its initial state to :empty and registers itself by the name :cache, so we can refer to it later.

Elixir

# lib/cache.ex
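defmodule Cache do
  use GenServer

  # Starts the cache with :empty as its initial state, registered as :cache.
  def start_link() do
    GenServer.start_link(__MODULE__, :empty, [name: :cache])
  end

  # Recent Elixir versions warn when init/1 is left undefined.
  def init(state) do
    {:ok, state}
  end

  # Replies with the stored value and leaves the state untouched.
  def handle_call(:get, _from, state) do
    {:reply, state, state}
  end

  # Replaces the stored value with the new one.
  def handle_cast({:save, new}, _state) do
    {:noreply, new}
  end
end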

Let's jump into IEx to supervise our GenServer. The Supervisor has to be started with a list of workers. In our case, we'll use a single worker with the module name (Cache), and an empty list of arguments (because Cache.start_link/0 doesn't take any).

Shell

$ iex -S mix
iex(1)> import Supervisor.Spec
Supervisor.Spec
iex(2)> {:ok, _pid} = Supervisor.start_link([worker(Cache, [])], strategy: :one_for_one)
{:ok, #PID<0.120.0>}
iex(3)> GenServer.call(:cache, :get)
:empty
iex(4)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(5)> GenServer.call(:cache, :get)
:hola
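
Supervisor.Spec, which provides worker/2, is deprecated in recent Elixir releases in favour of child specifications. As a minimal sketch, assuming a recent Elixir version, the same kind of supervision tree could be started with a child specification map. The names CacheSketch and :cache_sketch are made up for this example so they don't clash with the Cache module above.

```elixir
# A sketch, not the article's original code: CacheSketch and :cache_sketch
# are hypothetical names chosen for this example.
defmodule CacheSketch do
  use GenServer

  def start_link(_args) do
    GenServer.start_link(__MODULE__, :empty, name: :cache_sketch)
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:save, new}, _state), do: {:noreply, new}
end

# A child specification is a plain map; :id and :start are its required keys.
# :start is a {module, function, arguments} tuple.
children = [
  %{id: CacheSketch, start: {CacheSketch, :start_link, [[]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

GenServer.cast(:cache_sketch, {:save, :hola})
IO.inspect(GenServer.call(:cache_sketch, :get)) # prints :hola
```

Because `use GenServer` also defines a default child_spec/1, passing just the module name in the children list should work too; the explicit map is used here to show what the supervisor actually receives.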

If the process crashes, our supervisor will automatically restart it. Let's try that by killing the process manually.

Shell

...
iex(6)> pid = Process.whereis(:cache)
#PID<0.121.0>
iex(7)> Process.exit(pid, :kill)
true
iex(8)> GenServer.call(:cache, :get)
:empty
iex(9)> Process.whereis(:cache)
#PID<0.127.0>

As you can see, our supervisor restarted the :cache process as soon as it was killed, and fetching its value shows that it returned to its initial state (:empty).
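
In IEx, the time it takes to type the next command is enough for the restart to complete. In a script, the restart happens asynchronously, so the code has to wait for it. The following is a rough sketch of the same experiment outside IEx. It uses an Agent instead of the Cache module to stay short, and the name :cache_agent is an assumption made for this example.

```elixir
# Hypothetical setup for this sketch: an Agent registered as :cache_agent
# stands in for the article's Cache GenServer.
children = [
  %{
    id: :cache_agent,
    start: {Agent, :start_link, [fn -> :empty end, [name: :cache_agent]]}
  }
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
Agent.update(:cache_agent, fn _ -> :hola end)

old_pid = Process.whereis(:cache_agent)
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

# Block until the old process is confirmed dead.
receive do
  {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
end

# The supervisor restarts the child asynchronously, so poll until the
# name is registered again.
await_restart = fn await_restart ->
  case Process.whereis(:cache_agent) do
    nil ->
      Process.sleep(10)
      await_restart.(await_restart)

    pid ->
      pid
  end
end

new_pid = await_restart.(await_restart)

IO.inspect(new_pid != old_pid)             # prints true
IO.inspect(Agent.get(:cache_agent, & &1))  # prints :empty
```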

Dynamic Supervisor

In our first example, the process we supervised was built to run indefinitely. In some cases, however, you'd want your application to spawn processes when needed, and shut them down when their work is done.

Imagine that we want to track football matches. When a match starts, we'll start a process. We'll send messages to that process to update the score, and this process will live until the match ends.

To try this out, we'll define another GenServer named FootballMatchTracker, which we can use to store and fetch the current score for both teams.

Elixir

# lib/football_match_tracker.ex
defmodule FootballMatchTracker do
  # Registers the process under the given match id, so it can be addressed by name.
  def start_link([match_id: match_id]) do
    GenServer.start_link(__MODULE__, :ok, [name: match_id])
  end

  def new_event(match_id, event) do
    GenServer.cast(match_id, {:event, event})
  end

  def get_score(match_id) do
    GenServer.call(match_id, :get_score)
  end

  # Every match starts at 0-0.
  def init(:ok) do
    {:ok, %{home_score: 0, away_score: 0}}
  end

  def handle_call(:get_score, _from, state) do
    {:reply, state, state}
  end

  def handle_cast({:event, event}, state) do
    new_state =
      case event do
        "home_goal" -> %{state | home_score: state[:home_score] + 1}
        "away_goal" -> %{state | away_score: state[:away_score] + 1}
        # Asks the supervisor to shut this process down. The process exits
        # while waiting for the reply, so nothing after this call runs.
        "end" -> Supervisor.terminate_child(:football_match_supervisor, self())
      end
    {:noreply, new_state}
  end
end

Next, we'll implement a supervisor for FootballMatchTracker.

Elixir

# lib/football_match_supervisor.ex
defmodule FootballMatchSupervisor do
  use Supervisor
 
  def start_link do
    Supervisor.start_link(__MODULE__, [], [name: :football_match_supervisor])
  end
 
  def init([]) do
    children = [
      worker(FootballMatchTracker, [], restart: :transient)
    ]
 
    supervise(children, strategy: :simple_one_for_one)
  end
end

Each FootballMatchTracker is registered under the match identifier passed to its start_link/1 function. Since the Supervisor behaviour is a GenServer under the hood, it supports the same features, such as registering the process under a name like we did before. In this case, we'll use :football_match_supervisor.

Let's take our supervisor for a spin. We'll start a child with a :match_id, check the initial state, and add a home goal.

Shell

$ iex -S mix
iex(1)> FootballMatchSupervisor.start_link()
{:ok, #PID<0.119.0>}
iex(2)> Supervisor.start_child(:football_match_supervisor, [[match_id: :match_123]])
{:ok, #PID<0.121.0>}
iex(3)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 0}
iex(4)> FootballMatchTracker.new_event(:match_123, "home_goal")
:ok
iex(5)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 1}
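
In recent Elixir versions, the :simple_one_for_one strategy is deprecated in favour of the DynamicSupervisor module. As a hedged sketch, assuming Elixir 1.6 or later, the same idea could look like this. MatchTrackerSketch and :match_sketch are hypothetical names, and stopping with {:stop, :normal, state} on the "end" event is a choice made for this sketch rather than the article's approach.

```elixir
# A sketch assuming Elixir 1.6 or later. MatchTrackerSketch and :match_sketch
# are hypothetical names, not part of the article's code.
defmodule MatchTrackerSketch do
  # restart: :transient ends up in the child_spec/1 that `use GenServer` defines.
  use GenServer, restart: :transient

  def start_link(match_id: match_id) do
    GenServer.start_link(__MODULE__, :ok, name: match_id)
  end

  @impl true
  def init(:ok), do: {:ok, %{home_score: 0, away_score: 0}}

  @impl true
  def handle_call(:get_score, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:event, "home_goal"}, state) do
    {:noreply, %{state | home_score: state.home_score + 1}}
  end

  def handle_cast({:event, "away_goal"}, state) do
    {:noreply, %{state | away_score: state.away_score + 1}}
  end

  # A :normal stop is not restarted for a :transient child.
  def handle_cast({:event, "end"}, state) do
    {:stop, :normal, state}
  end
end

# A DynamicSupervisor starts without children; they are added on demand.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{:ok, _pid} =
  DynamicSupervisor.start_child(sup, {MatchTrackerSketch, [match_id: :match_sketch]})

GenServer.cast(:match_sketch, {:event, "home_goal"})
score = GenServer.call(:match_sketch, :get_score)
IO.inspect(score.home_score) # prints 1
```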

When we send an event the tracker doesn't handle ("goal" has no matching clause in our GenServer), the case expression raises an exception, and the process crashes.

Shell

iex(6)> FootballMatchTracker.new_event(:match_123, "goal")
:ok
13:13:44.658 [error] GenServer :match_123 terminating
** (UndefinedFunctionError) function FootballMatchTracker.terminate/2 is undefined or private
    (supervisors_example) FootballMatchTracker.terminate({{:case_clause, "goal"}, [{FootballMatchTracker, :handle_cast, 2, [file: 'lib/football_match_tracker.ex', line: 24]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 601]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 667]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}, %{away_score: 0, home_score: 1})
    (stdlib) gen_server.erl:629: :gen_server.try_terminate/3
    (stdlib) gen_server.erl:795: :gen_server.terminate/7
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:event, "goal"}}
State: %{away_score: 0, home_score: 1}
iex(8)> Process.whereis(:match_123)
#PID<0.127.0>

Because we used :transient as the restart option, the supervisor only restarts its children on abnormal termination, like the exception above. The :simple_one_for_one strategy is what lets us start any number of these children dynamically from a single child specification. Like before, the process is restarted, which brings it back to its initial state.
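
To make the difference between normal and abnormal termination concrete, here is a small sketch with a :transient child. It uses an Agent registered under the made-up name :transient_sketch, and a regular supervisor with a child specification map, so it is an illustration under those assumptions rather than the article's setup. The abnormal stop logs an error report, which is expected.

```elixir
# Hypothetical names for this sketch: :transient_sketch is not part of the article.
children = [
  %{
    id: :transient_sketch,
    start: {Agent, :start_link, [fn -> 0 end, [name: :transient_sketch]]},
    restart: :transient
  }
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
first_pid = Process.whereis(:transient_sketch)

# Polls until the name is registered again after a restart.
await_restart = fn await_restart ->
  case Process.whereis(:transient_sketch) do
    nil ->
      Process.sleep(10)
      await_restart.(await_restart)

    pid ->
      pid
  end
end

# Abnormal exit: any reason other than :normal, :shutdown or {:shutdown, term}.
# The supervisor restarts the child.
Agent.stop(:transient_sketch, :boom)
second_pid = await_restart.(await_restart)
IO.inspect(second_pid != first_pid) # prints true

# Normal exit: a :transient child is not restarted.
Agent.stop(:transient_sketch, :normal)
Process.sleep(100) # give the supervisor time to react, in case it would restart
IO.inspect(Process.whereis(:transient_sketch)) # prints nil
```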

When we stop the process by sending the "end" event, the supervisor won't restart it.

Shell

iex(9)> FootballMatchTracker.new_event(:match_123, "end")
:ok
iex(10)> Process.whereis(:match_123)
nil

Inside the Supervisor

Now that we've seen some examples of how to use supervisors, let's take it a step further, and try to figure out how they work internally.

A supervisor is basically a GenServer with the capability of starting, supervising and restarting processes. The child processes are linked to the supervisor, which traps exits, so it receives an :EXIT message whenever one of its children crashes. That message prompts it to restart the child.
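
The mechanism can be seen in isolation with a few lines. This is a minimal sketch using only spawn_link/1 and Process.flag/2; the exit reason :boom is arbitrary.

```elixir
# Without this flag, the exit signal from the linked process would take the
# current process down as well. With it, the signal arrives as a message.
Process.flag(:trap_exit, true)

pid = spawn_link(fn -> exit(:boom) end)

exit_reason =
  receive do
    {:EXIT, ^pid, reason} -> reason
  after
    1_000 -> :no_exit_message
  end

IO.inspect(exit_reason) # prints :boom
```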

So, if we want to implement our own supervisor, we need to start a linked process for each of its children. If one crashes, we'll catch the :EXIT message, and we'll start it again.

Elixir

defmodule MySupervisor do
  use GenServer

  def start_link(args, opts) do
    GenServer.start_link(__MODULE__, args, opts)
  end

  def init([children: children]) do
    Process.flag(:trap_exit, true) # for handling EXIT messages
    # Start each child module and remember which pid belongs to which module.
    state =
      Enum.map(children,
        fn child ->
          {:ok, pid} = child.start_link()
          {pid, child}
      end)
      |> Enum.into(%{})
    {:ok, state}
  end

  def handle_info({:EXIT, from, reason}, state) do
    IO.puts "Exit pid: #{inspect from} reason: #{inspect reason}"
    # Drop the dead pid from the state before starting its replacement.
    {child, state} = Map.pop(state, from)
    {:ok, pid} = child.start_link()
    {:noreply, Map.put(state, pid, child)}
  end
end

Let's try it with our Cache module:

Shell

$ iex -S mix
iex(1)> MySupervisor.start_link([children: [Cache]], [])
{:ok, #PID<0.108.0>}
iex(2)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(3)> Process.whereis(:cache)
#PID<0.109.0>

If we kill the process, like we did before, our custom supervisor will automatically restart it.

Shell

iex(4)> :cache |> Process.whereis |> Process.exit(:kill)
Exit pid: #PID<0.109.0> reason: :killed
true
iex(5)> Process.whereis(:cache)
#PID<0.113.0>
iex(6)> GenServer.call(:cache, :get)
:empty

Our supervisor receives a list of child modules through the start_link/2 function, which are started by the init/1 callback.
By calling Process.flag(:trap_exit, true), we make sure the supervisor doesn't crash when one of its children does.
Instead, the supervisor receives an :EXIT message. When that happens, our supervisor looks up the crashed process's module in its state and starts it again in a new process.

Conclusion

By learning how to use the Supervisor behaviour module, we learned quite a bit about building fault-tolerant applications in Elixir. Of course, there's more to supervisors than we could cover in this article, and the different options (like restart strategies) can be found in the Elixir documentation.

We’d love to know how you liked this article, if you have any questions about it, and what you’d like to read about next, so be sure to let us know at @AppSignal.