Supervisors: Building fault-tolerant Elixir applications

Gonzalo Jiménez Fuentes

We briefly touched on supervision when we talked about processes in the first edition of Elixir Alchemy. In this edition, we'll take it a step further by explaining how supervision works in Elixir, and we'll give you an introduction to building fault-tolerant applications.

A phrase you'll likely run into when reading up on fault tolerance in Elixir and Erlang is "Let it crash". Rather than trying to prevent every exception, or catching each one as soon as it occurs, you're usually advised to avoid defensive programming. That might sound counter-intuitive: how do crashing processes help with building fault-tolerant applications? Supervisors are the answer.

Fault tolerance

Instead of taking down the whole system when one of its components fails, fault-tolerant applications can recover from exceptions by restarting the affected parts while the rest of the system keeps running.

In Elixir, supervisors are tasked with restarting processes when they fail. Instead of trying to handle all possible exceptions within a process, the "Let it crash" philosophy shifts the burden of recovering from such failures to the process's supervisor.

The supervisor makes sure the process is restarted if needed, bringing it back to its initial state, ready to accept new messages.


To see how supervisors work, we'll use a GenServer with some state. We'll implement a cast to store a value, and a call to retrieve that value later.

When started, our GenServer sets its initial state to :empty and registers itself by the name :cache, so we can refer to it later.

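    # lib/cache.ex
    defmodule Cache do
      use GenServer

      def start_link() do
        GenServer.start_link(__MODULE__, :empty, [name: :cache])
      end

      # The second argument to GenServer.start_link/3 arrives here
      # and becomes the initial state.
      def init(state) do
        {:ok, state}
      end

      # Call: reply with the stored value and keep the state unchanged.
      def handle_call(:get, _from, state) do
        {:reply, state, state}
      end

      # Cast: replace the stored value.
      def handle_cast({:save, new}, _state) do
        {:noreply, new}
      end
    end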

Let's jump into IEx to supervise our GenServer. The Supervisor has to be started with a list of workers. In our case, we'll use a single worker with the module name (Cache), and an empty list of arguments (because Cache.start_link/0 doesn't take any).

    $ iex -S mix
    iex(1)> import Supervisor.Spec
    Supervisor.Spec
    iex(2)> {:ok, _pid} = Supervisor.start_link([worker(Cache, [])], strategy: :one_for_one)
    {:ok, #PID<0.120.0>}
    iex(3)> GenServer.call(:cache, :get)
    :empty
    iex(4)> GenServer.cast(:cache, {:save, :hola})
    :ok
    iex(5)> GenServer.call(:cache, :get)
    :hola

If the process crashes, our supervisor will automatically restart it. Let's try that by killing the process manually.

    ...
    iex(6)> pid = Process.whereis(:cache)
    #PID<0.121.0>
    iex(7)> Process.exit(pid, :kill)
    true
    iex(8)> GenServer.call(:cache, :get)
    :empty
    iex(9)> Process.whereis(:cache)
    #PID<0.127.0>

As you can see, the :cache process was restarted by our supervisor immediately when it crashed, and getting its value revealed that it returned to its initial state (:empty).
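The session above relies on Supervisor.Spec, which has been deprecated since Elixir 1.5. As a rough sketch of the same experiment on a newer release, the script below lists the module directly as a child and lets `use GenServer` generate the child specification. CacheSketch, RestartDemo and the :cache_sketch name are hypothetical names made up for this illustration, and the polling helper is just one way to wait for the restart in a script.

```elixir
defmodule CacheSketch do
  # `use GenServer` also defines child_spec/1 (Elixir 1.5+), so the
  # module itself can be listed as a supervisor child.
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :empty, name: :cache_sketch)
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:save, new}, _state), do: {:noreply, new}
end

defmodule RestartDemo do
  # Kills the process registered under `name` and blocks until the
  # supervisor has registered a replacement. Returns the new pid.
  def kill_and_wait(name) do
    old_pid = Process.whereis(name)
    ref = Process.monitor(old_pid)
    Process.exit(old_pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
    end

    wait_for_new_pid(name, old_pid)
  end

  defp wait_for_new_pid(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _not_restarted_yet ->
        Process.sleep(10)
        wait_for_new_pid(name, old_pid)
    end
  end
end

{:ok, _sup} = Supervisor.start_link([CacheSketch], strategy: :one_for_one)

GenServer.cast(:cache_sketch, {:save, :hola})
IO.inspect(GenServer.call(:cache_sketch, :get))  # :hola

_new_pid = RestartDemo.kill_and_wait(:cache_sketch)
IO.inspect(GenServer.call(:cache_sketch, :get))  # :empty
```

Waiting for the :DOWN message before polling avoids reading the old pid by accident. In IEx the restart is usually finished by the time you type the next command, so the helper is only needed in scripts.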

Dynamic Supervisor

In our first example, the process we supervised was built to run indefinitely. In some cases, however, you'd want your application to spawn processes when needed, and shut them down when their work is done.

Imagine that we want to track football matches. When a match starts, we'll start a process. We'll send messages to that process to update the score, and this process will live until the match ends.

To try this out, we'll define another GenServer named FootballMatchTracker, which we can use to store and fetch the current score for both teams.

    # lib/football_match_tracker.ex
    defmodule FootballMatchTracker do
      # This module does not `use GenServer`, so no default callbacks
      # (such as terminate/2) are defined for it.

      def start_link([match_id: match_id]) do
        GenServer.start_link(__MODULE__, :ok, [name: match_id])
      end

      def new_event(match_id, event) do
        GenServer.cast(match_id, {:event, event})
      end

      def get_score(match_id) do
        GenServer.call(match_id, :get_score)
      end

      def init(:ok) do
        {:ok, %{home_score: 0, away_score: 0}}
      end

      def handle_call(:get_score, _from, state) do
        {:reply, state, state}
      end

      def handle_cast({:event, event}, state) do
        new_state =
          case event do
            "home_goal" -> %{state | home_score: state[:home_score] + 1}
            "away_goal" -> %{state | away_score: state[:away_score] + 1}
            # Ask the supervisor to shut this child down.
            "end" -> Supervisor.terminate_child(:football_match_supervisor, self())
          end

        {:noreply, new_state}
      end
    end

Next, we'll implement a supervisor for FootballMatchTracker.

    # lib/football_match_supervisor.ex
    defmodule FootballMatchSupervisor do
      use Supervisor

      def start_link do
        Supervisor.start_link(__MODULE__, [], [name: :football_match_supervisor])
      end

      def init([]) do
        # worker/3 and supervise/2 come from Supervisor.Spec.
        children = [
          worker(FootballMatchTracker, [], restart: :transient)
        ]

        supervise(children, strategy: :simple_one_for_one)
      end
    end

Each FootballMatchTracker is registered under the match identifier that is passed to its start_link/1 function. Since the Supervisor behaviour is a GenServer under the hood, we can use GenServer features such as name registration, like we did before. In this case, we'll register the supervisor as :football_match_supervisor.
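On Elixir 1.6 and later, the :simple_one_for_one strategy is deprecated in favor of the DynamicSupervisor module. The script below is a minimal sketch of the same on-demand setup under that assumption, not the code used in this article. MatchTrackerSketch and :match_supervisor_sketch are hypothetical names, and the "end" event is handled by stopping with reason :normal instead of calling Supervisor.terminate_child/2.

```elixir
defmodule MatchTrackerSketch do
  # restart: :transient ends up in the generated child_spec/1, so the
  # supervisor only restarts this process after an abnormal exit.
  use GenServer, restart: :transient

  def start_link(opts) do
    match_id = Keyword.fetch!(opts, :match_id)
    GenServer.start_link(__MODULE__, :ok, name: match_id)
  end

  def new_event(match_id, event), do: GenServer.cast(match_id, {:event, event})
  def get_score(match_id), do: GenServer.call(match_id, :get_score)

  @impl true
  def init(:ok), do: {:ok, %{home_score: 0, away_score: 0}}

  @impl true
  def handle_call(:get_score, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:event, "home_goal"}, state) do
    {:noreply, %{state | home_score: state.home_score + 1}}
  end

  def handle_cast({:event, "away_goal"}, state) do
    {:noreply, %{state | away_score: state.away_score + 1}}
  end

  # A :normal stop is a clean exit, so a :transient child stays down.
  def handle_cast({:event, "end"}, state) do
    {:stop, :normal, state}
  end
end

# The supervisor starts without children; they are added on demand.
{:ok, _sup} =
  DynamicSupervisor.start_link(strategy: :one_for_one, name: :match_supervisor_sketch)

{:ok, _pid} =
  DynamicSupervisor.start_child(
    :match_supervisor_sketch,
    {MatchTrackerSketch, match_id: :match_123}
  )

MatchTrackerSketch.new_event(:match_123, "home_goal")
IO.inspect(MatchTrackerSketch.get_score(:match_123))  # home_score is now 1
```

As in the article's version, an event that matches no clause still crashes the process, which leaves the recovery to the supervisor.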

Let's take our supervisor for a spin. We'll start a child with a :match_id, check the initial state, and add a home goal.

    $ iex -S mix
    iex(1)> FootballMatchSupervisor.start_link()
    {:ok, #PID<0.119.0>}
    iex(2)> Supervisor.start_child(:football_match_supervisor, [[match_id: :match_123]])
    {:ok, #PID<0.121.0>}
    iex(3)> FootballMatchTracker.get_score(:match_123)
    %{away_score: 0, home_score: 0}
    iex(4)> FootballMatchTracker.new_event(:match_123, "home_goal")
    :ok
    iex(5)> FootballMatchTracker.get_score(:match_123)
    %{away_score: 0, home_score: 1}

When we send an unknown message ("goal" is not implemented in our GenServer), we'll get an exception, and the process will crash.

    iex(6)> FootballMatchTracker.new_event(:match_123, "goal")
    :ok
    13:13:44.658 [error] GenServer :match_123 terminating
    ** (UndefinedFunctionError) function FootballMatchTracker.terminate/2 is undefined or private
        (supervisors_example) FootballMatchTracker.terminate({{:case_clause, "goal"}, [{FootballMatchTracker, :handle_cast, 2, [file: 'lib/football_match_tracker.ex', line: 24]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 601]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 667]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}, %{away_score: 0, home_score: 1})
        (stdlib) gen_server.erl:629: :gen_server.try_terminate/3
        (stdlib) gen_server.erl:795: :gen_server.terminate/7
        (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
    Last message: {:"$gen_cast", {:event, "goal"}}
    State: %{away_score: 0, home_score: 1}
    iex(8)> Process.whereis(:match_123)
    #PID<0.127.0>

Because we used :transient as the restart option, the supervisor only restarts its children on abnormal termination, like the exception above. The :simple_one_for_one strategy is what lets us start children on demand from a single child specification. Like before, the process is restarted, which brings it back to its initial state.
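To see both sides of :transient in one place, here is a small sketch that kills a child (abnormal exit) and then asks its replacement to stop normally. It assumes Elixir 1.6 or later for DynamicSupervisor. TransientSketch, TransientDemo and the :transient_sketch name are hypothetical, and the polling helper only exists because restarts happen asynchronously.

```elixir
defmodule TransientSketch do
  use GenServer, restart: :transient

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, :ok}

  # Stopping with reason :normal counts as a clean exit.
  @impl true
  def handle_cast(:finish, state), do: {:stop, :normal, state}
end

defmodule TransientDemo do
  # Returns {restarted_after_kill, gone_after_normal_stop}.
  def run do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    {:ok, first_pid} =
      DynamicSupervisor.start_child(sup, {TransientSketch, :transient_sketch})

    # Abnormal exit: a replacement should appear under the same name.
    Process.exit(first_pid, :kill)

    restarted =
      wait_until(fn ->
        pid = Process.whereis(:transient_sketch)
        is_pid(pid) and pid != first_pid
      end)

    # Normal exit: the supervisor should drop the child instead.
    GenServer.cast(:transient_sketch, :finish)

    gone =
      wait_until(fn ->
        DynamicSupervisor.count_children(sup).active == 0
      end)

    {restarted, gone}
  end

  # Polls `fun` every 10 ms until it returns a truthy value, giving up
  # (and returning false) once the attempts run out.
  defp wait_until(fun, attempts \\ 100) do
    cond do
      fun.() ->
        true

      attempts == 0 ->
        false

      true ->
        Process.sleep(10)
        wait_until(fun, attempts - 1)
    end
  end
end

IO.inspect(TransientDemo.run())  # {true, true}
```

With restart: :permanent the second check would time out, because the supervisor would bring the child back after the normal stop as well.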

When we stop the process using the "end" message, the supervisor won't restart it.

    iex(9)> FootballMatchTracker.new_event(:match_123, "end")
    :ok
    iex(10)> Process.whereis(:match_123)
    nil

Inside the Supervisor

Now that we've seen some examples of how to use supervisors, let's take it a step further, and try to figure out how they work internally.

A supervisor is basically a GenServer with the capability of starting, supervising and restarting processes. The child processes are linked to the supervisor, and because the supervisor traps exits, it receives an :EXIT message whenever one of its children crashes, which prompts it to restart that child.

So, if we want to implement our own supervisor, we need to start a linked process for each of its children. If one crashes, we'll catch the :EXIT message, and we'll start it again.

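    defmodule MySupervisor do
      use GenServer

      def start_link(args, opts) do
        GenServer.start_link(__MODULE__, args, opts)
      end

      def init([children: children]) do
        # Trap exits so a crashing child arrives as an :EXIT message
        # instead of taking this process down with it.
        Process.flag(:trap_exit, true)

        # Start every child (linked to this process) and remember
        # which module belongs to which pid.
        state =
          Enum.map(children, fn child ->
            {:ok, pid} = child.start_link()
            {pid, child}
          end)
          |> Enum.into(%{})

        {:ok, state}
      end

      def handle_info({:EXIT, from, reason}, state) do
        IO.puts "Exit pid: #{inspect from} reason: #{inspect reason}"

        # Look up the crashed child's module and start a replacement.
        child = state[from]
        {:ok, pid} = child.start_link()

        # Drop the dead pid so the state only tracks live children.
        {:noreply, state |> Map.delete(from) |> Map.put(pid, child)}
      end
    end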

Let's try it with our Cache module:

    $ iex -S mix
    iex(1)> MySupervisor.start_link([children: [Cache]], [])
    {:ok, #PID<0.108.0>}
    iex(2)> GenServer.cast(:cache, {:save, :hola})
    :ok
    iex(3)> Process.whereis(:cache)
    #PID<0.109.0>

If we kill the process, like we did before, our custom supervisor will automatically restart it.

    iex(4)> :cache |> Process.whereis |> Process.exit(:kill)
    Exit pid: #PID<0.109.0> reason: :killed
    true
    iex(5)> Process.whereis(:cache)
    #PID<0.113.0>
    iex(6)> GenServer.call(:cache, :get)
    :empty

To recap what our supervisor does:

  1. Our supervisor receives a list of child modules through the start_link/2 function, which are started by the init/1 function.
  2. By calling Process.flag(:trap_exit, true), we make sure the supervisor doesn't crash when one of its children does.
  3. Instead, the supervisor receives an :EXIT message. When that happens, our supervisor looks up the crashed process's module in its state and starts it again in a new process.
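The mechanism behind steps 2 and 3 can be seen in isolation, without a GenServer. The sketch below links to a process that exits right away and shows the message the parent receives once it traps exits. TrapExitSketch is a hypothetical module written for this illustration.

```elixir
defmodule TrapExitSketch do
  # Starts a linked child that exits with reason :boom and returns what
  # the parent receives. Note that the :trap_exit flag stays set on the
  # calling process after this function returns.
  def run do
    Process.flag(:trap_exit, true)

    child = spawn_link(fn -> exit(:boom) end)

    receive do
      # Because exits are trapped, the exit signal is delivered as a
      # regular message instead of terminating this process.
      {:EXIT, ^child, reason} -> {:child_exited, reason}
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(TrapExitSketch.run())  # {:child_exited, :boom}
```

Without the Process.flag/2 call, the exit signal from the linked child would terminate the parent too, which is the behaviour a supervisor has to avoid.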


By learning how to use the Supervisor behaviour module, we learned quite a bit about building fault-tolerant applications in Elixir. Of course, there's more to supervisors than we could cover in this article, and the different options (like restart strategies) can be found in the Elixir documentation.

We’d love to know how you liked this article, if you have any questions about it, and what you’d like to read about next, so be sure to let us know at @AppSignal.
