We briefly touched on supervision when we talked about processes in the first edition of Elixir Alchemy. In this edition, we'll take it a step further by explaining how supervision works in Elixir, and we'll give you an introduction to building fault-tolerant applications.
A phrase you'll likely run into when reading up on fault tolerance in Elixir and Erlang is "Let it crash". Instead of preventing exceptions from happening, or catching them immediately when they occur, you're usually advised not to do any defensive programming. That might sound counter-intuitive — how do crashing processes help with building fault tolerant applications? Supervisors are the answer.
Fault tolerance
Instead of taking down the whole system when one of its components fails, fault-tolerant applications can recover from exceptions by restarting the affected parts while the rest of the system keeps running.
In Elixir, supervisors are tasked with restarting processes when they fail. Instead of trying to handle all possible exceptions within a process, the "Let it crash" philosophy shifts the burden of recovering from such failures to the process's supervisor.
The supervisor makes sure the process is restarted if needed, bringing it back to its initial state, ready to accept new messages.
Supervisors
To see how supervisors work, we'll use a GenServer with some state. We'll implement a cast to store a value, and a call to retrieve that value later.
When started, our GenServer sets its initial state to :empty and registers itself by the name :cache, so we can refer to it later.
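A minimal sketch of such a GenServer could look like the following. The message shapes (:get and {:save, value}) and start_link/0 come from the IEx session in this section; the rest of the module layout is an assumption, not necessarily the exact listing this article was written against.

```elixir
defmodule Cache do
  use GenServer

  # Start the process with :empty as its initial state and register it as :cache.
  def start_link do
    GenServer.start_link(__MODULE__, :empty, name: :cache)
  end

  @impl true
  def init(state), do: {:ok, state}

  # Cast: store a value, replacing whatever was stored before.
  @impl true
  def handle_cast({:save, value}, _state), do: {:noreply, value}

  # Call: reply with the stored value and leave the state unchanged.
  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}
end
```

Because the whole state is just the stored value, a restart that re-runs start_link/0 naturally resets it to :empty.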
Let's jump into IEx to supervise our GenServer. The Supervisor has to be started with a list of workers. In our case, we'll use a single worker with the module name (Cache), and an empty list of arguments (because Cache.start_link/0 doesn't take any).
$ iex -S mix
iex(1)> import Supervisor.Spec
Supervisor.Spec
iex(2)> {:ok, _pid} = Supervisor.start_link([worker(Cache, [])], strategy: :one_for_one)
{:ok, #PID<0.120.0>}
iex(3)> GenServer.call(:cache, :get)
:empty
iex(4)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(5)> GenServer.call(:cache, :get)
:hola
If the process crashes, our supervisor will automatically restart it. Let's try that by killing the process manually.
...
iex(6)> pid = Process.whereis(:cache)
#PID<0.121.0>
iex(7)> Process.exit(pid, :kill)
true
iex(8)> GenServer.call(:cache, :get)
:empty
iex(9)> Process.whereis(:cache)
#PID<0.127.0>
As you can see, the :cache process was restarted by our supervisor immediately when it crashed, and getting its value revealed that it returned to its initial state (:empty).
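The same kill-and-restart experiment can be sketched as a script. On recent Elixir versions Supervisor.Spec is deprecated, so this sketch passes a child specification map instead of worker/2. To stay self-contained it supervises an Agent registered as :cache_demo rather than the Cache module; the RestartDemo module and that name are hypothetical and exist only for illustration.

```elixir
defmodule RestartDemo do
  # Hypothetical helper module, not part of the article's project.
  def run do
    children = [
      # Child specification map: an id plus the {module, function, args} used to start it.
      %{
        id: :cache_demo,
        start: {Agent, :start_link, [fn -> :empty end, [name: :cache_demo]]}
      }
    ]

    {:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

    # Change the state, then kill the process behind the registered name.
    Agent.update(:cache_demo, fn _ -> :hola end)
    old_pid = Process.whereis(:cache_demo)
    Process.exit(old_pid, :kill)

    # The restart is asynchronous, so wait until the name points at a new pid.
    new_pid = wait_for_restart(:cache_demo, old_pid)

    {old_pid, new_pid, Agent.get(:cache_demo, & &1)}
  end

  defp wait_for_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_restart(name, old_pid)
    end
  end
end
```

Calling RestartDemo.run() returns the old pid, the new pid, and the state read after the restart, which is back to :empty because the supervisor starts the child from its original specification.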
Dynamic Supervisor
In our first example, the process we supervised was built to run indefinitely. In some cases, however, you'd want your application to spawn processes when needed, and shut them down when their work is done.
Imagine that we want to track football matches. When a match starts, we'll start a process. We'll send messages to that process to update the score, and this process will live until the match ends.
To try this out, we'll define another GenServer named FootballMatchTracker, which we can use to store and fetch the current score for both teams.
Next, we'll implement a supervisor for FootballMatchTracker.
Each FootballMatchTracker will be registered with a match identifier that will be given to it through its initialization. Since the Supervisor behaviour is a GenServer under the hood, we can use all its features like registering it using a name, like we did before. In this case, we'll use :football_match_supervisor.
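The tracker and its supervisor described above could be sketched as follows. The function names get_score/1 and new_event/2, the "home_goal" and "end" events, and the {:event, event} cast are taken from the IEx sessions in this section; the "away_goal" event and the start_match/1 helper are assumptions added for illustration. This sketch also swaps in DynamicSupervisor, which recent Elixir versions recommend over the :simple_one_for_one strategy, so children are started through start_match/1 rather than through a Supervisor.start_child/2 call with a list of arguments.

```elixir
defmodule FootballMatchTracker do
  use GenServer

  # Register the process under the match identifier it is given.
  def start_link(match_id: match_id) do
    GenServer.start_link(__MODULE__, :ok, name: match_id)
  end

  def new_event(match_id, event), do: GenServer.cast(match_id, {:event, event})
  def get_score(match_id), do: GenServer.call(match_id, :get_score)

  @impl true
  def init(:ok), do: {:ok, %{home_score: 0, away_score: 0}}

  @impl true
  def handle_call(:get_score, _from, state), do: {:reply, state, state}

  # Unknown events raise a CaseClauseError, which crashes the process on purpose.
  @impl true
  def handle_cast({:event, event}, state) do
    case event do
      "home_goal" -> {:noreply, %{state | home_score: state.home_score + 1}}
      # "away_goal" is an assumed event name; it does not appear in the sessions.
      "away_goal" -> {:noreply, %{state | away_score: state.away_score + 1}}
      "end" -> {:stop, :normal, state}
    end
  end
end

defmodule FootballMatchSupervisor do
  use DynamicSupervisor

  def start_link do
    DynamicSupervisor.start_link(__MODULE__, :ok, name: :football_match_supervisor)
  end

  # Hypothetical helper: start one tracker per match, restarted only on abnormal exits.
  def start_match(match_id) do
    spec = %{
      id: FootballMatchTracker,
      start: {FootballMatchTracker, :start_link, [[match_id: match_id]]},
      restart: :transient
    }

    DynamicSupervisor.start_child(:football_match_supervisor, spec)
  end

  @impl true
  def init(:ok), do: DynamicSupervisor.init(strategy: :one_for_one)
end
```

With restart: :transient, a tracker that stops with reason :normal (the "end" event) stays down, while one that crashes on an unknown event is started again with the same match identifier.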
Let's take our supervisor for a spin. We'll start a child with a :match_id, check the initial state, and add a home goal.
$ iex -S mix
iex(1)> FootballMatchSupervisor.start_link()
{:ok, #PID<0.119.0>}
iex(2)> Supervisor.start_child(:football_match_supervisor, [[match_id: :match_123]])
{:ok, #PID<0.121.0>}
iex(3)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 0}
iex(4)> FootballMatchTracker.new_event(:match_123, "home_goal")
:ok
iex(5)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 1}
When we send an unknown message ("goal" is not implemented in our GenServer), we'll get an exception, and the process will crash.
iex(6)> FootballMatchTracker.new_event(:match_123, "goal")
:ok
13:13:44.658 [error] GenServer :match_123 terminating
** (UndefinedFunctionError) function FootballMatchTracker.terminate/2 is undefined or private
    (supervisors_example) FootballMatchTracker.terminate({{:case_clause, "goal"}, [{FootballMatchTracker, :handle_cast, 2, [file: 'lib/football_match_tracker.ex', line: 24]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 601]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 667]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}, %{away_score: 0, home_score: 1})
    (stdlib) gen_server.erl:629: :gen_server.try_terminate/3
    (stdlib) gen_server.erl:795: :gen_server.terminate/7
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:event, "goal"}}
State: %{away_score: 0, home_score: 1}
iex(8)> Process.whereis(:match_123)
#PID<0.127.0>
Because we used :transient as the restart option, and :simple_one_for_one as the restart strategy for our supervisor, the supervisor's children will only be restarted on abnormal termination, like the exception above. Like before, the process is restarted, which brings it back to its initial state.
When we stop the process using the "end" message, the supervisor won't restart it.
iex(9)> FootballMatchTracker.new_event(:match_123, "end")
:ok
iex(10)> Process.whereis(:match_123)
nil
Inside the Supervisor
Now that we've seen some examples of how to use supervisors, let's take it a step further, and try to figure out how they work internally.
A supervisor is basically a GenServer with the capability of starting, supervising and restarting processes. The child processes are linked to the supervisor, meaning the supervisor receives an :EXIT message whenever one of its children crashes, which prompts it to restart that child.
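To make the link mechanism concrete, here is a small sketch of a process that traps exits and receives the {:EXIT, pid, reason} message from a linked child. A linked process only gets this as a regular message when it traps exits; otherwise an abnormal exit of the child takes it down too. The LinkDemo module name and the :boom reason are made up for this example.

```elixir
defmodule LinkDemo do
  # Hypothetical helper module for illustration only.
  def run do
    # Trap exits so the linked child's exit arrives as a message instead of killing us.
    previous = Process.flag(:trap_exit, true)

    # Start a linked process that exits abnormally right away.
    pid = spawn_link(fn -> exit(:boom) end)

    reason =
      receive do
        {:EXIT, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    # Restore the caller's original setting.
    Process.flag(:trap_exit, previous)
    reason
  end
end
```

A supervisor does essentially this in a loop: it waits for such messages and reacts by starting a replacement child.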
So, if we want to implement our own supervisor, we need to start a linked process for each of its children. If one crashes, we'll catch the :EXIT message, and we'll start it again.
Let's try it with our Cache module:
$ iex -S mix
iex(1)> MySupervisor.start_link([children: [Cache]], [])
{:ok, #PID<0.108.0>}
iex(2)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(3)> Process.whereis(:cache)
#PID<0.109.0>
If we kill the process, like we did before, our custom supervisor will automatically restart it.
iex(4)> :cache |> Process.whereis |> Process.exit(:kill)
Exit pid: #PID<0.109.0> reason: :killed
true
iex(5)> Process.whereis(:cache)
#PID<0.113.0>
iex(6)> GenServer.call(:cache, :get)
:empty
- Our supervisor receives a list of child modules through the start_link/2 function, which are started by the init/1 function.
- By calling Process.flag(:trap_exit, true), we make sure the supervisor doesn't crash when one of its children does.
- Instead, the supervisor will receive an :EXIT message. When that happens, our supervisor looks up the module of the crashed process in its state and starts it again in a new process.
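The three steps above can be sketched as a small GenServer. The start_link/2 signature, the children: option, and the "Exit pid: ... reason: ..." output follow the IEx session; keeping a map from pid to module as the state is an assumption about how the bookkeeping might be done. To stay self-contained, the sketch includes a stand-in child, DemoWorker (a hypothetical module wrapping an Agent), in place of Cache.

```elixir
defmodule MySupervisor do
  use GenServer

  def start_link(args, opts) do
    GenServer.start_link(__MODULE__, args, opts)
  end

  @impl true
  def init(children: children) do
    # Trap exits so a crashing child sends us a message instead of taking us down.
    Process.flag(:trap_exit, true)

    # Start every child linked to this process and remember which module owns which pid.
    state =
      Map.new(children, fn module ->
        {:ok, pid} = module.start_link()
        {pid, module}
      end)

    {:ok, state}
  end

  @impl true
  def handle_info({:EXIT, pid, reason}, state) do
    IO.puts("Exit pid: #{inspect(pid)} reason: #{inspect(reason)}")

    case Map.pop(state, pid) do
      # Not one of our children: nothing to restart.
      {nil, state} ->
        {:noreply, state}

      # Start the same module again and track its new pid.
      {module, state} ->
        {:ok, new_pid} = module.start_link()
        {:noreply, Map.put(state, new_pid, module)}
    end
  end
end

defmodule DemoWorker do
  # Hypothetical stand-in child: an Agent registered as :demo_worker.
  def start_link, do: Agent.start_link(fn -> :empty end, name: :demo_worker)
end
```

Unlike the built-in Supervisor, this sketch restarts a child whatever the exit reason is and has no limit on how often it restarts, so it is meant to show the mechanism rather than to replace the real thing.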
Conclusion
By learning how to use the Supervisor behaviour module, we learned quite a bit about building fault-tolerant applications in Elixir. Of course, there's more to supervisors than we could cover in this article, and the different options (like restarting strategies) can be found in the Elixir documentation.
We’d love to know how you liked this article, if you have any questions about it, and what you’d like to read about next, so be sure to let us know at @AppSignal.