Node.js Resiliency Concepts: Recovery and Self-Healing

Andrei Gaspar

In an ideal world, one where we had reached 100% test coverage, our error handling was flawless, and every failure was handled gracefully, we wouldn't be having this discussion.

Yet, here we are. Earth, 2020. By the time you read this sentence, somebody's server failed in production. A moment of silence for the processes we lost.

In this post, I'll go through some concepts and tools which will make your servers more resilient and boost your process management skills.

👋 As you're diving into recovery and self-healing, you might want to dive into AppSignal for Node.js as well. We provide you with out-of-the-box support for Node.js Core, Express, Next.js, Apollo Server, node-postgres and node-redis.

Node Index.js

When you're starting out with Node.js, especially if you're new to working with servers, you'll probably want to run your app on the remote production server the very same way you run it in development.

Install Node.js, clone the repo, give it an npm install, and a node index.js (or npm start) to spin it all up.

I remember this seeming like a bulletproof plan for me starting out. If it works, why fix it, right?

My code would run into errors during development, resulting in crashes, but I fixed those bugs on the spot, so the code on the server is uncorrupted. It cannot crash. Once it starts up, that server is there to stay until the heat death of the universe.

Well, as you probably suspect, that was not the case.

I was facing two main problems that didn't cross my mind back then:

  • What happens if the VM/Host restarts?
  • Servers crash... That's like, their second most popular attribute. If they weren't serving anything, we would call them crashers.
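To make the second point concrete: in Node.js, an error that nothing catches terminates the process. The sketch below shows one common way to make such a crash visible. It logs the error and exits with a non-zero code, so that whatever is supervising the process can tell it failed. The installCrashHandler name and its injectable proc and log parameters are assumptions made for illustration; they are not part of any library.

```javascript
// Logs fatal errors and exits with a non-zero code.
// `proc` defaults to the real process object; it is injectable only so the
// behaviour can be exercised without killing the current process.
function installCrashHandler(proc = process, log = console.error) {
  const onFatal = (kind) => (err) => {
    log(`[fatal] ${kind}:`, err);
    // State may be inconsistent after an uncaught error, so exit instead of
    // continuing to serve. Exit code 1 marks the exit as a failure.
    proc.exit(1);
  };

  proc.on('uncaughtException', onFatal('uncaughtException'));
  proc.on('unhandledRejection', onFatal('unhandledRejection'));
}
```

Calling installCrashHandler() once at the top of index.js would be enough. The handler does not keep the server alive; it only makes sure the crash is logged and reported through the exit code, which is what the restart tools discussed below react to.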

Wolverine vs T-1000

Recovery can be tackled in many different ways. There are convenient solutions to restart our server after crashes, and there are more sophisticated approaches to make it indestructible in production.

Both Wolverine and the T-1000 can take a good beating, but their complexity and recovery rate are very different.

We're looking for distinct qualities based on the environment we're running in. For development, the goal is convenience. For production, it's usually resilience.

We're going to start with the simplest form of recovery and then slowly work our way up to elaborate orchestration solutions.

It is up to you how much effort you'd like to invest in your implementation, but it never hurts to have more tools at your disposal, so if this piques your interest, fasten your seatbelt, and let's dive in!

Solving Problems as They Arise

You're coding away, developing your amazing server.

After every couple of lines, you switch tabs and nudge it with a node index or npm start. This cycle of constant switching and nudging becomes crushingly tedious after a while.

Wouldn't it be nice if it just restarted on its own after you changed the code?

This is where lightweight packages like Nodemon and Node.js Supervisor come into play. You can install them with one line of code and start using them with the next.

To install Nodemon, simply type the below command in your terminal.

npm install -g nodemon

Once installed, just substitute the node command you've been using with the new nodemon command that you now have access to.

nodemon index.js

You can install Node.js Supervisor with a similar approach, by typing the command below.

npm install -g supervisor

Similarly, once installed you can just use the supervisor prefix to run your app.

supervisor index.js

Nodemon and Supervisor are both as useful as they are popular. The main difference is that after a crash, Nodemon waits for you to change a file before it restarts your process, while Supervisor restarts the process as soon as it crashes.

Your server is on the right track. Development speed quadrupled.

These packages do a great job covering development pain-points and they are pretty configurable as well. But the difficulties we are facing in development rarely overlap the ones we're facing in production.

When you deploy to the remote server, it feels like sending your kid to college as an overprotective parent. You want to know your server is healthy, safe, and eats all its veggies.

You'd like to know what problem it faced when it crashed, if it crashed. You want it to be in good hands.

Well, good news! This is where process managers come into the picture. They can babysit your server in production.

Process Management

When you run your app, a process is created.

While running it in development, you would usually open a terminal window and type a command in there. A foreground process is created and your app is running.

Now, if you closed that terminal window, your app would close with it. You'll also notice that the terminal window is blocked: you cannot enter another command until you stop the process with Ctrl + C.

The drawback is that the app is tied to the terminal window, but you're also able to read all the logs and errors that the process is throwing. So it's a glass half full.

However, on your production server, you'll want to run in the background, but then you'll lose the convenience of visibility. Frustration is assured.

Process management is tedious.

Luckily, we have process managers! They are processes that manage other processes for us. So meta! But ridiculously convenient.
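To get a feel for what "a process that manages another process" means, here is a minimal sketch of the core loop: start a child process, and start it again whenever it exits with a failure. It is a toy under stated assumptions, not how any real process manager is implemented. The supervise and backoffDelay names, the restart limit, and the delay values are all made up for illustration.

```javascript
const { spawn } = require('child_process');

// Exponential backoff, so an app that crashes on startup is not restarted
// in a tight loop: 100ms, 200ms, 400ms, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `command args` and restarts it after every failed exit,
// giving up once `maxRestarts` restarts have been used.
function supervise(command, args, { maxRestarts = 5 } = {}) {
  let restarts = 0;

  const start = () => {
    // stdio: 'inherit' keeps the child's logs visible in this terminal.
    const child = spawn(command, args, { stdio: 'inherit' });

    // Emitted when the process could not be spawned at all.
    child.on('error', (err) => {
      console.error(`Could not start ${command}:`, err.message);
    });

    child.on('exit', (code, signal) => {
      if (code === 0) return; // clean exit, nothing to heal
      if (restarts >= maxRestarts) {
        console.error(`Giving up after ${restarts} restarts`);
        return;
      }
      const delay = backoffDelay(restarts);
      restarts += 1;
      console.error(`Exited (code=${code}, signal=${signal}); restart #${restarts} in ${delay}ms`);
      setTimeout(start, delay);
    });
  };

  start();
}
```

Calling supervise(process.execPath, ['index.js']) would run your server under this loop. A real process manager adds the parts that make this usable in production, such as running in the background, collecting logs, and surviving its own restarts.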


PM2

The most popular process manager for Node.js is called pm2, and it's popular for a very good reason. It's great!

It's such a fantastic piece of software that it would take me a separate article to describe its awesomeness in its entirety, and just how many convenient features it has. Since we're focused on self-healing, I'll discuss the basics below, but I strongly encourage you to read up on it more in-depth and check all its amazing features.

Installing pm2 is just as easy as installing the packages we discussed above. Simply type the following line in your terminal.

npm install -g pm2

Running your app isn't rocket science either. Just type the command below, where index.js is your main server file.

pm2 start index.js

This time, you might notice something different though. Seemingly, nothing has happened, but if you go on to visit the endpoint to your app, you'll notice that it's up and running.

Remember when we discussed running the process in the background? That is exactly what is happening. pm2 started your server as a background process and it is now managing it for you.

As an added convenience, you can also use the --watch flag to make sure pm2 watches your files for changes and reloads your app to make sure it is always up to date.

To do so, you can use the exact command above, but with the flag appended to the end.

pm2 start index.js --watch

Now, pm2 is watching our files and restarts the process anytime the files change or the process crashes. Perfect! This is exactly what we're after.

It is doing a great job managing our server behind the scenes, but the lack of visibility is anxiety-inducing. What if you want to see your server logs?

pm2 has you covered. Their CLI is really powerful! I'll list some commands below to get you started.

Command            Description
pm2 list           Lists your applications. You'll see a numeric id associated with the applications managed by pm2. You can use that id in the commands you'd like to execute.
pm2 logs <id>      Checks the logs of your application.
pm2 stop <id>      Stops your process. (Just because the process is stopped, it doesn't mean it stopped existing. If you want to completely remove the process, you'll have to use delete.)
pm2 delete <id>    Deletes the process. (You don't need to stop and delete separately. You can go straight for delete, which will stop and delete the process for you.)

pm2 is insanely configurable and is able to perform Load Balancing and Hot Reload for you. You can read up on all the bells and whistles in their docs, but our pm2 journey comes to a halt here.

Disappointing, I know. But why? I hear you asking.

Remember how convenient it was to install pm2? We installed it using the Node.js package manager. Wink... Pistol finger. Wink-wink.

Wait. Are we using Node.js to monitor Node.js?

That sounds a bit like trusting your child to babysit itself. Is that a good idea? There is no objective answer to that question, but it sure sounds like there should be some other alternatives to be explored.

So, what next? Well, let's explore.


systemd

If you're planning to run on a good old Linux VM, I think it's worth mentioning systemd before jumping into the deep end of containers and orchestrators.

Otherwise, if you plan to run on a managed application environment (e.g. Azure App Service, AWS Lambda, GCP App Engine, Heroku, etc.), this will not be relevant to your use case, but it doesn't hurt to know about it.

So assuming that it's just you, your app, and a Linux VM, let's see what systemd can do for you.

Systemd can start, stop, and restart processes for you, which is exactly what we need. If your VM restarts, systemd makes sure that your app starts up again.

But first, let's make sure you have access to systemd on your VM.

Below is a list of Linux systems that make use of systemd:

  • Ubuntu Xenial (or newer)
  • CentOS 7 / RHEL 7
  • Debian Jessie (or newer)
  • Fedora 15 (or newer)

Let's be realistic, you're probably not using a Linux system from before the great flood, so you'll probably have systemd access.

The second thing that you need is a user with sudo privileges. I'm going to be referring to this user simply as user but you should substitute it with your own.

Since our user is called user and, for this example, I'm using Ubuntu, I'll be referring to your home directory as /home/user/ and I'll go with the assumption that your index.js file is located in your home directory.

The systemd Service File

The systemd file is a useful little file that we can create in the system area that holds the configuration to our service. It is really simple and straightforward, so let's try to set one up.

The systemd files are all located under the directory listed below.

/lib/systemd/system

Let's create a new file there with the editor of your choice and populate it with some content. Don't forget to use sudo as a prefix to your command! Everything here is owned by the root user.

Okay, let's start by going into the system directory.

cd /lib/systemd/system

Create a file for your service.

sudo nano myapp.service

And, let's populate it with some content.

# /lib/systemd/system/myapp.service

[Unit]
Description=My awesome server
Documentation=
After=network.target

[Service]
Environment=NODE_PORT=3000
Environment=NODE_ENV=production
Type=simple
User=user
ExecStart=/usr/bin/node /home/user/index.js
Restart=on-failure

[Install]
WantedBy=multi-user.target

If you glance through the configuration, it's pretty straightforward and self-explanatory, for the most part.

The two settings you might need some hints on are After and Type. After=network.target means that the service should wait for the networking part of the server to be up and running, because we need the port. The simple type just means don't do anything crazy, just start and run.
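The service file passes NODE_PORT and NODE_ENV to the app through its Environment lines, but how index.js picks them up is not shown here. The sketch below is one way it could be done. The readConfig name, the fallback to PORT (the variable the Dockerfile later in this post sets), and the defaults are assumptions for illustration.

```javascript
// Reads the settings that the service file passes in as environment variables.
// NODE_PORT wins, then PORT, then a default of 3000.
function readConfig(env = process.env) {
  const rawPort = env.NODE_PORT || env.PORT || '3000';
  const port = Number(rawPort);

  // Fail fast on a bad value instead of listening on an unexpected port.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${rawPort}`);
  }

  return { port, nodeEnv: env.NODE_ENV || 'development' };
}
```

Throwing on an invalid port makes the process exit with a failure during startup, which surfaces the misconfiguration in the service's status and logs instead of hiding it.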

Running Your App with systemctl

Now that our file has been created, let's tell systemd to pick up the changes from the newly created file. You'll have to do this each time you make a change to the file.

sudo systemctl daemon-reload

It is as simple as that. Now that it knows about our service, we should be able to use the systemctl command to start and stop it. We will be referring to it by the service file name.

sudo systemctl start myapp

If you'd like to stop it, you can substitute the start command with stop. If you'd like to restart it, type restart instead.
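When systemd stops a service, it asks the process to terminate by sending it SIGTERM before resorting to force. The sketch below shows how a Node.js server could use that moment to finish what it's doing and exit cleanly. The registerGracefulShutdown name, the injectable proc parameter, and the 10-second timeout are assumptions for illustration.

```javascript
// Closes the server when the process is asked to terminate.
// `server` is anything with a close(callback) method, such as an http.Server.
function registerGracefulShutdown(server, { proc = process, timeoutMs = 10000 } = {}) {
  let shuttingDown = false;

  const shutdown = (signal) => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.log(`Received ${signal}, closing server...`);

    // Force a failing exit if open connections keep close() from finishing.
    const timer = setTimeout(() => proc.exit(1), timeoutMs);
    timer.unref();

    server.close(() => {
      clearTimeout(timer);
      proc.exit(0); // exit code 0 reports a clean shutdown
    });
  };

  proc.on('SIGTERM', () => shutdown('SIGTERM'));
  proc.on('SIGINT', () => shutdown('SIGINT'));
}
```

In index.js, this would be called once with the server object returned by http.createServer, or by app.listen if you use Express. The same handler is useful later on as well, since containers are also asked to stop with SIGTERM.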

Now, on to the part we care most about. If you'd like your application to start up automatically when the VM boots, you should execute the command below.

sudo systemctl enable myapp

If you want that behavior to stop, just substitute enable with disable.

It is as simple as that!

So, now we have another system managing our process that is not Node.js itself. This is great! You can proudly give yourself a high five, or maybe an awkward elbow bump depending on the pandemic regulations while reading this article.

Our journey does not stop here though. There's still quite a lot of ground left uncovered, so let's slowly start diving into the world of containers and orchestration.

What are Containers?

To be able to move forward, you need to understand what Containers are and how they work.

There are a lot of container technologies out there, such as Mesos, CoreOS, LXC, and OpenVZ, but the one name that is truly synonymous with containers is Docker. At the time of writing, it reportedly makes up more than 80% of the containers used, and when people mention containers, it's safe to assume they are talking about Docker.

So, what do these containers do anyway?

Well, containers contain. They have a very simple and descriptive name in that sense.

Now the question remains, what do they contain?

Containers contain your application and all of its dependencies. Nothing more and nothing less. It is just your app and everything that your app needs to run.

Think about what your Node.js server needs to execute:

  • Node.js (duh)
  • Your index.js file
  • Probably your npm packages (dependencies)

So, if we were creating a container, we would want to make sure these things are present and contained.

If we had such a container ready, it could be spun up via the container engine (e.g. Docker).

Containers vs VMs, and Italian Cuisine

Even if you haven't worked much with Virtual Machines, I think you have a general idea about how they work. You've probably seen your friend running a Windows machine with Linux installed on it, or a macOS with an additional Windows installation, etc.

So the idea there is that you have your Physical Machine and an Operating System on top, which then contains your app and its dependencies.

Let's imagine we're making pizza.

  • The Machine is the Table
  • The OS is the Pizza Dough
  • And, your app together with its dependencies are the ingredients on top

Now, let's say you'd like to eat 5 types of pizza, what should you do?

The answer is to make 5 different pizzas on the same table. That's the VM's answer.

But here comes Docker and it says: "Hey, that's a lot of waste! You're not going to eat 5 pizzas, and making the dough is hard work. What about using the same dough?"

You might be thinking, hey, that's not a bad idea actually, but I don't want my friend's disgusting pineapple flavor (sorry, not sorry) spilling over into my yummy 4 cheese. The ingredients are conflicting!

And here's where Docker's genius comes in: "Don't worry! We'll contain them. Your 4 cheese part won't even know about the pineapple part."

So Docker's magic is that it's able to use the same underlying Physical Machine and Operating System to run well-contained applications of many different "flavors" without them ever conflicting with each other. And to keep exotic fruit off your pizza.

Alright, let's move on to creating our first Docker Container.

Creating a Docker Container

Creating a Docker container is really easy, but you'll need to have Docker installed on your machine.

You'll be able to install Docker regardless of your Operating System. It has support for Linux, Mac, and Windows, but I would strongly advise sticking to Linux for production.

Once you have Docker installed, it is time to create the container!

Docker looks for a specific file called Dockerfile and it will use it to create a recipe for your container that we call a Docker Image. So before we create a container, we'll have to create that file.

Let's create this file in the same directory we have our index.js file and package.json.

# Dockerfile

# Base image (we need Node)
FROM node:12

# Work directory
WORKDIR /usr/myapp

# Install dependencies
COPY ./package*.json ./
RUN npm install

# Copy app source code
COPY ./ ./

# Set environment variables you need (if you need any)
ENV NODE_ENV='production'
ENV PORT=3000

# Expose the port 3000 on the container so we can access it
EXPOSE 3000

# Specify your start command, divided by commas
CMD [ "node", "index.js" ]

It is smart to use a .dockerignore file in the same directory to ignore files and directories you might not want to copy. You can think of it as working the same way as .gitignore.

# .dockerignore

node_modules
npm-debug.log

Now that you have everything set up, it's time to build the Docker Image!

You can think of an image as a recipe for your container. Or, if you're old enough, you might remember having disks for software installers. It wasn't the actual software running on it, but it contained the packaged software data.

You can use the command below to create the image. The -t flag names your image, which makes it easier to find later. Also, make sure your terminal is open in the directory where your Dockerfile is located.

docker build -t myapp .

Now, if you list your images, you'll be able to see your image on the list.

docker image ls

If you have your image ready, you're just one command away from having your container up and running.

Let's execute the command below to spin it up.

docker run -p 3000:3000 myapp

You'll be able to see your server starting up with the container and read your logs in the process. If you'd like to spin it up in the background, use the -d flag before your image name.

Also, if you're running the container in the background, you can print a list of containers using the command below.

docker container ls

So far so good! I think you should have a pretty good idea about how containers work at this point, so instead of diving into the details, let's move ahead to a topic very closely tied to recovery: Orchestration!


Orchestration

If you don't have an operations background, chances are you're thinking about containers as some magical sophisticated components. And you would be right in thinking that. They are magical and complex. But it doesn't help to have that model in our minds, so it's time to change that.

It's best to think about them as the simplest components of our infrastructure, sort of like Lego blocks.

Ideally, you don't even want to be managing these Lego blocks individually because it's just too fiddly. You'd want another entity that handles them for you, sort of like the process manager that we discussed earlier.

This is where Orchestrators come into play.

Orchestrators help you manage and schedule your containers and they allow you to do this across multiple container hosts (VMs) distributed across multiple locations.

The orchestrator feature that interests us the most in this context is Replication!

Replication and High Availability

Restarting our server when it crashes is great, but what happens during the time our server is restarting? Should our users be waiting for the service to get back up? How do they know it will be back anyway?

Our goal is to make our service Highly Available, meaning that our users are able to use our app even if it crashes.

But how can it be used if it's down?

Simple. Make copies of your server and run them simultaneously!

This would be a headache to set up from scratch, but luckily, we have everything that we need to enable this mechanism. Once your app is containerized, you can run as many copies of it as you'd like.

These copies are called Replicas.
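Why replicas keep the app usable during a crash is easier to see with a toy model. The sketch below spreads requests across replicas in round-robin order and skips any replica that is marked as down. It only illustrates the idea; the ReplicaPool class and the replica names are made up, and this is not how any particular orchestrator implements its routing.

```javascript
// A toy model of spreading requests across replicas.
class ReplicaPool {
  constructor(names) {
    this.replicas = names.map((name) => ({ name, healthy: true }));
    this.cursor = 0;
  }

  // Marks a replica as crashed (false) or recovered (true).
  setHealthy(name, healthy) {
    const replica = this.replicas.find((r) => r.name === name);
    if (!replica) throw new Error(`Unknown replica: ${name}`);
    replica.healthy = healthy;
  }

  // Returns the next healthy replica in round-robin order,
  // or null when every replica is down.
  next() {
    for (let i = 0; i < this.replicas.length; i += 1) {
      const replica = this.replicas[this.cursor];
      this.cursor = (this.cursor + 1) % this.replicas.length;
      if (replica.healthy) return replica.name;
    }
    return null;
  }
}

const pool = new ReplicaPool(['replica-1', 'replica-2', 'replica-3']);
pool.setHealthy('replica-2', false); // simulate a crash
console.log([pool.next(), pool.next(), pool.next()]);
// → [ 'replica-1', 'replica-3', 'replica-1' ]
```

While replica-2 is restarting, requests keep landing on the other two. Users only notice an outage when next() returns null, which requires every replica to be down at the same time.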

So let's look into how we would set up something like this using a container orchestration engine. There are quite a few out there, but the easiest one to get started with is Docker's orchestration engine, Docker Swarm.

Replication in Swarm

If you have Docker installed on your machine, you're just one command away from using Docker Swarm.

docker swarm init

This command enables Docker Swarm for you and it allows you to form a distributed cluster by connecting other VMs to the Swarm. For this example, we can just use a single machine.

So, with Docker Swarm enabled, we now have access to the components called services. They are the bread and butter of a microservice style architecture, and they make it easy for us to create replicas.

Let's create a service! Remember the image name we used when we built our Docker image? It's the same image we're going to use here.

docker service create --name myawesomeservice --replicas 3 myapp

The command above will create a service named myawesomeservice and it will use the image named myapp to create 3 identical containers.

You'll be able to list your services with the command below.

docker service ls

You can see that there's a service with the name you specified.

To be able to see the containers that have been created, you can use the following command:

docker container ls

Now that our server is running replicated, the service will make sure to always restart the container if it crashes, and it can offer access to the healthy containers throughout the process.

If you'd like to adjust the number of replicas of a service, you can use the command below.

docker service scale <name_of_service>=<number_of_replicas>

For example:

docker service scale myawesomeservice=5

You're able to run as many replicas as you'd like, just as simple as that.

Isn't that awesome? Let's look at one last example and see how we would approach replication in Kubernetes.

Replication in Kubernetes

It's hard to skip Kubernetes in a discussion about orchestration. It's the gold standard when it comes to orchestration, and rightfully so.

I think Kubernetes has a much steeper learning curve than Swarm, so if you're just getting started with containers, I'd suggest picking up Swarm first. That said, it doesn't hurt to have a general understanding of how this would work in the world of K8s.

If you don't feel like installing minikube or you don't want to fiddle with cloud providers, there's an easy option to dabble in Kubernetes for a bit, by using the Play with Kubernetes online tool. It gives you a 4-hour session which should be more than enough for small experiments.

To be able to follow this exercise, please make sure that you have created a Docker Hub account and pushed the Docker image to your repo!

We're going to create two components by creating two .yml configuration files:

  • A Cluster IP Service: this is going to open up a port for us to communicate with our app.
  • A Deployment, which is sort of like a service in Docker Swarm, with a few more bells and whistles.

Let's first start with the ClusterIP. Create a cluster-ip.yml file and paste the following content into it.

# cluster-ip.yml

apiVersion: v1
kind: Service
metadata:
  name: cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: server
  ports:
    - port: 3000
      targetPort: 3000

Let's create a Deployment as well. Within a deployment.yml file, you can paste the following content.

# deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: server
  template:
    metadata:
      labels:
        component: server
    spec:
      containers:
        - name: server
          image: your_docker_user/your_image
          ports:
            - containerPort: 3000

Make sure you substitute your_docker_user/your_image with your actual user and image name, and that the image is hosted in your Docker Hub repo.

Now that we have these two files ready, all we need to do to spin this up is to execute the command below. Make sure you're executing it in the directory that contains the files.

kubectl apply -f .

You can now check if your server is up and running by listing the deployments and services.

kubectl get deployments
kubectl get services

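If everything worked out according to plan, you should be able to reach your application at the IP and port listed for your cluster-ip-service. Keep in mind that a ClusterIP address is only reachable from inside the cluster, so try it with curl from one of the cluster nodes (the Play with Kubernetes terminal, for example). From your own machine, you can forward a local port with kubectl port-forward service/cluster-ip-service 3000:3000 and then open localhost:3000 in your browser.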

To see the replicas that have been created, you can use the following command:

kubectl get pods

The pods listed should correspond to the number of replicas you specified in your deployment.yml file.
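The reason the numbers match is that a Deployment describes a desired state, and the cluster keeps comparing it with what is actually running. The sketch below models that comparison as a plain function. It is a simplified illustration only; the reconcile name and the shape of the pod objects are hypothetical and do not mirror Kubernetes internals.

```javascript
// Compares the desired replica count with the pods that are actually running
// and reports how many pods would need to be started or stopped.
function reconcile(desiredReplicas, pods) {
  const running = pods.filter((pod) => pod.status === 'Running').length;
  const diff = desiredReplicas - running;
  return {
    running,
    toStart: Math.max(0, diff),
    toStop: Math.max(0, -diff),
  };
}

// Hypothetical snapshot: three replicas wanted, one pod has failed.
const pods = [
  { name: 'server-a', status: 'Running' },
  { name: 'server-b', status: 'Running' },
  { name: 'server-c', status: 'Failed' },
];

console.log(reconcile(3, pods));
// → { running: 2, toStart: 1, toStop: 0 }
```

Running this kind of check over and over is what turns "restart it when it crashes" into self-healing: a crashed pod simply shows up as a difference between desired and actual, and the difference gets corrected.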

To clean up all the components, you can simply execute:

kubectl delete -f .

And just like that, we learned about Replication within Kubernetes as well.


So, we have an application that recovers and is highly available. Is that all there is to it?

Not at all. In fact, now that your app doesn't "go down", how do you know what issues it might be having?

By looking at the logs? Be honest. If your app is up every time you check the endpoint, you'll probably check the logs about two times per year. There's more interesting stuff to look at on social media.

So, to make sure your app is improving, you'll have to start thinking about monitoring, error handling, and error propagation. You'll have to make sure that you're aware of issues as they arise, and you're able to fix them even if they don't keep your server down.

That's a topic for another time, though. I hope you enjoyed this article and that it shed some light on the approaches you could use to enable recovery for your Node.js application.

P.S. If you liked this post, subscribe to our new JavaScript Sorcery list for a monthly deep dive into more magical JavaScript tips and tricks.

P.P.S. If you'd love an all-in-one APM for Node.js or you're already familiar with AppSignal, go and check out AppSignal for Node.js.

Daydreaming about APIs and imagining web services, our guest author Andrei is a solutions architect by day and the co-founder of Boardme by night. When he's not typing frantically in a terminal, he's exploring nature, pretending to draw, and supplying bystanders with unsolicited gym advice.
