It’s important to be able to look at the entirety
of your application architecture, not just specific aspects of it, and understand
how different parts connect. Observability comes first; optimization follows.
In this post, we’ll dive into the database part of your architecture to
show how you can monitor and optimize your database performance. Most of
the following principles apply regardless of your stack, whether you
want to monitor and optimize performance for PostgreSQL, MongoDB, or any
other database.
Why Monitor Database Performance?
Let’s say a query is very slow and your app has to wait for its results,
causing things to grind to a halt for your end users and/or in part of
your stack. As in a traffic jam, one part may end up standing still as an
effect of something taking place elsewhere. If a lot of queries take too
long to resolve, it might be because other queries are getting in the way.
Your goal is to find out where an issue is coming from before things
start going wrong in other parts of your app.
The next step is to proactively use monitoring to optimize your
architecture and see how you can make things faster. But we are getting
ahead of ourselves - let’s start with step 1:
Step 1: Measuring AKA Instrumenting Database Performance
Database monitoring begins with making sure you measure the right
metrics, for example:
- the mean and 90th percentile of query durations
- the total throughput, which gives you another way to set triggers and flag when throughput becomes an issue
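As a rough sketch, here is how those metrics could be derived from a window of raw query durations. The sample data, time window, and helper names are invented for illustration; an APM computes these for you.

```javascript
// Durations (in ms) collected for one query over a 60-second window.
// The two outliers (250, 300) show why the mean alone can mislead.
const durationsMs = [12, 15, 11, 14, 250, 13, 16, 12, 300, 14];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the value below which roughly p percent
// of the samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const windowSeconds = 60;
const metrics = {
  mean: mean(durationsMs),                        // pulled up by the outliers
  p90: percentile(durationsMs, 90),               // what the slowest 10% see
  throughput: durationsMs.length / windowSeconds, // queries per second
};
console.log(metrics);
```

Note how the 90th percentile (250 ms) tells a very different story than the mean (65.7 ms) for the same query.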
Luckily, the work of collecting that data has already been done for you: your APM
software likely lets you configure what you want to instrument and
measure. If you use AppSignal, collecting and aggregating the data
happens automatically, without much setup. We’ve done the heavy lifting
for you.
Step 2: A Starting Point - Find High-impact Queries
Once you have your app instrumented and real-world data starts coming
through, you’ll see patterns emerge. On AppSignal, we’ve built screens
for the slowest API requests and the slowest queries under ‘Improve’ in
the navigation.
This screen will give you a great starting point for optimization. By
default, slow queries are sorted by impact: the throughput multiplied by
the mean query time. These are the queries your app runs often and that
take a noticeable amount of time. You can also sort by throughput or by
mean duration.
Toggling between these sortings will give you an idea of the queries
with the most volume, the ones that are the slowest (but are perhaps
once-a-night background jobs that you can easily ignore), and those that
have the most impact.
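To make the impact sort concrete, here is a minimal sketch. The query names and numbers are made up; only the formula (impact = throughput x mean query time) comes from the text above.

```javascript
// Hypothetical per-query stats, as an APM might aggregate them:
const queries = [
  { name: "SELECT ... FROM orders",   throughputPerHour: 9000,  meanMs: 40 },
  { name: "SELECT ... FROM reports",  throughputPerHour: 1,     meanMs: 60000 }, // nightly job
  { name: "SELECT ... FROM sessions", throughputPerHour: 50000, meanMs: 2 },
];

// impact = throughput x mean query time
const byImpact = [...queries]
  .map((q) => ({ ...q, impact: q.throughputPerHour * q.meanMs }))
  .sort((a, b) => b.impact - a.impact);

console.log(byImpact.map((q) => q.name));
```

The moderately slow but high-volume `orders` query tops the list, while the very slow once-a-night `reports` job ranks last - exactly the distinction the impact sort is meant to surface.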
A Quick Side-step: Fix Some Individual Queries
Now you can pick some queries to dive into and fix.
If you click on a slow query, you will see all of its details on the
right, from the query’s throughput and response time to a sample of the
query itself.
Hover over the response time on a peak, and you can then zoom in by
clicking on the ‘what happened here’ option. You’ll see what errors
happened at that point in time, in that sample, and what else was going
on in your app at that moment.
We’ve made one particular anti-pattern very visible: N+1. The N+1
anti-pattern happens when a separate query is executed for every result
of an initial query.
The total query count is N + 1, with N being the number of results of
that initial query. If the initial query has one result, N + 1 = 2
queries. If it has 1,000 results, N + 1 = 1,001 queries. Boom.
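The counting above can be sketched in a few lines. The data-access functions are hypothetical stand-ins for what an ORM would do under the hood:

```javascript
let queryCount = 0;

// The initial query: one round-trip, N results.
function fetchPosts() {
  queryCount++;
  return [{ id: 1 }, { id: 2 }, { id: 3 }]; // N = 3
}

// The anti-pattern: one extra round-trip per result.
function fetchCommentsForPost(postId) {
  queryCount++;
  return [`comment for post ${postId}`];
}

const posts = fetchPosts();
for (const post of posts) {
  fetchCommentsForPost(post.id);
}

console.log(queryCount); // 4 queries: N + 1 with N = 3
```

With 1,000 posts the same loop would issue 1,001 queries, which is why this pattern is worth flagging automatically.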
If queries follow the N+1 anti-pattern, you’re in luck - they are
labeled here so you can quickly find them:
We won’t go into how to solve them here - we’ve written a separate
blog post on N+1 queries.
Alright, now let’s assume we’ve fixed some of the worst offenders and,
as we fixed them, noted when they happened and what the worst response
times were.
Step 3: Deciding on Patterns - What Is Wrong?
Monitoring is often about looking at when things go wrong. But part of
the complexity around monitoring is defining what should be considered
'wrong’ for your setup. Also, would you rather be alerted by an issue
that then solves itself or miss one that doesn’t?
Let’s look at the throughput and duration of your queries on average.
You will be able to set that up on your APM software. If you’re using
AppSignal to monitor Node.js and use PostgreSQL as your database, we
automatically create dashboards with the most important metrics for you.
The node-postgres dashboard will show data like this:
If you’d like to set this up for your app, here’s some further reading.
When you switch between different time ranges, you will see short peaks
that disappear in the average of an hour, as well as peaks that stay
high when you zoom out to a more extended time range.
Having already seen the slowest queries screen, we have an idea of the
main culprits and when they peaked.
Certain short peaks in throughput might be fine - the host might be
running other processes that make a query slow, for example.
You’ll want to zoom in on prolonged query times or peaks in
throughput. Looking at the 95th percentile versus the mean can tell you
if a slower query is being experienced by everyone or only by a small
subset, which can sometimes help point you towards what to fix.
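The mean-versus-p95 comparison can be sketched with two invented samples (durations in ms); the numbers are illustrative only:

```javascript
const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;
// Nearest-rank 95th percentile.
const p95 = (xs) => [...xs].sort((a, b) => a - b)[Math.ceil(0.95 * xs.length) - 1];

// Case 1: everyone experiences ~200ms.
const uniform = Array(100).fill(200);
// Case 2: most requests are fast, but 10% hit a 2000ms query.
const skewed = [...Array(90).fill(20), ...Array(10).fill(2000)];

console.log(mean(uniform), p95(uniform)); // 200 200 -> slowness is universal
console.log(mean(skewed), p95(skewed));   // 218 2000 -> a subset is hit hard
```

When the p95 towers over the mean, dig into what that slow subset has in common (a particular customer, payload size, or query plan) rather than tuning the average case.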
Step 4: What Should Keep You up at Night?
We usually recommend starting on the noisy side of the spectrum: set
some triggers to alert you on a channel that doesn’t wake you up, like
email or Slack, rather than PagerDuty.
Create a custom metrics trigger on the “event_duration” metric and
select the mean field to get an alert on average query time in your app.
Select the 90th or 95th percentile to get alerts on the slowest queries
your app runs.
Then configure the “group” and “namespace” tags for the type of event
you want to be alerted about and in which namespace. For example,
“active_record” (Rails) or “postgres” (Node.js) or “ecto” (Elixir) for
the group, and “web” or “background” for the namespace:
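As a rough mental model, such a trigger boils down to a threshold check on a tagged metric. The object shapes below are hypothetical - in AppSignal you configure triggers in the UI, not in code:

```javascript
// Hypothetical representation of the trigger described above.
const trigger = {
  metric: "event_duration",
  field: "mean",
  tags: { group: "postgres", namespace: "web" },
  thresholdMs: 200,
  channel: "slack", // start noisy-but-quiet; escalate to paging later
};

// Would this aggregated sample fire the trigger?
function shouldAlert(trigger, sample) {
  const tagsMatch = Object.entries(trigger.tags)
    .every(([key, value]) => sample.tags[key] === value);
  return (
    sample.metric === trigger.metric &&
    tagsMatch &&
    sample.fields[trigger.field] > trigger.thresholdMs
  );
}

const sample = {
  metric: "event_duration",
  tags: { group: "postgres", namespace: "web" },
  fields: { mean: 350, p95: 900 },
};
console.log(shouldAlert(trigger, sample)); // true: mean 350ms > 200ms
```

Swapping `field` to `"p95"` and raising the threshold gives you the slowest-queries alert instead of the average-duration one.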
After a week, you’ll start seeing some patterns and get an idea of which
cases require your attention and which cases usually solve themselves.
You can then adjust the triggers to be less noisy and add channels for
when you want to be woken up by specific alerts.
Step 5: Done with Setting Up Database Monitoring?
Now, in theory, we should have a nice monitoring setup: the right data
comes in, we’ve set triggers to warn us before the bits hit the fan, and
we’ve found some big culprits. We should be done, right?
In reality, usage of your app might grow and create new bottlenecks. Or
code changes may hit your database, causing query durations to go up. Or
a particular field with a huge payload will trigger an issue. However
experienced you are, new problems and bottlenecks will occur.
But you can find comfort in the fact that, with the proper setup, you
will be warned and will know how to dive deep into your architecture to
solve issues.
And perhaps you can find joy in improving the big ‘impactful’ queries
before any other screws come loose. It’s what we love best - that, and
stroopwafels.
Try AppSignal, Get Stroopwafels 🍪
Speaking of stroopwafels, we send them all over the world. Give our
AppSignal APM software for Ruby, Elixir and Node.js a try. If you set up a free trial and reach out to us, we’ll send you a box of
stroopwafels on the house.