
At AppSignal we use AppSignal to monitor itself. Recently, we noticed a problem with lag on a MongoDB replica set and used Custom Metrics and Anomaly Detection to track the lag and set up alerts on spikes. We track a lot of such metrics to continuously monitor our own infrastructure.
This may sound silly, but recently we got really happy again about this setup. We decided not to be too modest about it this time because it was such a great example of our custom metrics and alerts and we got really excited about it. We hope you enjoy the explanation and we will then slip back into being modest again ;-)
How we use MongoDB
To present AppSignal users with relevant data for performance incidents and/or exceptions, we store data related to requests in what we call samples. A sample describes one request or background job that happened in an application.
These samples are stored in MongoDB replica sets that consist of 1 primary and 2 secondary machines. The replica set makes sure we have redundant copies of our data and additional servers to perform read-only queries.
Lag
Recently, we noticed that at times one of the secondaries couldn’t keep up with the primary machine. Even though, in theory, all machines have the same specs, a machine can still perform worse than the others for a number of reasons, such as noisy neighbors.
We are aware that there are dedicated MongoDB monitoring products and even one included in MongoDB itself, but a major goal for us was to have all the relevant data for our entire stack in our own app in order to centralize all dashboards and alerts. Having the ability to quickly create a custom dashboard with related MongoDB and application metrics during an outage proved extremely valuable.
Track All the Things
With our Operations team's motto being “Track all the things”, we wanted to graph the replica set lag for all machines in all the clusters, in order to set alerts on them, so that we would be notified when issues arose.
MongoDB has a command called replSetGetStatus that displays the state of each machine in a replica set. On the mongo console you can type rs.status() to get output like this:
    logentries2:PRIMARY> rs.status()
    {
      "set" : "logentries2",
      "date" : ISODate("2019-02-02T11:37:13.574Z"),
      "myState" : 1,
      "term" : NumberLong(-1),
      "heartbeatIntervalMillis" : NumberLong(2000),
      "optimes" : {
        "lastCommittedOpTime" : {
          "ts" : Timestamp(1549107433, 1461),
          "t" : NumberLong(-1)
        },
        "appliedOpTime" : Timestamp(1549107433, 1484),
        "durableOpTime" : Timestamp(1549107433, 1462)
      },
      "members" : [
        {
          "_id" : 0,
          "name" : "logentries2-mongo1:27017",
          "health" : 1,
          "state" : 1,
          "stateStr" : "PRIMARY",
          "uptime" : 161339,
          "optime" : Timestamp(1549107433, 1484),
          "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
          "electionTime" : Timestamp(1548947860, 1),
          "electionDate" : ISODate("2019-01-31T15:17:40Z"),
          "configVersion" : 4,
          "self" : true
        },
        {
          "_id" : 1,
          "name" : "logentries2-mongo2:27017",
          "health" : 1,
          "state" : 2,
          "stateStr" : "SECONDARY",
          "uptime" : 159235,
          "optime" : Timestamp(1549107432, 223),
          "optimeDurable" : Timestamp(1549107432, 223),
          "optimeDate" : ISODate("2019-02-02T11:37:12Z"),
          "optimeDurableDate" : ISODate("2019-02-02T11:37:12Z"),
          "lastHeartbeat" : ISODate("2019-02-02T11:37:12.367Z"),
          "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:11.693Z"),
          "pingMs" : NumberLong(0),
          "syncingTo" : "logentries2-mongo1:27017",
          "configVersion" : 4
        },
        {
          "_id" : 2,
          "name" : "logentries2-mongo3:27017",
          "health" : 1,
          "state" : 2,
          "stateStr" : "SECONDARY",
          "uptime" : 161327,
          "optime" : Timestamp(1549107433, 709),
          "optimeDurable" : Timestamp(1549107433, 709),
          "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
          "optimeDurableDate" : ISODate("2019-02-02T11:37:13Z"),
          "lastHeartbeat" : ISODate("2019-02-02T11:37:13.312Z"),
          "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:12.112Z"),
          "pingMs" : NumberLong(1),
          "syncingTo" : "logentries2-mongo1:27017",
          "configVersion" : 4
        }
      ],
      "ok" : 1
    }
We can use this information to track how much the secondary machines are behind the primary one.
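To make that concrete, here is a minimal sketch of the arithmetic, using the optime values from the rs.status() output above. The `Timestamp` struct is only a stand-in for `BSON::Timestamp` so the snippet runs without any gems, and `lag_in_seconds` is a hypothetical helper name, not part of MongoDB or Mongoid.

```ruby
# Stand-in for BSON::Timestamp (assumption: only `seconds` matters for the lag).
Timestamp = Struct.new(:seconds, :increment)

# Optime values copied from the rs.status() output.
MEMBERS = [
  { "name" => "logentries2-mongo1:27017", "stateStr" => "PRIMARY",   "optime" => Timestamp.new(1549107433, 1484) },
  { "name" => "logentries2-mongo2:27017", "stateStr" => "SECONDARY", "optime" => Timestamp.new(1549107432, 223) },
  { "name" => "logentries2-mongo3:27017", "stateStr" => "SECONDARY", "optime" => Timestamp.new(1549107433, 709) }
].freeze

# Hypothetical helper: returns a hash of member name => seconds behind the primary.
def lag_in_seconds(members)
  primary = members.find { |member| member["stateStr"] == "PRIMARY" }
  members.map do |member|
    [member["name"], primary["optime"].seconds - member["optime"].seconds]
  end.to_h
end

lag_in_seconds(MEMBERS).each do |name, lag|
  puts "#{name}: #{lag}s behind the primary"
end
```

In this snapshot logentries2-mongo2 is one second behind the primary, while logentries2-mongo3 has applied operations from the same second as the primary and so shows no lag at this resolution.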
Tracking the Lag in AppSignal
Since we use Mongoid on Ruby, we can ask the Mongoid client for this status. Because we have multiple replica sets, we loop over the clients and get the status for each replica set and server.
    class ReplicasetLagMonitor
      attr_reader :status

      def initialize(client)
        @status = client
          .with(:database => "admin", :user => "admin-readonly", :password => ENV["MONGOID_MONITORING_PASSWORD"])
          .command(:replSetGetStatus => 1)
          .first
      end

      def primary
        status["members"].find { |member| member["stateStr"] == "PRIMARY" }
      end

      # Depending on the version of MongoDB it either returns a `BSON::Timestamp` directly
      # or it returns the `BSON::Timestamp` in a `ts` object.
      def optime(member)
        if member["optime"].is_a?(BSON::Timestamp)
          member["optime"].seconds
        else
          member["optime"]["ts"].seconds
        end
      end

      def hostname(member)
        member["name"].split(":").first
      end

      def monitor(member)
        lag = (optime(primary) - optime(member)) * 1000 # Convert to milliseconds
        Appsignal.set_gauge(
          "replicaset_lag",
          lag,
          {:replicaset => status["set"], :hostname => hostname(member)}
        )
      end

      def monitor_members!
        status["members"].each do |member|
          monitor(member)
        end
      end

      def self.monitor_all
        Mongoid.clients.keys.each do |key|
          begin
            client = Mongoid.client(key)
            new(client).monitor_members!
            client.close
          rescue => error
            Appsignal.send_error(error)
          end
        end
      end
    end
The code above sends us the current replica set lag in milliseconds each time we run the script (we run it twice a minute).
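How the script gets triggered twice a minute is not covered here. One possible approach, sketched below under assumptions, is to start it once a minute from a scheduler and have it run twice with a 30-second pause. `run_periodically` and its `sleeper` parameter are hypothetical names introduced for illustration; the injectable sleeper only exists so the example can be run without actually waiting.

```ruby
# Hypothetical helper: yields `runs` times, pausing `interval` seconds between runs.
# The pause is skipped after the final run so the process can exit promptly.
def run_periodically(interval:, runs:, sleeper: ->(seconds) { sleep(seconds) })
  runs.times do |index|
    yield
    sleeper.call(interval) unless index == runs - 1
  end
end

# Demo with a fake sleeper, so nothing really sleeps here.
fake_sleeper = ->(seconds) { puts "would sleep #{seconds}s" }
run_periodically(interval: 30, runs: 2, sleeper: fake_sleeper) do
  puts "ReplicasetLagMonitor.monitor_all would run here"
end
```

In a real setup the block would call `ReplicasetLagMonitor.monitor_all` and the default sleeper would be used.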
With the following Custom Metrics configuration, we can render a graph with the lag of each machine.
    - title: MongoDB
      graphs:
        - title: "Replica set Lag"
          line_label: "%replicaset% - %hostname%"
          kind: gauge
          format: duration
          fields:
            - replicaset_lag
          headerType: legacy
          tags:
            replicaset: "*"
            hostname: "*"
We set the line_label with two variables that will be populated from the tags. To create a graph for all the machines, we use a wildcard on both the replica set name and hostname.
The configuration will render the following graph:

As you can see, we had a little bit of lag on one of our machines.
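To illustrate how a line_label template and a set of tags combine into the label of a single line, here is a small sketch. It is not AppSignal's implementation; `render_line_label` is a hypothetical helper that simply replaces each `%name%` placeholder with the tag of the same name.

```ruby
# Hypothetical helper (not AppSignal's code): fills %tag% placeholders from a tags hash.
# Raises KeyError if the template references a tag that was not sent with the gauge.
def render_line_label(template, tags)
  template.gsub(/%(\w+)%/) { tags.fetch(Regexp.last_match(1).to_sym).to_s }
end

tags = { :replicaset => "logentries2", :hostname => "logentries2-mongo2" }
puts render_line_label("%replicaset% - %hostname%", tags)
```

With the tags sent by the monitor for logentries2-mongo2, the label becomes "logentries2 - logentries2-mongo2", so each machine in each replica set gets its own clearly named line.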
Alerts
Now that we have the replica set lag in our system, we can use our Anomaly Detection to send us alerts when one of the secondaries starts to lag behind the primary.
In our case, we set two alerts: one that sends a message to a Slack channel and an email when the lag is greater than 5 minutes, and another one that triggers PagerDuty when the lag is greater than 10 minutes.
We use the description field to explain why the alert was triggered and what actions have to be taken to solve the issue.
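Because the gauge is reported in milliseconds, the two thresholds translate to 300,000 ms and 600,000 ms. The sketch below only illustrates how the two alerts overlap; it is not how Anomaly Detection is configured, and the constant and method names are hypothetical.

```ruby
# Thresholds from the two alerts, expressed in the gauge's unit (milliseconds).
WARNING_THRESHOLD_MS  = 5 * 60 * 1000   # 5 minutes
CRITICAL_THRESHOLD_MS = 10 * 60 * 1000  # 10 minutes

# Hypothetical helper: which channels would be notified for a given lag.
# A lag above 10 minutes is also above 5 minutes, so both alerts are active.
def notifiers_for(lag_ms)
  if lag_ms > CRITICAL_THRESHOLD_MS
    [:slack, :email, :pagerduty]
  elsif lag_ms > WARNING_THRESHOLD_MS
    [:slack, :email]
  else
    []
  end
end

[1_000, 360_000, 720_000].each do |lag_ms|
  puts "#{lag_ms} ms -> #{notifiers_for(lag_ms).inspect}"
end
```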

Conclusion
We’re very keen on dogfooding our own product. Custom Metrics together with Anomaly Detection give us the means to track and alert on almost everything in our stack, making AppSignal a one-stop shop for both our product and the infrastructure surrounding it.
Set up your own Custom Metrics and Anomaly Detection with the help of our documentation. You can also contact one of our engineers to help you set it up.

Robert Beekman
As a co-founder, Robert wrote our very first commit. He's also our support role-model and knows all about the tiny details in code. Travels and photographs (at the same time).
