
# Monitoring MongoDB Replicaset Lag with AppSignal

Robert Beekman

At AppSignal we use AppSignal to monitor itself. Recently, we noticed a problem with lag on a MongoDB replica set and used Custom Metrics and Anomaly Detection to track the lag and set up alerts on spikes. We track a lot of such metrics to continuously monitor our own infrastructure.

This may sound silly, but this setup recently made us really happy again. It was such a good example of our Custom Metrics and alerts in action that we decided not to be too modest about it this time. We hope you enjoy the explanation, after which we will slip back into being modest again ;-)

## How we use MongoDB

To present AppSignal users with relevant data for performance incidents and/or exceptions, we store data related to requests in what we call samples. A sample describes one request or background job that happened in an application.

These samples are stored in MongoDB replica sets that consist of 1 primary and 2 secondary machines. The replica set makes sure we have redundant copies of our data and additional servers to perform read-only queries.

## Lag

Recently, we noticed that at times one of the secondaries couldn’t keep up with the primary machine. Even though, in theory, all machines have the same specs, sometimes you find that a machine performs worse for a number of reasons, like noisy neighbors, for instance.

> We are aware that there are dedicated MongoDB monitoring products, and even one included in MongoDB itself, but a major goal for us was to have all the relevant data for our entire stack in our own app, in order to centralize all dashboards and alerts. Having the ability to quickly create a custom dashboard with related MongoDB and application metrics during an outage proved extremely valuable.

## Track All the Things

With our Operations team's motto being "Track all the things", we wanted to graph the replica set lag for every machine in every cluster and set alerts on it, so that we would be notified when issues arose.

MongoDB has a command called `replSetGetStatus` that reports the state of each member of a replica set. In the `mongo` shell, you can run `rs.status()` to get output like this:

```
logentries2:PRIMARY> rs.status()
{
  "set" : "logentries2",
  "date" : ISODate("2019-02-02T11:37:13.574Z"),
  "myState" : 1,
  "term" : NumberLong(-1),
  "heartbeatIntervalMillis" : NumberLong(2000),
  "optimes" : {
    "lastCommittedOpTime" : {
      "ts" : Timestamp(1549107433, 1461),
      "t" : NumberLong(-1)
    },
    "appliedOpTime" : Timestamp(1549107433, 1484),
    "durableOpTime" : Timestamp(1549107433, 1462)
  },
  "members" : [
    {
      "_id" : 0,
      "name" : "logentries2-mongo1:27017",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 161339,
      "optime" : Timestamp(1549107433, 1484),
      "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
      "electionTime" : Timestamp(1548947860, 1),
      "electionDate" : ISODate("2019-01-31T15:17:40Z"),
      "configVersion" : 4,
      "self" : true
    },
    {
      "_id" : 1,
      "name" : "logentries2-mongo2:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 159235,
      "optime" : Timestamp(1549107432, 223),
      "optimeDurable" : Timestamp(1549107432, 223),
      "optimeDate" : ISODate("2019-02-02T11:37:12Z"),
      "optimeDurableDate" : ISODate("2019-02-02T11:37:12Z"),
      "lastHeartbeat" : ISODate("2019-02-02T11:37:12.367Z"),
      "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:11.693Z"),
      "pingMs" : NumberLong(0),
      "syncingTo" : "logentries2-mongo1:27017",
      "configVersion" : 4
    },
    {
      "_id" : 2,
      "name" : "logentries2-mongo3:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 161327,
      "optime" : Timestamp(1549107433, 709),
      "optimeDurable" : Timestamp(1549107433, 709),
      "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
      "optimeDurableDate" : ISODate("2019-02-02T11:37:13Z"),
      "lastHeartbeat" : ISODate("2019-02-02T11:37:13.312Z"),
      "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:12.112Z"),
      "pingMs" : NumberLong(1),
      "syncingTo" : "logentries2-mongo1:27017",
      "configVersion" : 4
    }
  ],
  "ok" : 1
}
```

We can use this information to track how much the secondary machines are behind the primary one.
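The key is the `optime` field of each member: the first argument of each `Timestamp` is a UNIX time in seconds, so the lag is the difference between the primary's value and the secondary's. Using the numbers from the status output above, a quick sketch of the calculation looks like this:

```ruby
# Deriving the lag from the rs.status() output above. The first field of an
# optime Timestamp is a UNIX time in seconds, so the lag of a secondary is
# the primary's value minus the secondary's value.
primary_optime_seconds   = 1549107433 # logentries2-mongo1 (PRIMARY)
secondary_optime_seconds = 1549107432 # logentries2-mongo2 (SECONDARY)

lag_ms = (primary_optime_seconds - secondary_optime_seconds) * 1000
puts lag_ms # => 1000, i.e. mongo2 is one second behind
```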

## Tracking the Lag in AppSignal

Since we use Mongoid in Ruby, we can ask the Mongoid client for this status. Because we have multiple replica sets, we loop over the clients and get the status for each replica set and server.

```ruby
class ReplicasetLagMonitor
  attr_reader :status

  def initialize(client)
    @status = client
      .command(:replSetGetStatus => 1)
      .first
  end

  def primary
    status["members"].find { |member| member["stateStr"] == "PRIMARY" }
  end

  # Depending on the version of MongoDB, it either returns a BSON::Timestamp
  # directly or it returns the BSON::Timestamp in a ts object.
  def optime(member)
    if member["optime"].is_a?(BSON::Timestamp)
      member["optime"].seconds
    else
      member["optime"]["ts"].seconds
    end
  end

  def hostname(member)
    member["name"].split(":").first
  end

  def monitor(member)
    lag = (optime(primary) - optime(member)) * 1000 # Convert seconds to milliseconds

    Appsignal.set_gauge(
      "replicaset_lag",
      lag,
      {:replicaset => status["set"], :hostname => hostname(member)}
    )
  end

  def monitor_members!
    status["members"].each do |member|
      monitor(member)
    end
  end

  def self.monitor_all
    Mongoid.clients.keys.each do |key|
      begin
        client = Mongoid.client(key)
        new(client).monitor_members!
        client.close
      rescue => error
        Appsignal.send_error(error)
      end
    end
  end
end
```

The code above sends us the current replica set lag in milliseconds each time we run the script (we run it twice a minute).
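One way to run something twice a minute is a cron entry that runs, sleeps 30 seconds, and runs again, since cron itself only fires at minute granularity. This is just a sketch; the rake task name `mongo:monitor_lag` (wrapping `ReplicasetLagMonitor.monitor_all`) and the `/app` path are hypothetical:

```shell
# Hypothetical crontab entry: cron fires once a minute, so the second run
# within the same minute is achieved by sleeping 30 seconds in between.
* * * * * cd /app && bundle exec rake mongo:monitor_lag && sleep 30 && bundle exec rake mongo:monitor_lag
```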

With the following Custom Metrics configuration, we can render a graph with the lag of each machine.

```yaml
- title: MongoDB
  graphs:
    - title: "Replica set Lag"
      line_label: "%replicaset% - %hostname%"
      kind: gauge
      format: duration
      fields:
        - replicaset_lag
      tags:
        replicaset: "*"
        hostname: "*"
```

We set the `line_label` with two variables that will be populated from the tags. To create a graph for all the machines, we use a wildcard on both the replica set name and the hostname.

The configuration will render the following graph:

As you can see, we had a little bit of lag on one of our machines.

Now that we have the replica set lag in our system, we can use our Anomaly Detection to send us alerts when one of the secondaries starts to lag behind the primary.

In our case, we set up two alerts: one that sends a message to a Slack channel and an email when the lag is greater than 5 minutes, and another one that triggers PagerDuty when the lag is greater than 10 minutes.
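Conceptually, the two alerts boil down to simple threshold checks on the gauge. This is only a sketch of the logic; in practice, the thresholds are configured in AppSignal's Anomaly Detection rather than in our own code:

```ruby
# Sketch of the alert thresholds described above. The replicaset_lag gauge
# is in milliseconds, so the 5- and 10-minute thresholds are converted to ms.
SLACK_THRESHOLD_MS     = 5 * 60 * 1000
PAGERDUTY_THRESHOLD_MS = 10 * 60 * 1000

def alert_level(lag_ms)
  if lag_ms > PAGERDUTY_THRESHOLD_MS
    :pagerduty        # page the on-call engineer
  elsif lag_ms > SLACK_THRESHOLD_MS
    :slack_and_email  # notify, but don't wake anyone up
  else
    :ok
  end
end
```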

We use the description field to explain why the alert was triggered and what actions have to be taken to resolve the issue.

## Conclusion

We’re very keen on dogfooding our own product. Custom Metrics together with Anomaly Detection give us the means to track and alert on almost everything in our stack, making AppSignal a one-stop shop for both our product and the infrastructure surrounding it.

Set up your own Custom Metrics and Anomaly Detection with the help of our documentation. You can also contact one of our engineers to help you set it up.