
Monitoring MongoDB Replicaset Lag with AppSignal

Robert Beekman


At AppSignal we use AppSignal to monitor itself. Recently, we noticed a problem with lag on a MongoDB replica set and used Custom Metrics and Anomaly Detection to track the lag and set up alerts on spikes. We track a lot of such metrics to continuously monitor our own infrastructure.

This may sound silly, but we recently got really happy with this setup again. It was such a good example of our Custom Metrics and alerts that we decided not to be too modest about it this time. We hope you enjoy the explanation, after which we'll slip back into being modest ;-)

How we use MongoDB

To present AppSignal users with relevant data for performance incidents and/or exceptions, we store data related to requests in what we call samples. A sample describes one request or background job that happened in an application.

These samples are stored in MongoDB replica sets that consist of 1 primary and 2 secondary machines. The replica set makes sure we have redundant copies of our data and additional servers to perform read-only queries.
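For context, this is roughly what such a topology could look like in a Mongoid configuration. This is a hypothetical excerpt (the database name and the `secondary_preferred` read mode are assumptions, not taken from the post); the `read` option is one way to route read-only queries to the secondaries:

```yaml
# Hypothetical config/mongoid.yml excerpt; database name and read mode are made up.
production:
  clients:
    default:
      hosts:
        - logentries2-mongo1:27017
        - logentries2-mongo2:27017
        - logentries2-mongo3:27017
      database: appsignal_samples
      options:
        replica_set: logentries2
        read:
          mode: :secondary_preferred
```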

Lag

Recently, we noticed that at times one of the secondaries couldn’t keep up with the primary machine. Even though, in theory, all machines have the same specs, sometimes a machine performs worse for any number of reasons, noisy neighbors for instance.

We are aware that there are dedicated MongoDB monitoring products and even one included in MongoDB itself, but a major goal for us was to have all the relevant data for our entire stack in our own app in order to centralize all dashboards and alerts. Having the ability to quickly create a custom dashboard with related MongoDB and application metrics during an outage proved extremely valuable.

Track All the Things

With our Operations team's motto being “Track all the things”, we wanted to graph the replica set lag for all machines in all the clusters, in order to set alerts on them, so that we would be notified when issues arose.

MongoDB has a command, replSetGetStatus, that displays the state of each machine in a replica set. In the mongo console, you can run rs.status() to get output like this:

```json
logentries2:PRIMARY> rs.status()
{
  "set" : "logentries2",
  "date" : ISODate("2019-02-02T11:37:13.574Z"),
  "myState" : 1,
  "term" : NumberLong(-1),
  "heartbeatIntervalMillis" : NumberLong(2000),
  "optimes" : {
    "lastCommittedOpTime" : {
      "ts" : Timestamp(1549107433, 1461),
      "t" : NumberLong(-1)
    },
    "appliedOpTime" : Timestamp(1549107433, 1484),
    "durableOpTime" : Timestamp(1549107433, 1462)
  },
  "members" : [
    {
      "_id" : 0,
      "name" : "logentries2-mongo1:27017",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 161339,
      "optime" : Timestamp(1549107433, 1484),
      "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
      "electionTime" : Timestamp(1548947860, 1),
      "electionDate" : ISODate("2019-01-31T15:17:40Z"),
      "configVersion" : 4,
      "self" : true
    },
    {
      "_id" : 1,
      "name" : "logentries2-mongo2:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 159235,
      "optime" : Timestamp(1549107432, 223),
      "optimeDurable" : Timestamp(1549107432, 223),
      "optimeDate" : ISODate("2019-02-02T11:37:12Z"),
      "optimeDurableDate" : ISODate("2019-02-02T11:37:12Z"),
      "lastHeartbeat" : ISODate("2019-02-02T11:37:12.367Z"),
      "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:11.693Z"),
      "pingMs" : NumberLong(0),
      "syncingTo" : "logentries2-mongo1:27017",
      "configVersion" : 4
    },
    {
      "_id" : 2,
      "name" : "logentries2-mongo3:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 161327,
      "optime" : Timestamp(1549107433, 709),
      "optimeDurable" : Timestamp(1549107433, 709),
      "optimeDate" : ISODate("2019-02-02T11:37:13Z"),
      "optimeDurableDate" : ISODate("2019-02-02T11:37:13Z"),
      "lastHeartbeat" : ISODate("2019-02-02T11:37:13.312Z"),
      "lastHeartbeatRecv" : ISODate("2019-02-02T11:37:12.112Z"),
      "pingMs" : NumberLong(1),
      "syncingTo" : "logentries2-mongo1:27017",
      "configVersion" : 4
    }
  ],
  "ok" : 1
}
```

We can use this information to track how much the secondary machines are behind the primary one.
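The calculation itself is simple: subtract each member's optime from the primary's. Here is a minimal, self-contained sketch using only the Unix-seconds portion of the optime values from the output above (the real status document wraps these values in BSON timestamps):

```ruby
# Compute each member's lag in milliseconds from a parsed replSetGetStatus
# document. The optime values here are the seconds from the rs.status()
# output above, pre-extracted for illustration.
status = {
  "members" => [
    { "name" => "logentries2-mongo1:27017", "stateStr" => "PRIMARY",   "optime" => 1_549_107_433 },
    { "name" => "logentries2-mongo2:27017", "stateStr" => "SECONDARY", "optime" => 1_549_107_432 },
    { "name" => "logentries2-mongo3:27017", "stateStr" => "SECONDARY", "optime" => 1_549_107_433 }
  ]
}

# The primary is the reference point for the lag of every member.
primary = status["members"].find { |member| member["stateStr"] == "PRIMARY" }

lags = status["members"].map do |member|
  [member["name"], (primary["optime"] - member["optime"]) * 1000]
end.to_h
# lags => {"logentries2-mongo1:27017"=>0, "logentries2-mongo2:27017"=>1000, "logentries2-mongo3:27017"=>0}
```

In this snapshot, mongo2 is one second (1000 ms) behind the primary and mongo3 is fully caught up.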

Tracking the Lag in AppSignal

Since we use Mongoid on Ruby, we can ask the Mongoid client for this status. Because we have multiple replica sets, we loop over the clients and fetch the status for each replica set and server.

```ruby
class ReplicasetLagMonitor
  attr_reader :status

  def initialize(client)
    @status = client
      .with(:database => "admin", :user => "admin-readonly", :password => ENV["MONGOID_MONITORING_PASSWORD"])
      .command(:replSetGetStatus => 1)
      .first
  end

  def primary
    status["members"].find { |member| member["stateStr"] == "PRIMARY" }
  end

  # Depending on the version of MongoDB it either returns a `BSON::Timestamp` directly
  # or it returns the `BSON::Timestamp` in a `ts` object.
  def optime(member)
    if member["optime"].is_a?(BSON::Timestamp)
      member["optime"].seconds
    else
      member["optime"]["ts"].seconds
    end
  end

  def hostname(member)
    member["name"].split(":").first
  end

  def monitor(member)
    lag = (optime(primary) - optime(member)) * 1000 # Convert to milliseconds
    Appsignal.set_gauge(
      "replicaset_lag",
      lag,
      {:replicaset => status["set"], :hostname => hostname(member)}
    )
  end

  def monitor_members!
    status["members"].each do |member|
      monitor(member)
    end
  end

  def self.monitor_all
    Mongoid.clients.keys.each do |key|
      begin
        client = Mongoid.client(key)
        new(client).monitor_members!
        client.close
      rescue => error
        Appsignal.send_error(error)
      end
    end
  end
end
```

The code above sends us the current replica set lag in milliseconds each time we run the script (we run it twice a minute).
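The post doesn't show how the script is scheduled. One hypothetical way to run it twice a minute is a small clock process, sketched here with the clockwork gem (the job name and file path are made up):

```ruby
# Hypothetical config/clock.rb, run as a separate process
# with `clockwork config/clock.rb`.
require "clockwork"
require_relative "../app/models/replicaset_lag_monitor" # assumed path

module Clockwork
  # Report replica set lag every 30 seconds (twice a minute).
  every(30.seconds, "mongodb.replicaset_lag") do
    ReplicasetLagMonitor.monitor_all
  end
end
```

A dedicated clock process avoids cron's one-minute floor on intervals; any other sub-minute scheduler would work just as well.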

With the following Custom Metrics configuration, we can render a graph with the lag of each machine.

```yaml
- title: MongoDB
  graphs:
    - title: "Replica set Lag"
      line_label: "%replicaset% - %hostname%"
      kind: gauge
      format: duration
      fields:
        - replicaset_lag
      headerType: legacy
      tags:
        replicaset: "*"
        hostname: "*"
```

We set the line_label with two variables that will be populated from the tags. To create a graph for all the machines, we use a wildcard on both the replica set name and hostname.

The configuration will render the following graph:

Replica set lag graph

As you can see, we had a little bit of lag on one of our machines.

Alerts

Now that we have the replica set lag in our system, we can use our Anomaly Detection to send us alerts when one of the secondaries starts to lag behind the primary.

In our case, we set two alerts: one that sends a message to a Slack channel and an email when the lag exceeds 5 minutes, and another that triggers PagerDuty when the lag exceeds 10 minutes.

We use the description field to explain why the alert was triggered and what actions should be taken to resolve the issue.

Replica set lag graph

Conclusion

We’re very keen on dogfooding our own product. Custom Metrics together with Anomaly Detection give us the means to track and alert on almost everything in our stack, making AppSignal a one-stop shop for both our product and the infrastructure surrounding it.

Set up your own Custom Metrics and Anomaly Detection with the help of our documentation. You can also contact one of our engineers to help you set it up.

Robert Beekman

As a co-founder, Robert wrote our very first commit. He's also our support role-model and knows all about the tiny details in code. Travels and photographs (at the same time).

AppSignal monitors your apps

AppSignal provides insights for Ruby, Rails, Elixir, Phoenix, Node.js, Express and many other frameworks and libraries. We are located in beautiful Amsterdam. We love stroopwafels. If you do too, let us know. We might send you some!
