We recently released uptime monitoring, a pretty big addition to our feature set. Our customers have often requested it, and it was a logical next step for us.
In today’s post, we’ll explain how we went from considering uptime monitoring impossible to build to building it in a week. We’ll break down how seeming over-engineering can really pay off in the end.
We aim to bring all your crucial monitoring needs together in one application - as cleanly designed and easy to use as single-use products but as powerful as broader tools made for enterprises.
We always felt uptime monitoring would be a great addition to our existing tools. Customers frequently asked for it, and we considered it a logical extension of what we offer. Even so, for years, we only dreamt of building and launching uptime monitoring. It seemed like an impossible task.
When we briefly thought about uptime monitoring in previous years, we quickly shelved the idea of actually building it - because it would take not just one but four large engineering projects as ingredients to bake this cake:
The issue: At first, we didn’t have a public endpoint for writing errors, performance samples, and metrics. We only had our extension and agents to collect data.
The solution: In 2019, we launched our front-end error tracking. As a part of that process, we set up a more public API where customers could push data directly, without having to use our integration or agent.
The issue: Before 2019, uptime monitoring would have needed different graphs from the ones we were using.
The solution: In 2019, we made it possible to build your own dashboards using the same tech as pre-configured dashboards. This made it much easier to create new dashboards or graphs for customers, but also for ourselves.
The issue: AppSignal didn’t have alerts or custom metrics, so we needed to build a pipeline for this new set of alert types. It would have been an enormous project to create a new set of metrics with specific alerts for uptime monitoring only.
The solution: We built custom metrics and anomaly detection so customers could add their own metrics and set alerts on them. We laid down new foundations to build on.
We were perhaps over-engineering before (well, let’s call it super-engineering, so we don’t pass judgment on it).

Because we super-engineered these three ingredients, we had building blocks robust enough to support other features, like uptime monitoring.
Under the surface, uptime monitoring heavily uses custom metrics and our anomaly detection system. This has the added benefit that customers can graph the uptime monitor metrics on any dashboard they’d like.
While the uptime monitor dashboard would use only one metric graph, most of the other data would also be provided by our metric system and time series API.
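To give a feel for how one-gauge-per-minute values can drive a dashboard graph, here is a minimal sketch in plain Ruby. It is purely illustrative - the names and data shapes are our own invention for this post, not AppSignal’s actual metric system - but it shows the idea of aggregating per-minute up/down samples into an uptime percentage for a time window:

```ruby
# Each sample is one gauge value per minute: 1 = up, 0 = down.
# This mirrors the metric system's core assumption of one value per minute.
Sample = Struct.new(:minute, :up)

# Aggregate per-minute samples into an uptime percentage for a graph.
def uptime_percentage(samples)
  return 0.0 if samples.empty?

  up_count = samples.count { |s| s.up == 1 }
  (up_count.to_f / samples.size * 100).round(2)
end

samples = [
  Sample.new(0, 1), Sample.new(1, 1), Sample.new(2, 0), Sample.new(3, 1)
]
# 3 of the 4 minutes were up, so this window graphs as 75.0% uptime.
```

A time series API can then serve windows of these aggregates to whichever dashboard the customer drops the metric on.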
With these fundamental ingredients in place, we were still missing one key aspect for uptime monitoring: pull vs. push. Our entire stack is built on the assumption that customers push data to us (either via our push API or our public endpoint).
Uptime monitoring is the reverse: we have to pull data from our customers in order to check whether their websites are up or down. This requires different infrastructure (e.g., how do you ensure data is pulled every minute? How do you handle errors?).
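To make those two questions concrete, here is a hedged sketch in plain Ruby (the names and structure are illustrative, not our production code): it aligns checks to minute boundaries and converts any error into a "down" result, so one failing site never crashes the polling loop:

```ruby
require "timeout"

# Seconds to sleep so the next check fires on a minute boundary,
# keeping the "one value per minute" assumption intact.
def seconds_until_next_minute(now = Time.now)
  60 - (now.to_i % 60)
end

# Run a check, mapping timeouts and refused connections to a "down"
# result instead of letting the exception escape the poller.
def safe_check(&check)
  { up: check.call == true }
rescue StandardError => e
  { up: false, error: e.class.name }
end

# A polling loop would then look roughly like:
#   loop do
#     sleep seconds_until_next_minute
#     result = safe_check { http_ping(url) } # http_ping is hypothetical
#     record_metric(result)
#   end
```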
And that meant that, in our eyes, uptime monitoring was still a distant dream. Until the unthinkable happened…
Almost every Friday, we get together to discuss possible new features and ‘shape’ new projects.
Some of these meetings are structured according to the ‘Disney brainstorm’ style: you brainstorm first like a dreamer, then like a realist, and finally like a pessimist. This way, natural pessimists are moved away from nay-saying, while natural dreamers are nudged in the opposite direction.
When in dreamer mode, the team kept challenging each other to imagine building uptime monitoring. One of the developers took a super pragmatic view and thought about building it with all the fundamentals we had in place. To make this work, uptime monitoring would stick to some of the core assumptions in custom metrics (e.g., one value every minute and no other time intervals).
The only problem was that our infrastructure ingests data that is sent to us - but uptime monitoring has to send requests out. When two of the developers discussed this, they found a way to solve it: place a serverless function (AWS Lambda) in between that sends HTTP requests to the URL that needs monitoring at set times. This serverless function would then act like the AppSignal agent that collects data from a customer’s backend in the usual setup, sending the data to be ingested just like backend data would be.
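A sketch of what such a serverless check could look like in Ruby, using only the standard library (the function names and payload shape are hypothetical, invented for this post - the real implementation is not shown here): it fetches the URL, times the request, and builds a per-minute metric payload of the kind the ingestion pipeline already understands:

```ruby
require "net/http"
require "uri"

# Turn one HTTP check result into a metric payload shaped like the
# data an agent normally pushes (hypothetical shape for illustration).
def build_payload(monitor_id, status_code, duration_ms)
  {
    monitor_id: monitor_id,
    up: status_code.between?(200, 399) ? 1 : 0, # gauge: 1 = up, 0 = down
    status_code: status_code,
    response_time_ms: duration_ms
  }
end

# The Lambda-style handler: request the URL, measure how long it took,
# then the payload would be POSTed to the public metrics endpoint
# (the push step is omitted here).
def check_uptime(monitor_id, url)
  uri = URI.parse(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.get_response(uri)
  duration_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
  build_payload(monitor_id, response.code.to_i, duration_ms)
end
```

From there, custom metrics and anomaly detection take over, exactly as they would for pushed backend data.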
Testing the concept of an AWS Lambda in between got the team so excited that they had a proof of concept working within a few days.
Using custom metrics, custom dashboards, and anomaly detection as building blocks, they made a new set of metrics, visualization, and alerts.
It actually was more than just a proof of concept. What was built in those few days is what we released as an invite-only beta to customers who had asked about uptime monitoring before.
Once we sent out the invite, customers loved it! Many set it up and some asked for additional features too.
One customer, Maciej, said:
“Uptime monitoring works great! I have one suggestion: it would be very useful if alerts could be ‘inspected’ to see HTTP code and the response that was (sic) returned.”
Another customer, Henry, asked for the same feature. This was very doable, so a few days later, we released it, and Henry said:
“I tried it out and it works just as we need. Thank you so much!”
Over the next few days, based on your requests, we also added the option to add headers to the HTTP request, so you can test APIs that require credentials. It was amazing to see so many customers using uptime monitoring, reaching out with positive feedback and requests for more features!
Make no mistake: this is not a case of 10x magical developers working through the night and all those other myths. If anything, it is the reverse: a decade of steady, thorough work that paid off.
To see the result in action, check out our uptime monitoring here.