This is a short detective story about how we tracked down an exception in AppSignal.com's flow that didn't cause an error...
The Impossible Could Not Have Happened
It all started with a support request. A strange support request. A user reported to us that they didn't have access to an organization after accepting an invitation to the organization.
Looking at the organization in our admin panel, we could see that the user was part of this organization. So they should have been able to see the organization's app on signing in. But the impossible could not have happened, therefore the impossible must be possible in spite of appearances, as Hercule Poirot would have said. And indeed, upon closer inspection, we realized that while the organization listed the user as a member, from the user's perspective in the database, they were not part of the organization.
> user = User.find("user_id_2") > user.organization_ids # => [] > organization = Organization.find_by(:slug => "toms-test-apps") > organization.user_ids # => ["user_id_1", "user_id_2"]
The User <=>
Organization association is a 'has and belongs to many' association managed by Mongoid ORM in our MongoDB database.
Something had gone terribly wrong with the associations. Users were only getting linked from one end of the association and not the other. As we do with such reports, we start looking at our private AppSignal instance that monitors AppSignal.com to see what could have gone wrong. But unfortunately, we didn't find any errors that could tell us more.
This meant that we had to do some old fashioned detective work. We grabbed our magnifying glasses and dove headfirst into the code—using order and method, and the little grey cells.
We have a central location where the invitation accepting logic is located, but multiple actions from where it is called. These actions are: when signing up, when signing in, and clicking the accept invitation link while signed in.
For context, our invitation accepting logic lives in a shared module used by certain controllers. The module exposes an accept_open_invitations
method called from the different flows that allow accepting invitations.
# app/controllers/concerns/invitations.rb module Concerns::Invitations def accept_open_invitations # Invitation id set by the InvitationsController when a user is not signed # in or decides to accept it under a different user account. invitation_id = session[:invitation_id] return unless invitation_id invitation = Invitation.find(invitation_id) invitation.accept_for_user(current_user) end end # app/controllers/users/sessions_controller.rb class Users::SessionsController include Concerns::Invitations def create # Do sign in logic... # After sign _in_, do this custom action accept_open_invitations end end # app/controllers/users/registrations_controller.rb class Users::RegistrationsController include Concerns::Invitations def create # Do sign up logic... # After sign _up_, do this custom action accept_open_invitations end end
In the Dark
Even though we figured the problem must have come from one of the sign in/up scenarios, we couldn't figure out which one it was. The code for accepting invitations was working perfectly fine in our automated and manual tests. So was there something else happening in the controllers after the invitations were accepted? Perhaps an incomplete in-memory state that was persisted to the database after the invitation was accepted, thus undoing the invitation?
Without a reproducible state, we had no idea where the problem could be. How were we going to find the cause of the problem in production? What was causing the perfect storm where accepting invitations only half accepted them?
Since there was no error being raised, we couldn't see the details in AppSignal. Yet. We knew how to detect the problematic association state, so we started tracking them after invitations had been accepted, by reporting an error where previously there was none. This allowed us to use AppSignal's error reporting system using a custom error that we created without raising the error and breaking the flow for users. It all happens silently in the background.
# app/controllers/concerns/invitations.rb module Concerns::Invitations def self.included(base) # Call this method for every controller this concern is included in # after performing an action/request. base.after_action :perform_user_organization_association_check end def accept_open_invitations # Invitation id set by the InvitationsController when a user is not signed # in or decides to accept it under a different user account. invitation_id = session[:invitation_id] return unless invitation_id @perform_user_organization_association_check = true invitation = Invitation.find(invitation_id) invitation.accept_for_user(current_user) end def perform_user_organization_association_check # Only perform the check if there was an invitation accepted during # this request. return unless @perform_user_organization_association_check organization_ids = Organization.where(:user_ids.in => [current_user.id]).pluck(:id) user_organization_ids = current_user.organization_ids # The associations match, do not report this as an error return if user_organization_ids.length == organization_ids.length # Add an error to this request that's being tracked by AppSignal. # This makes sure we have all the environment details for this request, # along with this custom error report. Appsignal.set_error(InvalidUserAccountAssociationError.new( <<~ERROR Mismatch in user <=> organization associations. database: #{organization_ids.sort} controller: #{user_organization_ids.sort} ERROR )) end end # Custom error class so we can track them in AppSignal easier class InvalidUserAccountAssociationError < StandardError; end
The important part in this code example is the Appsignal.set_error method call that tracks a custom error for this request without it reraising and showing an error page to the user. You can use this yourself as well, see the documentation.
Chasing the Culprit
We deployed this custom error tracking code and sure enough, the errors started trickling in as people were accepting invitations. After fixing some false positives, we finally had errors being reported that matched the incorrect state of user and organization associations in the database. They were all happening during the sign up flow of new users while accepting an invitation.
So now we had samples of errors with a lot of details that could cause the problem. Now the real detective work started. What of these many details (e.g. request params, session data, tags, etc.) was causing the problem?
It took a couple of failed invitations for us to spot something odd in the request parameters: "accepted_terms": "0",
{ "action": "create", "controller": "users/registrations", "user": { "accepted_terms": "0", "email": "[REDACTED]", "name": "Tom", "password": "[FILTERED]" }, # ... }
During sign up, users need to accept the AppSignal Terms & Conditions before they can use AppSignal. When they do not, the sign up fails and they need to resubmit the form after accepting the Terms & Conditions.
It struck us as odd that this could be causing the problem since we test for an incorrectly filled in sign up form. But we went ahead and tried reproducing the scenario anyway. Sure enough, the following popped up when I tried to accept an invitation without accepting the Terms & Conditions.
Elementary, My Dear
This faulty state was just what we were looking for. If users continued to sign up, accepting the Terms & Conditions, they would end up in the faulty association state between users and organizations. Turns out, that while we test the form to be filled in incorrectly, we didn't test if the Terms & Conditions weren't accepted...
We quickly added another test case, and now that we had found the culprit, we fixed the issue. Thank you Robert for helping me find this exception without error! Relieved, we deployed the fix the next day. Now it should no longer happen.
To ensure that it doesn't, we've kept the reporting logic in the controllers for now so that we can keep an eye on it if it happens again. When we're confident that we've fixed the issue, we'll remove the error reporting code and it will be business as usual again.
🎬 So ends the story of the exception that didn't raise an error. With the culprit safely eliminated.
PS: If you liked this detective story featuring AppSignal, you might also like this story, or write your own story of solving crimes in code helped by an APM like AppSignal.