We've been using the combination of MongoDB, Mongoid (3.x.x) and Sidekiq for a while now, and lately we noticed that our queues were filling up, but we could not pinpoint any bottleneck in our system.
The CPUs of our workers were never maxed out, even with a full queue. MongoDB was hardly locked and network traffic was well under the limits.
When tailing the MongoDB logs, we noticed that many new connections were being opened every second. We knew the cause: Sidekiq opens a new connection for each job.
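To get a feel for why every job ends up reconnecting, here is a toy simulation in plain Ruby. It is only a sketch: `FakeSession` and `run_jobs` are hypothetical names, not Mongoid or Sidekiq APIs, and it assumes (as the comments in the patch further down describe) that each job runs in a fresh Fiber on a long-lived worker thread and that the session is looked up through `Thread.current[]`, which is fiber-local.

```ruby
# Toy simulation only: FakeSession and run_jobs are hypothetical names,
# not part of Mongoid or Sidekiq.
class FakeSession
  @opened = 0

  class << self
    attr_accessor :opened
  end

  def initialize
    # Stands in for the cost of opening a real database connection.
    self.class.opened += 1
  end
end

# Runs `count` jobs on a single worker thread, each job in its own Fiber,
# and returns how many sessions had to be opened.
def run_jobs(count, storage)
  FakeSession.opened = 0
  Thread.new do
    count.times do
      job = Fiber.new do
        thread = Thread.current
        if storage == :fiber_local
          # Thread#[] is fiber-local, so a new Fiber never sees an old session.
          thread[:session] ||= FakeSession.new
        elsif thread.thread_variable_get(:session).nil?
          # Thread variables are shared by every Fiber on this thread.
          thread.thread_variable_set(:session, FakeSession.new)
        end
      end
      job.resume
    end
  end.join
  FakeSession.opened
end

fiber_local_connections  = run_jobs(100, :fiber_local)
thread_local_connections = run_jobs(100, :thread_local)
puts "fiber-local: #{fiber_local_connections}, thread-local: #{thread_local_connections}"
```

With fiber-local storage the simulated worker opens one session per job; with thread-level storage it opens one per thread and reuses it.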
We soon found out that we were not the only ones with this problem. Avi Tzurel wrote a blog post about this exact issue, and its comments link to a gist that people have found to work.
We've improved it a bit and made it specific for Ruby 2:
    # By default, Sidekiq is going to open a new connection to Mongo for each job and disconnect it afterward because
    # Mongoid stores the session connection on a Fiber-local variable by using Thread.current. Inspired by
    # http://avi.io/blog/2013/01/30/problems-with-mongoid-and-sidekiq-brainstorming/, we can override this to put
    # the sessions at a Thread-local variable, not a Fiber-local one.
    #
    # TODO Remove when on Mongoid 4, since it uses connection pooling.
    module Celluloid
      class Thread < ::Thread
        def []=(key, value)
          if mongoid_session_key?(key)
            # In Ruby 2.0, Thread.current[:foo] = "bar" is Fiber-local, whereas
            # Thread.current.thread_variable_set(:foo, "bar") is local to the entire Thread, and all Fibers running
            # on it are able to see that variable. As such, storing the Mongoid session on the thread level lets
            # each Fiber reuse the Mongoid connection: Sidekiq uses Celluloid, which spins up a pool of worker threads
            # at the designated concurrency level (e.g., by default, Sidekiq uses 25). Celluloid Actors run on those
            # Threads in Fibers, so each time a Sidekiq job is dispatched to an Actor, it creates a new Fiber. Without
            # this override, we have to reconnect to Mongo every single time a job is picked up, and it disconnects
            # when the job finishes!
            #
            # If you want to see this behavior, an easy way to test it is to create a simple Sidekiq job which just does
            # something like User.count, then fire up a Sidekiq worker, enqueue a few hundred jobs, and watch Mongo
            # via mongostat. You'll see connections persist, whereas if you remove this logic, connections will drop
            # and reconnect each time a job is picked up.
            thread_variable_set(key, value)
          else
            super
          end
        end

        def [](key)
          if mongoid_session_key?(key)
            thread_variable_get(key)
          else
            super
          end
        end

        private

        def mongoid_session_key?(key)
          # Just put the sessions data at the Thread level; this leaves things like persistence settings, identity map
          # disabling, etc. to the individual Fiber being managed by Celluloid.
          key.to_s == "[mongoid]:sessions"
        end
      end
    end
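The override pattern itself can be sanity-checked without Celluloid, Sidekiq or MongoDB. The sketch below applies the same idea to a toy `Thread` subclass; `PatchedThread` and `values_seen_in_new_fiber` are hypothetical names used only for illustration, and the key is assumed to be the same `"[mongoid]:sessions"` string the patch matches on.

```ruby
# Toy stand-in for the patch above. PatchedThread is a hypothetical name,
# not Celluloid::Thread, so this runs with nothing but the Ruby core library.
class PatchedThread < ::Thread
  SHARED_KEY = "[mongoid]:sessions"

  def []=(key, value)
    if shared_key?(key)
      thread_variable_set(key, value) # visible to every Fiber on this thread
    else
      super                           # default behaviour: fiber-local
    end
  end

  def [](key)
    if shared_key?(key)
      thread_variable_get(key)
    else
      super
    end
  end

  private

  def shared_key?(key)
    key.to_s == SHARED_KEY
  end
end

# Sets two values in the thread's root fiber, then reads both from a new Fiber.
def values_seen_in_new_fiber
  PatchedThread.new do
    Thread.current[:"[mongoid]:sessions"] = :shared_session
    Thread.current[:other_setting] = :root_fiber_only

    reader = Fiber.new do
      [Thread.current[:"[mongoid]:sessions"], Thread.current[:other_setting]]
    end
    reader.resume
  end.value
end

session_value, other_value = values_seen_in_new_fiber
puts "session: #{session_value.inspect}, other: #{other_value.inspect}"
```

The new Fiber sees the session that was stored under the special key, while the other key stays fiber-local and reads back as `nil`, which matches the intent of leaving everything except the sessions to the individual Fiber.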
We decided to take the gist and deploy it to one of our workers to see if it improved job throughput.
Our load balancer divides the incoming requests evenly among our workers. Once we deployed our fix to one of the workers, we immediately noticed that it consistently finished its jobs in a fraction of the time the other workers took.
Worker one is already done while worker two has just begun processing.
As an added benefit, our MongoDB log files are actually readable again, since the flood of connection messages is gone.
Here's a snapshot of our workers' CPU load; the orange line marks the time we deployed this fix.
It's been running in production for a few weeks now without any issues.
[note] We're in the process of upgrading to Mongoid 4, and since that uses connection pooling, we should be able to remove this patch.