Initially investigated in https://github.com/travis-pro/travis-enterprise/issues/299 (apologies for the private link)
When logs-drain is left running long enough it begins to leak threads. Working with the customer and on our own test instance we came across ruby-concurrency/concurrent-ruby#639. I suspected this might have to do with the RabbitMQ reconnection taking longer than the execution interval and timeout of the timer task. Acting on that did seem to help initially, but we still had a problem. For now we have hot-patched the issue in 2.2.1 by restarting the service every 8 hours. Heroku does a courtesy restart every 24 hours, which is probably why we never saw this issue in hosted. Busy servers also don't seem to have this issue, which is why I initially thought it was a RabbitMQ heartbeat timeout issue (we used to have a thread leak around that too, which we fixed in #171).
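For context, the drain loop is driven by a Concurrent::TimerTask. A minimal sketch of the suspected failure mode, with made-up intervals and a hypothetical drain_batch helper (not the actual travis-logs code), looks roughly like this:

require 'concurrent'

# Illustrative only: the intervals and the drain_batch helper are placeholders.
# With concurrent-ruby 1.0.x, TimerTask fires every execution_interval and
# treats a run as failed after timeout_interval.
task = Concurrent::TimerTask.new(execution_interval: 1, timeout_interval: 30) do
  # If this block gets stuck in a RabbitMQ reconnect for longer than the
  # timeout, the timer keeps scheduling new runs while threads stuck in old
  # runs linger -- the suspected source of the ruby_thread_po* pile-up shown
  # further down.
  drain_batch # hypothetical helper standing in for the real drain work
end
task.execute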
After diving in I think we may need to consider rewriting the loop in drain to not use TimerTask (as this is a very old issue), as one project did (ruby-shoryuken/shoryuken#338 and ruby-shoryuken/shoryuken#345), or fix TimerTask in concurrent-ruby.
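To illustrate the direction such a rewrite could take, here is a rough sketch of a plain looping worker, loosely in the spirit of the shoryuken change; the class and method names are placeholders, not the travis-logs API:

# Rough sketch only; DrainLoop and drain_batch are placeholders.
class DrainLoop
  def initialize(interval: 1)
    @interval = interval
    @stopped  = false
  end

  def start
    @thread = Thread.new do
      until @stopped
        begin
          drain_batch # hypothetical drain work; may block on a reconnect
        rescue => e
          warn "drain iteration failed: #{e.message}"
        end
        # A slow iteration just delays the next one; nothing new gets scheduled
        # on top of it, so threads cannot pile up the way they can with TimerTask.
        sleep @interval
      end
    end
  end

  def stop
    @stopped = true
    @thread.join if @thread
  end
end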
Here's some copied context from the private issue:
Left overnight with a higher execution timer and no connection recovery [in bunny], it looks like it still crashed (new pid) and is sitting at 60-some pids [instead of 30]. Looking at the threads, it looks like this:
root@te-main:/# ps -T -p 10366
PID SPID TTY TIME CMD
10366 10366 ? 00:00:08 ruby
10366 11046 ? 00:00:03 ruby-timer-thr
10366 11160 ? 00:00:00 exceptions.rb:*
10366 11267 ? 00:16:25 ruby_thread_po*
10366 495 ? 00:00:10 ruby_thread_po*
10366 908 ? 00:00:09 ruby_thread_po*
10366 1067 ? 00:00:07 ruby_thread_po*
10366 1763 ? 00:00:06 ruby_thread_po*
10366 1908 ? 00:00:06 ruby_thread_po*
10366 2081 ? 00:00:06 ruby_thread_po*
10366 2082 ? 00:00:05 ruby_thread_po*
10366 2164 ? 00:00:05 ruby_thread_po*
10366 2173 ? 00:00:04 ruby_thread_po*
10366 2641 ? 00:00:04 ruby_thread_po*
10366 2643 ? 00:00:03 ruby_thread_po*
10366 2650 ? 00:00:04 ruby_thread_po*
10366 2658 ? 00:00:03 ruby_thread_po*
10366 2660 ? 00:00:02 ruby_thread_po*
10366 2662 ? 00:00:04 ruby_thread_po*
10366 2950 ? 00:00:02 ruby_thread_po*
10366 2951 ? 00:00:02 ruby_thread_po*
10366 2967 ? 00:00:02 ruby_thread_po*
10366 3198 ? 00:00:01 ruby_thread_po*
10366 3199 ? 00:00:00 ruby_thread_po*
10366 3201 ? 00:00:01 ruby_thread_po*
10366 3202 ? 00:00:00 ruby_thread_po*
10366 3204 ? 00:00:00 ruby_thread_po*
10366 3271 ? 00:00:01 ruby_thread_po*
10366 3273 ? 00:00:01 ruby_thread_po*
10366 3366 ? 00:00:00 ruby_thread_po*
10366 3367 ? 00:00:00 ruby_thread_po*
10366 3368 ? 00:00:00 ruby_thread_po*
10366 3369 ? 00:00:00 ruby_thread_po*
10366 3370 ? 00:00:00 ruby_thread_po*
10366 3371 ? 00:00:00 ruby_thread_po*
10366 3372 ? 00:00:00 ruby_thread_po*
10366 3373 ? 00:00:00 ruby_thread_po*
10366 3375 ? 00:00:00 ruby_thread_po*
10366 3376 ? 00:00:01 ruby_thread_po*
10366 3377 ? 00:00:00 ruby_thread_po*
10366 3378 ? 00:00:00 ruby_thread_po*
10366 3379 ? 00:00:00 ruby_thread_po*
10366 3380 ? 00:00:00 ruby_thread_po*
10366 3381 ? 00:00:00 ruby_thread_po*
10366 3382 ? 00:00:01 ruby_thread_po*
10366 3383 ? 00:00:00 ruby_thread_po*
10366 3384 ? 00:00:00 ruby_thread_po*
10366 3385 ? 00:00:00 ruby_thread_po*
10366 3386 ? 00:00:00 ruby_thread_po*
10366 3387 ? 00:00:00 ruby_thread_po*
10366 3388 ? 00:00:00 ruby_thread_po*
10366 3389 ? 00:00:01 ruby_thread_po*
10366 3390 ? 00:00:01 ruby_thread_po*
10366 3586 ? 00:00:00 heartbeat_send*
10366 3587 ? 00:00:00 reader_loop.rb*
10366 3589 ? 00:00:00 consumer_work_*
10366 3591 ? 00:00:00 heartbeat_send*
10366 3592 ? 00:00:00 reader_loop.rb*
10366 3594 ? 00:00:00 consumer_work_*
10366 3596 ? 00:00:00 heartbeat_send*
10366 3597 ? 00:00:00 reader_loop.rb*
10366 3599 ? 00:00:00 consumer_work_*
10366 3601 ? 00:00:00 heartbeat_send*
10366 3602 ? 00:00:00 reader_loop.rb*
10366 3604 ? 00:00:00 consumer_work_*
10366 3625 ? 00:00:00 heartbeat_send*
10366 3626 ? 00:00:00 reader_loop.rb*
10366 3628 ? 00:00:00 consumer_work_*
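(As an aside, one way to see the Ruby side of those threads from inside the process, e.g. from a console attached to it, is a quick dump like the following; this is a diagnostic sketch, not part of logs-drain:)

# Diagnostic sketch: enumerate Ruby-level threads to compare counts and
# backtraces against the ps -T listing above (works on MRI 2.4).
Thread.list.each do |t|
  puts "#{t.inspect} status=#{t.status.inspect}"
  (t.backtrace || []).first(3).each { |line| puts "  #{line}" }
end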
Beginning to feel our only option may be rewriting this loop away from TimerTask, as shoryuken did. I'll try to put together a spike for that today or tomorrow.
Profiling the top CPU user up there:
Summary of profiling data so far:
% self % total name
95.34 100.00 <c function> - unknown
0.49 71.44 synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
0.31 71.10 block in synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
0.21 5.32 safe_execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/ivar.rb
0.19 4.33 execute_task - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/timer_task.rb
0.16 66.80 block in process_tasks - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/timer_set.rb
0.16 0.61 block in initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
0.12 1.74 execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
0.09 64.98 block in ns_wait_until - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/abstract_lockable_object.rb
0.09 0.95 format_l2met - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
0.09 0.52 ordered? - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
0.09 0.38 schedule_time - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
0.09 0.22 initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
0.09 0.22 block (2 levels) in l2met_args_to_record - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
0.07 1.54 new - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/object.rb
0.07 0.67 sink - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
0.07 0.52 block in post - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_executor_service.rb
0.07 0.18 block (3 levels) in <class:Logger> - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger.rb
0.06 65.08 block in wait - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
0.06 4.45 block in execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/safe_task_executor.rb