Bug #16543

if qpid is slow/unresponsive, candlepin event listener will freeze in event loop, causing dynflow executor to stop responding

Added by Chris Duryee almost 3 years ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Difficulty:
Triaged: Yes
Bugzilla link:
Pull request:
Team Backlog:
Fixed in Releases:
Found in Releases:

Description

If you suspend the qpidd process (simulating high qpid load), the candlepin event listener process hangs in its event loop. This stops the dynflow executor from making progress.

For example:

  • On the tasks page, view the candlepin task process. It should say something like:
{"messages"=>"b190e333-7821-302a-8131-9693e66e2144",
 "last_message"=>"b190e333-7821-302a-8131-9693e66e2144 - import.created",
 "error"=>nil,
 "connection"=>"Connected"}
  • Now freeze the qpidd process: kill -19 `pidof qpidd`. Note that the candlepin event listener still thinks it's connected.
  • Run "foreman-rake console" and call Katello::Ping.ping(services: [:foreman_tasks]). Note that the executor fails to respond.

Once qpidd is unsuspended via kill -18, things will run normally again.
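
The symptom above (connection still reported as "Connected" while a read never returns) can be reproduced without qpid at all. The following is only a minimal sketch, not Katello or qpid code: a local server that accepts a connection and then stays silent stands in for the suspended qpidd, and read_with_deadline is a hypothetical helper that bounds the wait with IO.select instead of blocking indefinitely.

```ruby
require "socket"

# Hypothetical helper (not part of Katello): read from `io`, but give up
# after `timeout` seconds instead of blocking forever.
def read_with_deadline(io, timeout)
  ready = IO.select([io], nil, nil, timeout)
  return nil if ready.nil? # nothing arrived before the deadline
  io.read_nonblock(4096, exception: false)
end

# A server that accepts the connection but never sends anything stands in
# for a suspended qpidd: the TCP connection stays established.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
peer = server.accept

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = read_with_deadline(client, 0.5)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "result=#{result.inspect} elapsed=#{elapsed.round(2)}s"
```

A plain blocking read on client would never return here, which matches what the listener does while qpidd is stopped. With an explicit deadline the caller gets control back and can report the service as unresponsive.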

Note: this is the call that hangs: https://github.com/Katello/katello/blob/master/app/lib/actions/candlepin/candlepin_listening_service.rb#L42. The timeout does not appear to be honored. I suspect that if qpidd were unresponsive for long enough, the kernel would eventually sever the TCP connection. I have not tested freezing it for more than 20 minutes, but it could take a couple of hours for this to happen. I suspect the listener would not create a new connection at that point, so if qpid came back, events would start piling up on the katello_event_queue. http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html may provide additional clues.
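
For reference, this is roughly what per-socket TCP keepalive tuning looks like in Ruby. It is a sketch only: enable_keepalive and its default values are hypothetical, and whether the qpid client library used by the listener exposes its underlying socket for this is an assumption that has not been checked. The TCP_KEEP* constants are platform specific, so each one is set only if the platform defines it.

```ruby
require "socket"

# Hypothetical helper: enable TCP keepalive so the kernel probes an idle
# peer and reports failure after roughly idle + interval * count seconds.
def enable_keepalive(sock, idle: 60, interval: 10, count: 5)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  # Per-socket tuning knobs; skipped where the platform lacks the constant.
  { TCP_KEEPIDLE: idle, TCP_KEEPINTVL: interval, TCP_KEEPCNT: count }.each do |name, value|
    next unless Socket.const_defined?(name)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket.const_get(name), value)
  end
  sock
end

sock = enable_keepalive(Socket.new(:INET, :STREAM))
puts "keepalive enabled: #{sock.getsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE).bool}"
```

One caveat: keepalive probes are answered by the peer's kernel, not by the qpidd process. They help when the host or network path goes away, but a process that is merely suspended with kill -19 will still look alive to them, so an application-level timeout would still be needed for the case reproduced above.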

History

#1 Updated by Chris Duryee almost 3 years ago

  • Description updated (diff)

#2 Updated by Eric Helms over 2 years ago

  • Legacy Backlogs Release (now unused) set to 188

#3 Updated by Justin Sherrill over 2 years ago

  • Legacy Backlogs Release (now unused) changed from 188 to 114
