Bug #16543

open

if qpid is slow/unresponsive, candlepin event listener will freeze in event loop, causing dynflow executor to stop responding

Added by Chris Duryee over 7 years ago. Updated almost 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Description

If you suspend the qpidd process (simulating high qpid load), the candlepin event listener process hangs in its event loop. As a result, the dynflow executor stops responding.

For example:

  • On the tasks page, view the candlepin event listener task. Its output should look something like:
{"messages"=>"b190e333-7821-302a-8131-9693e66e2144",
 "last_message"=>"b190e333-7821-302a-8131-9693e66e2144 - import.created",
 "error"=>nil,
 "connection"=>"Connected"}
  • Now freeze the qpidd process: kill -19 `pidof qpidd`. Note that the candlepin event listener still thinks it's connected.
  • Open a console with "foreman-rake console" and run Katello::Ping.ping(services: [:foreman_tasks]). Note that the executor fails to respond.

Once qpidd is unsuspended via kill -18, things will run normally again.
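
The "still Connected" symptom above can be illustrated outside of Katello. The following is a minimal sketch, not the listener's actual code: a forked echo server is a hypothetical stand-in for qpidd, and `round_trip` is a made-up helper. It assumes a Linux-like system where `fork` and the STOP/CONT signals are available.

```ruby
require "socket"

# Hypothetical stand-in for qpidd: a child process that echoes each line back.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

child = fork do
  conn = server.accept
  while (line = conn.gets)
    conn.puts(line)
  end
  exit!(0) # skip any at_exit handlers inherited from the parent
end
server.close # the parent only needs the client side

client = TCPSocket.new("127.0.0.1", port)

# Hypothetical helper: send msg, then wait up to `wait` seconds for the echo.
# Returns the echoed line, or nil if nothing arrived in time.
def round_trip(sock, msg, wait)
  sock.puts(msg)
  IO.select([sock], nil, nil, wait) ? sock.gets&.chomp : nil
end

before = round_trip(client, "ping-1", 2)

Process.kill("STOP", child)              # what `kill -19` does on x86 Linux
sleep 0.1
during = round_trip(client, "ping-2", 1) # the write succeeds, but no reply comes
still_open = !client.closed?             # the client has no sign of a problem

Process.kill("CONT", child)              # what `kill -18` does on x86 Linux
after = IO.select([client], nil, nil, 2) ? client.gets&.chomp : nil

puts "before suspend: #{before.inspect}"
puts "while suspended: #{during.inspect} (socket still open: #{still_open})"
puts "after resume: #{after.inspect}"

client.close
Process.wait(child)
```

While the peer is suspended, the kernel keeps the TCP connection established and accepts the client's write, so only a bounded wait on the read side reveals that nothing is answering. Once the peer is resumed, the queued message is processed as if nothing had happened.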

Note: this is the call that hangs: https://github.com/Katello/katello/blob/master/app/lib/actions/candlepin/candlepin_listening_service.rb#L42. The timeout does not appear to be honored. I suspect that if qpidd were unresponsive for long enough, the kernel would eventually sever the TCP connection. I have not tested freezing it for more than 20 minutes, but it could take a couple of hours for this to happen. I suspect that the listener would not create a new connection at that point, and if qpid came back, events would start piling up on the katello_event_queue. http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html may provide additional clues about this.
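
One possible mitigation, sketched here under assumptions and not taken from the Katello code, is to enforce a hard deadline around the blocking fetch from the caller's side. `fetch_with_deadline` is a hypothetical helper, and `sleep` stands in for the call that hangs. This approach only works if the blocking call releases Ruby's global VM lock while it waits; a native extension call that holds the lock would still stall every Ruby thread, and whether that applies to the qpid binding has not been verified here.

```ruby
# Hypothetical helper: run a blocking call in a worker thread and give up
# after `deadline` seconds. Returns the call's result, or :timed_out.
def fetch_with_deadline(deadline, &blocking_call)
  worker = Thread.new(&blocking_call)
  if worker.join(deadline) # join returns nil when the deadline passes first
    worker.value
  else
    worker.kill            # abandon the stuck call
    :timed_out
  end
end

# `sleep 5` is a stand-in for a fetch whose own timeout is not honored.
slow = fetch_with_deadline(0.2) { sleep 5; "message" }
fast = fetch_with_deadline(1) { "message" }

puts "slow call: #{slow.inspect}"
puts "fast call: #{fast.inspect}"
```

A caller that sees `:timed_out` could then mark the connection as suspect and reconnect, instead of continuing to report "Connected".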

#1

Updated by Chris Duryee over 7 years ago

  • Description updated (diff)

#2

Updated by Eric Helms over 7 years ago

  • translation missing: en.field_release set to 188

#3

Updated by Justin Sherrill about 7 years ago

  • translation missing: en.field_release changed from 188 to 114