Bug #22935
closed
First bulk REX after smart_proxy_dynflow_core can raise "Could not use any Capsule" error
Description
Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1558069
Description of problem:
When there is only one Capsule for REX (or only one Capsule can be used for REX to some hosts), a "Could not use any Capsule" error might occur after a fresh restart of the smart_proxy_dynflow_core (spdc) service, followed by any time delay and then a batch of REX jobs.
Per aruzicka++, it happens because:
- restarting or reloading this service does not by itself load the DB schema or apply DB migrations
- that is done during the very first REX job / probe of this Capsule's "liveness"
- if multiple REX jobs are scheduled for the same time and several "is this Capsule running?" queries are raised concurrently, a race condition, followed by an SQL error in the spdc logs, can cause the probe to fail
- if this Capsule is the only one available for some host, the whole REX job for that host fails with the "Could not use any Capsule" error
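The race described above comes from several concurrent callers all trying to perform the same lazy, one-time initialization. The following is only a minimal sketch of that pattern and of one common way to serialize it; it is not the spdc code nor its actual fix. All names (STATE_DIR, init_db, probe, run_probes) are hypothetical, and an atomic mkdir stands in for whatever locking a real service would use.

```shell
#!/usr/bin/env bash
# Illustration only: every name here is hypothetical, not part of spdc.
STATE_DIR=$(mktemp -d)

# Stand-in for "load the DB schema / apply migrations" on first use.
init_db() {
    sleep 0.2                              # pretend the migration takes a while
    echo "migrated" >> "$STATE_DIR/init.log"
}

# Liveness probe that initializes lazily but lets only one caller do the work.
probe() {
    if mkdir "$STATE_DIR/init.lock" 2>/dev/null; then
        # mkdir either creates the directory or fails, so exactly one
        # concurrent caller reaches this branch.
        init_db
        touch "$STATE_DIR/init.done"
    else
        # Everyone else waits for the winner instead of racing it.
        while [ ! -e "$STATE_DIR/init.done" ]; do
            sleep 0.05
        done
    fi
    echo "alive"
}

# Fire ten probes at once, as a bulk of jobs with the same start time would.
run_probes() {
    local i
    for i in $(seq 1 10); do
        probe &
    done
    wait
}
```

Calling run_probes prints "alive" ten times while init.log receives a single line. Without the lock, every probe would call init_db concurrently, which is the situation the bullets above describe.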
Version-Release number of selected component (if applicable):
any (incl. Sat 6.2.14 and 6.3.0)
How reproducible:
100% (within a few attempts at worst)
Steps to Reproduce:
0) Have a Satellite without any additional Capsule and with REX working to some host (1 is enough)
1) Restart the spdc service:
service smart_proxy_dynflow_core restart
2) Schedule several REX jobs for the near future:
echo "date" > /tmp/rex-date
- set --start-at to a time in the near future and replace "MYHOST" with a host you have; optionally set --job-template-id to the ID of the SSH REX template
for i in $(seq 1 100); do echo "job-invocation create --job-template-id 94 --input-files command=/tmp/rex-date --search-query \"name ~ MYHOST\" --async --start-at \"2018-03-19T13:50:00\""; done | hammer -u admin -p redhat shell
3) observe if all jobs succeeded
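Step 2 can be made less error-prone by generating the hammer shell input from parameters instead of editing the one-liner by hand. This is a sketch only: gen_jobs and soon are hypothetical helper names, the hammer options are the ones used in step 2, and soon assumes GNU date.

```shell
#!/usr/bin/env bash
# Print COUNT identical "job-invocation create" lines for "hammer shell".
# Arguments: COUNT HOST TEMPLATE_ID START_AT
gen_jobs() {
    local count=$1 host=$2 template_id=$3 start_at=$4
    local i
    for i in $(seq 1 "$count"); do
        echo "job-invocation create --job-template-id $template_id --input-files command=/tmp/rex-date --search-query \"name ~ $host\" --async --start-at \"$start_at\""
    done
}

# A start time five minutes from now, in the format used in step 2.
# Relies on GNU date's -d option.
soon() {
    date -d '+5 minutes' '+%Y-%m-%dT%H:%M:%S'
}
```

For example: gen_jobs 100 MYHOST 94 "$(soon)" | hammer -u admin -p redhat shell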
Actual results:
3) the first few jobs (usually two) fail; this is not guaranteed and depends on how quickly they trigger the probe to spdc
Expected results:
3) no such errors
Additional info:
workaround: run a dummy REX job against that Capsule after any spdc service reload and/or restart
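The workaround could be wrapped in a small helper that restarts the service and immediately sends one warm-up job. This is a sketch under assumptions: restart_and_warm_up, SERVICE_CMD and HAMMER_CMD are hypothetical names, the hammer options are the ones from the reproducer, and --async is omitted on the assumption that hammer then waits for the job to finish.

```shell
#!/usr/bin/env bash
# Hypothetical overridable hooks; the defaults match the commands used in
# the reproducer above.
SERVICE_CMD=${SERVICE_CMD:-service}
HAMMER_CMD=${HAMMER_CMD:-hammer -u admin -p redhat}

# Restart spdc, then run one dummy REX job so that the first probe (and the
# DB initialization it triggers) happens before any bulk of jobs arrives.
# Arguments: HOST TEMPLATE_ID COMMAND_FILE
restart_and_warm_up() {
    local host=$1 template_id=$2 command_file=$3

    $SERVICE_CMD smart_proxy_dynflow_core restart || return 1

    # No --async here: the intent is to return only after the dummy job ran.
    $HAMMER_CMD job-invocation create \
        --job-template-id "$template_id" \
        --input-files "command=$command_file" \
        --search-query "name ~ $host"
}
```

For example: restart_and_warm_up MYHOST 94 /tmp/rex-date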
Updated by The Foreman Bot about 6 years ago
- Status changed from New to Ready For Testing
- Assignee set to Adam Ruzicka
- Pull request https://github.com/theforeman/smart_proxy_dynflow/pull/48 added
Updated by Adam Ruzicka about 6 years ago
- Status changed from Ready For Testing to Closed
- % Done changed from 0 to 100
Applied in changeset foreman_proxy_dynflow|2674c557728fbf7a3792793383cb7ab1d244f059.