Bug #34485
open
Dynflow doesn't properly come back if the DB is unavailable for a brief period of time
Added by Amit Upadhye almost 3 years ago.
Updated over 1 year ago.
Description
Ohai,
when I try to restore a Katello 4.3 (or nightly) on EL8 (with rubygem-foreman_maintain-1.0.2-1.el8.noarch), the restore finishes fine, but afterwards not all services are happy:
# hammer ping
database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 58ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 50ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:
    Status:          ok
    Server Response: Duration: 117ms
pulp3_content:
    Status:          ok
    Server Response: Duration: 128ms
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status
After restarting dynflow via systemctl restart dynflow-sidekiq@\*, everything seems to work fine again.
I am not sure whether this is a foreman_maintain bug (or installer, or dynflow, or packaging), but I'm filing it here for investigation.
- Copied from Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time added
This causes remote execution jobs to fail to report their correct status: the job itself executes fine, but the job status gets stuck at 'running 100%'.
After a restart of the dynflow-sidekiq services, job status is reported correctly again, but the jobs that failed to report are stuck at 'running 100%' forever.
We do not run Postgres locally on our Foreman server, so the puppet manifest workaround does not work for us.
This also applies to us. We also don't run the PostgreSQL DB locally on the Foreman server. We are running version 3.5.1. Are there any updates or news regarding this topic?
What are your expectations around it? It could be made to survive brief (couple of seconds) connection drops, but definitely not more. Would that be ok?
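To make the "survive brief connection drops" idea concrete, here is a minimal sketch of bounded retry-then-crash behavior. This is not Dynflow's actual implementation; ConnectionLost, with_db_retry, and all timings are made up for illustration. The point is that the process retries for a short, fixed window and then raises (so a supervisor can restart it) instead of hanging indefinitely.

```ruby
# Illustrative error type standing in for a dropped DB connection.
class ConnectionLost < StandardError; end

# Retry the given block on connection loss for at most `max_wait` seconds,
# sleeping `interval` seconds between attempts. Once the window is
# exhausted, re-raise so the process crashes instead of sitting stuck.
def with_db_retry(max_wait: 10, interval: 1)
  waited = 0
  begin
    yield
  rescue ConnectionLost
    raise if waited >= max_wait  # bounded: give up and crash
    sleep interval
    waited += interval
    retry
  end
end

# Simulate a DB that is down for the first two attempts, then recovers.
attempts = 0
result = with_db_retry(max_wait: 10, interval: 0) do
  attempts += 1
  raise ConnectionLost if attempts < 3
  :ok
end
```

With a short window like this, a couple-of-seconds failover is absorbed transparently, while a longer outage turns into a crash that a supervisor can handle.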
What are your expectations around it?
For me, it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Anything less than that would seldom be useful in our environment.
I'll agree.
That's not going to fly I'm afraid. It could be made that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the db comes back up. However I'd like to avoid having a service sitting there for minutes, seemingly ok while in fact it cannot function at all.
That's not going to fly I'm afraid. It could be made that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the db comes back up. However I'd like to avoid having a service sitting there for minutes, seemingly ok while in fact it cannot function at all.
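The crash-and-let-systemd-restart approach described above could be wired up with a unit drop-in along these lines. This is a hypothetical sketch, not the shipped packaging; the file path and values are assumptions, and the shipped units may already set some of these options.

```ini
# Hypothetical drop-in: /etc/systemd/system/dynflow-sidekiq@.service.d/restart.conf
[Service]
# Restart the worker whenever it exits abnormally (e.g. crashes because
# the DB stayed unreachable past the retry window).
Restart=on-failure
RestartSec=5s
```

systemd would then keep cycling the service until the database comes back, at which point a restarted worker reconnects cleanly.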
Understandable, and I agree. However, as a compromise, a few seconds should also do the trick. In most environments, databases are set up with high-availability mechanisms that fail over within a few seconds, so I guess we don't have to think about minutes. (10-20 seconds should be more than enough.)