Bug #34485
Dynflow doesn't properly come back if the DB is unavailable for a brief period of time
Description
Ohai,
when I try to restore a Katello 4.3 (or nightly) on EL8 (with rubygem-foreman_maintain-1.0.2-1.el8.noarch), the restore finishes fine, but afterwards not all services are happy:
# hammer ping
database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 58ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 50ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:
    Status:          ok
    Server Response: Duration: 117ms
pulp3_content:
    Status:          ok
    Server Response: Duration: 128ms
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status
After restarting dynflow via

    systemctl restart dynflow-sidekiq@\*

everything seems to work fine again.
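For reference, the status endpoint that hammer points at can also be queried directly; something along these lines (the exact invocation is illustrative, and authentication requirements depend on your setup) shows what the web process knows about the executors:

    curl -sk "https://$(hostname -f)/foreman_tasks/dynflow/status"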
I am not sure this is a maintain bug (or installer, or dynflow, or packaging), but filing it here for investigation.
Updated by Amit Upadhye over 2 years ago
- Copied from Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time added
Updated by Adam Winberg over 1 year ago
This causes remote execution jobs to fail to report the correct status - the job itself executes fine, but its status is stuck at 'running 100%'.
After a restart of the dynflow-sidekiq services, job status is reported correctly again, but the jobs that failed to report remain stuck at 'running 100%' forever.
We do not run postgres locally on our Foreman server so the puppet manifest workaround does not work for us.
Updated by markus lanz over 1 year ago
This also applies to us. We also don't run the PostgreSQL DB locally on the Foreman server. We are running version 3.5.1. Are there any updates/news regarding this topic?
Updated by Adam Ruzicka over 1 year ago
What are your expectations around it? It could be made to survive brief (couple of seconds) connection drops, but definitely not more. Would that be ok?
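For context on what surviving brief drops could look like at the persistence layer: Dynflow stores its state through Sequel, and a sketch along these lines (illustrative only, not the actual Dynflow code; DATABASE_URL and the table name are assumptions) would replace dead pooled connections and retry for a few seconds instead of getting stuck:

    require "sequel"

    # Illustrative sketch, not Dynflow's real initialization. DATABASE_URL is
    # assumed to point at the Foreman PostgreSQL database.
    DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

    # Validate pooled connections on checkout, so a connection that died during
    # a brief outage is discarded and re-established instead of handed to a job.
    DB.extension(:connection_validator)
    DB.pool.connection_validation_timeout = -1 # -1 = validate on every checkout

    # Retry wrapper covering the window while the DB is still coming back up.
    def with_db_retry(attempts: 5, wait: 2)
      tries = 0
      begin
        yield
      rescue Sequel::DatabaseConnectionError, Sequel::DatabaseDisconnectError
        tries += 1
        raise if tries >= attempts
        sleep wait
        retry
      end
    end

    # Hypothetical usage; dynflow_execution_plans is assumed to be the
    # execution-plan table used by Dynflow's Sequel persistence.
    with_db_retry { DB[:dynflow_execution_plans].count }

With validation on every checkout this covers drops of a couple of seconds, which matches the limit described above.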
Updated by Adam Winberg over 1 year ago
> What are your expectations around it?

For me, it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Anything less than that would seldom be useful in our environment.
Updated by markus lanz over 1 year ago
> For me, it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Anything less than that would seldom be useful in our environment.

I agree.
Updated by Adam Ruzicka over 1 year ago
That's not going to fly, I'm afraid. It could be made so that, instead of getting stuck as it does now, the process crashes after a while and systemd then restarts it automatically; this would keep happening until the DB comes back up. However, I'd like to avoid having a service sit there for minutes, seemingly OK while in fact it cannot function at all.
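As a sketch of that crash-and-restart behaviour (a hypothetical drop-in for the templated unit, not a shipped fix), systemd could be told to restart the workers on failure:

    # /etc/systemd/system/dynflow-sidekiq@.service.d/restart.conf
    # Hypothetical drop-in; the values are illustrative.
    [Service]
    Restart=on-failure
    RestartSec=5s

followed by systemctl daemon-reload. The open question is then how quickly the process should give up and exit once the DB is gone.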
Updated by markus lanz over 1 year ago
> That's not going to fly, I'm afraid. It could be made so that, instead of getting stuck as it does now, the process crashes after a while and systemd then restarts it automatically; this would keep happening until the DB comes back up. However, I'd like to avoid having a service sit there for minutes, seemingly OK while in fact it cannot function at all.

Understandable, and I agree. However, as a compromise, a few seconds should also do the trick: in most environments, databases are set up with high-availability mechanisms that fail over within a few seconds, so I guess we don't have to think in terms of minutes. (10-20 seconds should be more than enough.)