Bug #34485

Dynflow doesn't properly come back if the DB is unavailable for a brief period of time

Added by Amit Upadhye over 3 years ago. Updated about 17 hours ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Difficulty:
Triaged:
No
Fixed in Releases:
Found in Releases:
foreman_tasks-11.0.7

Description

Ohai,

when I try to restore a Katello 4.3 (or nightly) on EL8 (with rubygem-foreman_maintain-1.0.2-1.el8.noarch), the restore finishes fine, but afterwards not all services are happy:

# hammer ping
database:         
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:    
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:        
    Status:          ok
    Server Response: Duration: 58ms
candlepin_auth:   
    Status:          ok
    Server Response: Duration: 50ms
candlepin_events: 
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:   
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:            
    Status:          ok
    Server Response: Duration: 117ms
pulp3_content:    
    Status:          ok
    Server Response: Duration: 128ms
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

After restarting Dynflow via systemctl restart dynflow-sidekiq@\*, everything seems to work fine again.

I am not sure whether this is a foreman_maintain bug (or an installer, dynflow, or packaging one), but I am filing it here for investigation.


Related issues: 1 (0 open, 1 closed)

Copied from Installer - Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time (Closed, Evgeni Golov)
Actions #1

Updated by Amit Upadhye over 3 years ago

  • Copied from Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time added
Actions #2

Updated by Adam Winberg over 2 years ago

This causes remote execution jobs to fail to report their correct status - the job executes fine, but the job status is stuck at 'running 100%'.

After a restart of the dynflow-sidekiq services, job status is reported correctly again, but the jobs that failed to report correctly remain stuck at 'running 100%' forever.

We do not run Postgres locally on our Foreman server, so the Puppet manifest workaround does not work for us.

Actions #3

Updated by markus lanz over 2 years ago

This also affects us. We also don't run the PostgreSQL DB locally on the Foreman server. We are running version 3.5.1. Are there any updates/news regarding this topic?

Actions #4

Updated by Adam Ruzicka over 2 years ago

What are your expectations around it? It could be made to survive brief (couple of seconds) connection drops, but definitely not more. Would that be ok?

Actions #5

Updated by Adam Winberg over 2 years ago

What are your expectations around it?

For me, that it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Any less than that would seldom be useful in our environment.

Actions #6

Updated by markus lanz over 2 years ago

For me, that it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Any less than that would seldom be useful in our environment.

I agree.

Actions #7

Updated by Adam Ruzicka over 2 years ago

That's not going to fly, I'm afraid. It could be made so that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sitting there for minutes, seemingly OK while in fact it cannot function at all.
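For illustration only: if Dynflow were changed to exit once the DB stays unreachable, a systemd drop-in roughly like the following sketch (the file name and values are assumptions, not shipped defaults) would let systemd keep restarting the workers until the DB returns.

# /etc/systemd/system/dynflow-sidekiq@.service.d/restart.conf (hypothetical drop-in)
[Service]
# Restart the worker whenever it exits with an error, e.g. after bailing out on DB loss
Restart=on-failure
# Wait between attempts so a database that is still down is not hammered
RestartSec=10

After creating the drop-in, systemctl daemon-reload followed by systemctl restart dynflow-sidekiq@\* would apply it.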

Actions #8

Updated by markus lanz over 2 years ago

That's not going to fly, I'm afraid. It could be made so that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sitting there for minutes, seemingly OK while in fact it cannot function at all.

Understandable, and I agree. However, as a compromise, a few seconds should also do the trick. In most environments, databases are set up with high-availability mechanisms that fail over within a few seconds, so I guess we don't have to think about minutes. (10-20 seconds should be more than enough.)

Actions #9

Updated by Jeff S 1 day ago

  • Priority changed from Normal to High
  • Found in Releases foreman_tasks-11.0.7 added

Is there any update on this? I see it's years old, but we are having the same issue. A connection drop of a few seconds to the PostgreSQL DB causes the orchestrator to stop working. Worse, it doesn't even tell the service that there is anything wrong, so the service stays up/active. Even worse, when this happens in an HA Foreman cluster, it really messes up all the jobs, since all orchestrators get stuck in passive/passive or active/active mode. At that point you need to go through multiple steps in the DB to fix it. This is a pretty serious issue.

[root@10-222-206-152 ~]# dnf list installed | grep dynf
dynflow-utils.x86_64 1.6.3-1.el9 @foreman
foreman-dynflow-sidekiq.noarch 3.15.0-1.el9 @foreman
rubygem-dynflow.noarch 1.9.1-1.el9 @foreman
rubygem-smart_proxy_dynflow.noarch 0.9.4-1.fm3_14.el9 @foreman-plugins
[root@10-222-206-152 ~]# dnf list installed | grep tasks
rubygem-foreman-tasks.noarch 11.0.0-1.fm3_15.el9 @foreman-plugins
rubygem-hammer_cli_foreman_tasks.noarch 0.0.22-1.fm3_15.el9 @foreman-plugins

Actions #10

Updated by Adam Ruzicka 1 day ago

No, not really. Fixing this completely (as in "I want to be able to remove the database at any time for any amount of time and expect things to continue working") would be non-trivial and not that many people run into it.

Some time ago I put together Dynflow/dynflow#427 [1], which could help with brief outages, but it never gained much traction.

[1] - https://github.com/Dynflow/dynflow/pull/427

Actions #11

Updated by Adam Winberg about 17 hours ago

I ended up writing my own monitor script in Python, using psycopg2. It's a simple script: it establishes a connection to the DB and then runs a simple query every 3s. If the query fails, it tries to reconnect to the DB every 3s, and when that succeeds it restarts the dynflow-sidekiq@* services. It actually works really well. :)
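A minimal sketch of such a watchdog, assuming psycopg2 is available (this is not the actual script; the DSN and the unit pattern below are placeholders, and the 3s interval comes from the description above):

#!/usr/bin/env python3
# Hypothetical watchdog based on the comment above: poll the Foreman DB every few
# seconds and restart the dynflow-sidekiq@* units once the DB becomes reachable
# again after an outage.
import subprocess
import time

import psycopg2

DSN = "host=db.example.com dbname=foreman user=foreman password=changeme"  # placeholder
INTERVAL = 3  # seconds between checks, as described above


def connect():
    # Keep trying until the database accepts connections again
    while True:
        try:
            conn = psycopg2.connect(DSN)
            conn.autocommit = True  # avoid leaving an idle transaction open between checks
            return conn
        except psycopg2.OperationalError:
            time.sleep(INTERVAL)


def db_ok(conn):
    # Run a trivial query to verify the connection is still alive
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        return True
    except psycopg2.Error:
        return False


def main():
    conn = connect()
    while True:
        time.sleep(INTERVAL)
        if db_ok(conn):
            continue
        # The DB went away: drop the dead connection and wait for it to come back
        try:
            conn.close()
        except psycopg2.Error:
            pass
        conn = connect()
        # The DB is back: restart the Dynflow workers so they pick up fresh connections
        subprocess.run(["systemctl", "restart", "dynflow-sidekiq@*"], check=False)


if __name__ == "__main__":
    main()

The script needs enough privileges to restart the dynflow-sidekiq@* units, so it would typically run as root, e.g. as its own small systemd service.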
