Bug #34485


Dynflow doesn't properly come back if the DB is unavailable for a brief period of time

Added by Amit Upadhye almost 3 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
Triaged:
No
Fixed in Releases:
Found in Releases:

Description

Ohai,

when I try to restore a Katello 4.3 (or nightly) on EL8 (with rubygem-foreman_maintain-1.0.2-1.el8.noarch), the restore finishes fine, but afterwards not all services are happy:

# hammer ping
database:         
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:    
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:        
    Status:          ok
    Server Response: Duration: 58ms
candlepin_auth:   
    Status:          ok
    Server Response: Duration: 50ms
candlepin_events: 
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:   
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:            
    Status:          ok
    Server Response: Duration: 117ms
pulp3_content:    
    Status:          ok
    Server Response: Duration: 128ms
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

After restarting dynflow via systemctl restart dynflow-sidekiq@\* everything seems to work fine again.
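
For the record, the full recovery sequence, with the glob quoted so the shell passes it through to systemctl:

# systemctl restart 'dynflow-sidekiq@*'
# hammer ping

After this, foreman_tasks reports ok again.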

I am not sure this is a maintain bug (or installer, or dynflow, or packaging), but filing it here for investigation.


Related issues: 1 (0 open, 1 closed)

Copied from Installer - Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time (Closed, Evgeni Golov)
#1

Updated by Amit Upadhye almost 3 years ago

  • Copied from Bug #34394: Dynflow doesn't properly come back if the DB is unavailable for a brief period of time added
#2

Updated by Adam Winberg almost 2 years ago

This causes remote execution jobs to fail to report the correct status - the job executes fine, but the job status is stuck at 'running 100%'.

After a restart of the dynflow-sidekiq services, job status is reported correctly again, but the jobs that failed to report remain stuck at 'running 100%' forever.

We do not run postgres locally on our Foreman server, so the puppet manifest workaround does not work for us.
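
For context, my understanding of that workaround (I have not verified the exact manifest) is that the installer ties the dynflow-sidekiq units to the local postgresql service, so that restarting the database also restarts the executors. A hand-rolled sketch of that coupling would be a systemd drop-in roughly like:

# /etc/systemd/system/dynflow-sidekiq@.service.d/local-db.conf
# illustrative only - these directives are assumptions, not the shipped manifest
[Unit]
After=postgresql.service
PartOf=postgresql.service

followed by systemctl daemon-reload. By construction this only helps when postgresql.service runs on the same host, which is why it does nothing for a remote DB.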

#3

Updated by markus lanz over 1 year ago

This also applies to us. We also don't run the PostgreSQL DB locally on the Foreman server. We are running version 3.5.1. Are there any updates/news regarding this topic?

#4

Updated by Adam Ruzicka over 1 year ago

What are your expectations around it? It could be made to survive brief (couple of seconds) connection drops, but definitely not more. Would that be ok?

#5

Updated by Adam Winberg over 1 year ago

> What are your expectations around it?

For me, that it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Any less than that would seldom be useful in our environment.

#6

Updated by markus lanz over 1 year ago

> For me, that it should survive at least a minute or two of disconnect to allow for DB patching and reboots. Any less than that would seldom be useful in our environment.

I agree.

#7

Updated by Adam Ruzicka over 1 year ago

That's not going to fly, I'm afraid. It could be changed so that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sitting there for minutes, seemingly OK while in fact it cannot function at all.
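
Mechanically that would just lean on systemd's ordinary restart policy. A sketch of the kind of override it implies (values are illustrative):

# /etc/systemd/system/dynflow-sidekiq@.service.d/restart.conf
[Unit]
# don't let repeated crashes trip the start rate limiter while the DB is down
StartLimitIntervalSec=0

[Service]
# if the executor exits after losing the DB, keep retrying until it stays up
Restart=on-failure
RestartSec=5

The point above stands, though: between the crash and the successful restart, the service would correctly show as failed rather than sitting there looking healthy.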

#8

Updated by markus lanz over 1 year ago

> That's not going to fly, I'm afraid. It could be changed so that instead of getting stuck like it does now, it would crash after a while and systemd could then restart it automatically. This would keep happening until the DB comes back up. However, I'd like to avoid having a service sitting there for minutes, seemingly OK while in fact it cannot function at all.

Understandable, and I agree. However, as a compromise, a few seconds should also do the trick. In most environments, databases are set up with high-availability mechanisms that fail over in a few seconds, so I guess we don't have to think about minutes. (10-20 seconds should be more than enough.)
