Bug #34800

closed

Restarting postgres just before task finish causes discrepancy between foreman and dynflow task status - forever

Added by Adam Ruzicka about 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Difficulty:
Triaged: No
Fixed in Releases:
Found in Releases:

Description

Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=2073847

Description of problem:
When the postgres service is restarted (e.g. as part of a restart of all services, or on its own) just as dynflow is about to complete a task, the task can end up stuck forever in one of a few invalid situations.

"Invalid situation" means e.g.:
- foreman sees the task as stopped/pending while dynflow sees it as stopped/success
- or foreman sees the task as running/pending while dynflow sees it as stopped/success
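A discrepancy of this kind can be spotted by comparing the state/result pair that each side reports. The following is only a minimal sketch: the helper name find_discrepant_tasks and its input format (one task per line: id, foreman state/result, dynflow state/result) are assumptions made for illustration, not part of foreman-tasks or dynflow.

```shell
#!/bin/bash
# Hypothetical helper: reads lines of the form
#   <task-id> <foreman state/result> <dynflow state/result>
# from stdin and prints every task whose two views disagree.
find_discrepant_tasks() {
  local id foreman dynflow
  while read -r id foreman dynflow; do
    # Skip empty lines.
    [ -z "$id" ] && continue
    if [ "$foreman" != "$dynflow" ]; then
      echo "${id}: foreman=${foreman} dynflow=${dynflow}"
    fi
  done
}

# The two invalid situations listed above, plus one consistent task
# (the task ids are made up for the example).
printf '%s\n' \
  "task-a stopped/pending stopped/success" \
  "task-b running/pending stopped/success" \
  "task-c stopped/success stopped/success" | find_discrepant_tasks
```

How the three columns are obtained (UI, task export, database) is left open here on purpose.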

"Forever" means there is no user action to fix the status, like:
- a services restart does not help
- force unlock can move foreman task from running/pending to stopped/pending, but nothing else

Also, until a force unlock is done, such a stuck task can keep holding the lock(s) it acquired on its object(s).

Version-Release number of selected component (if applicable):
Sat6.10.4

How reproducible:
100% within a few attempts

Steps to Reproduce:
One particular reproducer is to destroy a content view (CV) and restart the postgres service right at the end of the task. Guessing "at the end" can be VERY tricky, so the script below checks the number of completed pulp tasks: for a CV with one repo, the ContentView::Destroy task triggers one pulp task. As soon as the script detects as many new completed pulp tasks as there are CVs being destroyed, it restarts postgres.

Script itself:

--------8<----------------8<----------------8<--------
#!/bin/bash
CONCUR=${1:-5}      # number of content views handled concurrently
REPOIDS=${2:-51}    # repository id(s) put into each content view
hmr="hammer shell"

# Create and publish a CV, then strip it down so that only the final
# "content-view delete" remains to be run.
prepare_cv_to_delete() {
  CVID=$1
  (
    echo "content-view create --organization-id=1 --name cv_zoos_${CVID} --repository-ids ${REPOIDS}"
    echo "content-view publish --organization-id=1 --name cv_zoos_${CVID}"
    echo "content-view remove-from-environment --organization-id=1 --name=cv_zoos_${CVID} --lifecycle-environment-id=1"
    echo "content-view version delete --content-view=cv_zoos_${CVID} --version 1.0 --organization-id 1"
  ) | $hmr
}

# Print the current number of rows in pulp's core_task table.
count_pulp_tasks() {
  su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\""
}

for i in $(seq 1 "$CONCUR"); do
  prepare_cv_to_delete "$i" &
done

echo "waiting for CVs create+almost-delete"
time wait

# Start all CV deletions in parallel.
for i in $(seq 1 "$CONCUR"); do
  hammer content-view delete --name="cv_zoos_${i}" --organization-id 1 &
done

echo "$(date): waiting for CVs delete"
tasks=$(count_pulp_tasks)
echo "$(date): waiting for CVs delete, pulp tasks=${tasks}"
expected=$((tasks+CONCUR))
tasks=0
# Poll until one new pulp task per deleted CV has shown up.
while [ "$tasks" -lt "$expected" ]; do
  tasks=$(count_pulp_tasks)
  sleep 0.5
done
#su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
echo "$(date): restarting postgres as having tasks=${tasks}"
systemctl restart rh-postgresql12-postgresql.service
date
time wait
su - postgres -c "psql pulpcore -c \"select count(*) from core_task;\""
--------8<----------------8<----------------8<-------

Usage:

./create_delete_cv_restart_postgres.sh 5 REPOID

where REPOID is an id of a small repo
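The polling loop in the script has no upper bound, so it spins forever if the expected pulp tasks never appear. A possible variant with a timeout is sketched below. The helper name wait_for_count, its argument order and the 300 second default are assumptions for illustration and are not part of the original reproducer.

```shell
#!/bin/bash
# Hypothetical helper: wait_for_count CMD EXPECTED [TIMEOUT_SECONDS]
# Polls CMD (a command or function that prints an integer) every 0.5 s
# until it reports at least EXPECTED. Prints the last count and returns 0
# on success, returns 1 once TIMEOUT_SECONDS (default 300) have passed.
wait_for_count() {
  local cmd="$1" expected="$2" timeout="${3:-300}"
  local deadline current
  deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    current=$("$cmd")
    if [ "${current:-0}" -ge "$expected" ]; then
      echo "$current"
      return 0
    fi
    sleep 0.5
  done
  return 1
}

# Possible use in the reproducer (not executed here), wrapping the psql
# call from the script in a function:
#   count_pulp_tasks() {
#     su - postgres -c "psql pulpcore -c \"copy (select count(*) from core_task) to stdout;\""
#   }
#   tasks=$(wait_for_count count_pulp_tasks "$expected" 600) || echo "timed out"
```

Passing the counter as a command name keeps the helper independent of psql, which also makes it easy to try out with a stub function.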

Actual results:
Random tasks are stuck forever, optionally with acquired locks.

As an example, see attached task export.

Expected results:
No tasks stuck forever. Tasks should be recoverable by a restart or a manual (Skip&)Resume.

Additional info:

Actions #1

Updated by The Foreman Bot about 3 years ago

  • Status changed from New to Ready For Testing
  • Assignee set to Adam Ruzicka
  • Pull request https://github.com/theforeman/foreman-tasks/pull/681 added
Actions #2

Updated by The Foreman Bot over 2 years ago

  • Fixed in Releases foreman-tasks-7.1.1 added
Actions #3

Updated by Adam Ruzicka over 2 years ago

  • Status changed from Ready For Testing to Closed