Project

General

Profile

Actions

Feature #16938

closed

Katello services to be configured to restart on failure

Added by Stephen Benjamin about 8 years ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
Installer
Target version:
Difficulty:
Triaged:
No
Fixed in Releases:
Found in Releases:

Description

Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1310111
Description of problem:
Request:
configure Sat6 essential services to automatically restart on failure.

Reasoning:
Sat6 relies on various services that are essential for various functionality of the product. If such a service fails due to whatever reason (say, segfault), the functionality is temporarily disabled until an administrator intervention. That often comes only at the end of the sequence: some service failed -> some functionality doesnt work -> customer not notified / doesnt check logs -> after some time, they realize the functionality does not work -> raising support case to Red Hat -> takes time for us to identify the cause -> service restarted.

The functionality downtime and Red Hat support intervention is ridiculously high.

(Sat6 health-check script would alleviate this pain, to some extend. But even with that, the request will still be valid. Technically health-check script is just a different for of logs that doesnt restart failed service itself)

On technical level:
- not sure if applicable to RHEL6 where manual changes to each and every init script would have to be done. I am ok doing so for RHEL7 and updating systemd config only
- ideally, systemd service should be configured to restart any failed/killed/.. service several times in a row and then give up - or optionally try to restart the service with some nontrivial delay between the attempts
- essential/critical services: basically to cover "katello-service status" services

Version-Release number of selected component (if applicable):
6.1.6

How reproducible:
100%

Steps to Reproduce:
1. Mimic a service failure by killing it (an example: kill qdrouterd)
2. Wait some time to allow Sat to reheal
3. Ty the failed functionality (an example: install some errata that relies on qdrouterd)

Actual results:
3. fails regardless of the delay in 2.

Expected results:
3. to succeed after some time without any intervention

Additional info:

Actions

Also available in: Atom PDF