Feature #15062

Check for and block upgrade process if there are paused tasks

Added by Partha Aji about 3 years ago. Updated 10 months ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
-
Target version:
Team Backlog:
Fixed in Releases:
Found in Releases:

Description

Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1315269
++ This bug was initially created as a clone of Bug #1262447 ++

Description of problem:
The customer upgraded their Satellite from 6.0.8 to 6.1.1 without knowing that they had some paused tasks. The customer reported that the upgrade did not go through cleanly and that they executed the "katello-installer --upgrade" command a couple of times. Now they are hitting multiple issues.

1. Not able to access the sync_status page; they were getting the following error.

>> We're sorry, but something went wrong. We've been notified about this issue and we'll take a look at it shortly.

2. "katello-service restart/status" shows that qpidd and foreman-proxy are having issues. The foreman-proxy service was started manually but stopped again within a couple of seconds. /var/log/messages shows the following.

>>  [Protocol] error Connection qpid.172.20.34.32:5671-172.20.34.32:41809 timed out: closing

3. Also, hammer ping shows there is an issue with the candlepin service. Looking at the tomcat logs, there seems to be a memory leak issue. tomcat_logs.txt is attached to this bugzilla.

Version-Release number of selected component (if applicable):

How reproducible:
Always in the customer environment.

Steps to Reproduce:
The upgrade broke a lot of functions on the Satellite server. A few things noticed during the remote session:

1. Unable to browse the sync_status page.
2. Hammer commands fail because candlepin is not reachable.

Actual results:
"katello-installer --upgrade" made the Satellite unusable.

Expected results:
"katello-installer --upgrade" should be successful.

Additional info:
- A sosreport is attached to this message.

- A few outputs gathered during the remote session:

[root@mtl1prcm04 ~]# hammer ping
[Foreman] Username: dubuissonj
[Foreman] Password for dubuissonj:
candlepin:
Status: FAIL
Server Response: Message: 404 Resource Not Found
candlepin_auth:
Status: FAIL
Server Response: Message: Katello::Resources::Candlepin::CandlepinPing: 404 Resource Not Found <html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - </h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u></u></p><p><b>description</b> <u>The requested resource () is not available.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.24</h3></body></html> (GET /candlepin/status)
pulp:
Status: ok
Server Response: Duration: 29ms
pulp_auth:
Status: ok
Server Response: Duration: 18ms
elasticsearch:
Status: ok
Server Response: Duration: 22ms
foreman_tasks:
Status: ok
Server Response: Duration: 0ms

- They had nearly 70-80 paused tasks in their dynflow console.

659dc82d-9c4a-468d-89d5-a050764e2aa6 Actions::Candlepin::ListenOnCandlepinEvents running pending 2015-09-11 15:00:00 UTC Show
478a94c1-5147-4393-8577-e07ade508da6 Actions::Katello::System::GenerateApplicability paused pending 2015-09-10 13:37:18 UTC Show
d67f32d1-4e41-401e-85c1-f25eec40357f Actions::Katello::System::GenerateApplicability paused pending 2015-09-10 09:37:18 UTC Show
be01d167-27ec-4ca7-ba64-72b45cbb63b4 Actions::Katello::System::GenerateApplicability paused pending 2015-09-10 05:36:06 UTC Show
29959e47-6c6d-4b0d-aca1-810e39ef82b4 Actions::Katello::System::GenerateApplicability paused pending 2015-09-10 01:36:06 UTC Show
a032cbc2-a03c-498b-a688-38a4424ac1bc Actions::Katello::System::GenerateApplicability paused pending 2015-09-09 21:36:07 UTC Show
c1582ea3-f1ad-4d83-aca6-6e5243c0a64e Actions::Katello::System::GenerateApplicability paused pending 2015-09-09 17:36:08 UTC Show
4a8711ab-a512-4819-b637-137067ef6634 Actions::Katello::System::GenerateApplicability paused pending 2015-09-09 13:36:06 UTC Show
ffa2d1ab-aad4-45df-980c-ad44748bf34b Actions::Katello::System::GenerateApplicability paused pending 2015-09-09 01:36:05 UTC Show
ef25a647-422f-476e-909f-6fed7ccc9c53 Actions::Katello::System::GenerateApplicability paused pending 2015-09-08 17:36:05 UTC Show

- iptables and SELinux are disabled. UMASK is 0022.

--- Additional comment from Karthick Murugadhas on 2015-09-11 13:24:35 EDT ---

A few more things to add here:

- We have a few more customers reporting upgrade issues due to pending tasks left over from the previous Satellite version. Can we add some intelligence to our katello-upgrade script to check all prerequisites and stop if they are not met?

Prerequisites, e.g.:
~~~~~~~~~~~~~~~~~~
- Check for pending tasks and clear them all before the upgrade process, or let the upgrade fail and tell the user to clear them before proceeding with the upgrade. Currently the upgrade takes the Satellite to 6.1.1 with some components getting upgraded and some not, leaving the Satellite unusable for customers. Alternatively, we may need to include a rollback in case of failure, with a clear indication of what went wrong and what to look at to fix the issue.

- We are also noticing some permission issues causing some services to fail. It would be great if we could include this check in the prerequisites to avoid service failures.
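A pre-upgrade gate along these lines could be sketched as follows. This is only an illustration: the task list below is stubbed with a here-doc, whereas a real check would query foreman-tasks (e.g. via the dynflow console or the tasks API).

```shell
# Hypothetical pre-upgrade gate: refuse to continue while paused tasks exist.
# The task list is stubbed here; a real script would query foreman-tasks.
task_list() {
cat <<'EOF'
478a94c1 Actions::Katello::System::GenerateApplicability paused
d67f32d1 Actions::Katello::System::GenerateApplicability paused
659dc82d Actions::Candlepin::ListenOnCandlepinEvents running
EOF
}

# Count tasks whose state column (field 3) is "paused".
paused_count=$(task_list | awk '$3 == "paused"' | wc -l)
if [ "$paused_count" -gt 0 ]; then
    echo "Upgrade blocked: $paused_count paused task(s); resolve them before running katello-installer --upgrade."
    # exit 1   # a real pre-upgrade script would abort here
fi
```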

Few customers hitting the issue:
01495830
01429997
01499395

http://post-office.corp.redhat.com/archives/sat6-prio/2015-August/msg00097.html

Note
Customer "01495830" is in a hurry to fix their Satellite environment and is looking for an immediate fix.

--- Additional comment from Karthick Murugadhas on 2015-09-11 13:26:14 EDT ---

--- Additional comment from Pavel Moravec on 2015-10-15 09:51:10 EDT ---

In one case (01521820), the root cause was that tomcat failed to deploy the candlepin app, logging the following to /var/log/tomcat/catalina.*log:

Oct 14, 2015 2:17:00 PM org.apache.catalina.startup.HostConfig deployDirectories
SEVERE: Error waiting for multi-thread deployment of directories to complete
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:1150)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:490)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1614)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:330)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
at org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:402)
at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:347)
at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:1140)
at org.apache.catalina.core.StandardHost.startInternal(StandardHost.java:799)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1559)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1549)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.hornetq.api.core.HornetQBuffers.wrappedBuffer(HornetQBuffers.java:79)
at org.hornetq.core.persistence.impl.journal.JournalStorageManager.loadMessageJournal(JournalStorageManager.java:1555)
at org.hornetq.core.server.impl.HornetQServerImpl.loadJournals(HornetQServerImpl.java:1760)
at org.hornetq.core.server.impl.HornetQServerImpl.initialisePart2(HornetQServerImpl.java:1567)
at org.hornetq.core.server.impl.HornetQServerImpl.access$1400(HornetQServerImpl.java:170)
at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreLiveActivation.run(HornetQServerImpl.java:2103)
at org.hornetq.core.server.impl.HornetQServerImpl.start(HornetQServerImpl.java:426)
at org.candlepin.audit.HornetqContextListener.contextInitialized(HornetqContextListener.java:103)
at org.candlepin.guice.CandlepinContextListener.contextInitialized(CandlepinContextListener.java:119)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4973)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5467)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:632)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1247)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1898)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
... 4 more

This might be avoided / worked around by tuning the Java heap space: try adding the following to /etc/sysconfig/tomcat:

JAVA_OPTS="-Xms1G -Xmx1G"

and restarting the tomcat / tomcat6 service.
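Applied as commands, the workaround above would look roughly like this. Heap sizes are illustrative (later comments in this thread used 6G and even 15G); use the tomcat6 file and service name on RHEL 6.

```shell
# Append a fixed heap for the Candlepin/Tomcat JVM, then restart.
# Sizes are illustrative; pick values that fit the host's RAM.
echo 'JAVA_OPTS="-Xms1G -Xmx1G"' >> /etc/sysconfig/tomcat
systemctl restart tomcat     # on RHEL 6: service tomcat6 restart
```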

--- Additional comment from Nikola Stiasni on 2015-10-15 13:48:06 EDT ---

The following KCS has been created. Please note that it has already been verified in at least one case.

https://access.redhat.com/solutions/1991693

--- Additional comment from Nikola Stiasni on 2015-10-19 13:55:52 EDT ---

It seems that in the new Satellite there is something wrong with Candlepin.
Although Candlepin can be forced to deploy by allocating more Java heap space, that workaround is not a solution.
In one case (#01525964), Candlepin is occupying 44GB of disk space! Many .msg files are being created in /var/lib/candlepin/hornetq/largemsgs.

Moreover, the logs also show the message:

SEVERE: Error waiting for multi-thread deployment of directories to complete
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded

That particular Satellite works, however. "hammer ping" shows everything as OK.

--- Additional comment from Nikola Stiasni on 2015-10-22 10:58:19 EDT ---

Customer (#01521820) says:

Slowly this is becoming embarrassing for Red Hat: today I was transferring more and more of my servers from RHN to Satellite. During this, the Satellite became slower and slower.
In the meantime I have allocated 10 CPUs and 24GB RAM to the Satellite VM, but all CPUs are constantly busy. After nothing was working anymore (subscription-manager register) and the web interface did not react, I rebooted the server. Now the status is:

[root@ffurhelsat01 ~]# hammer ping
candlepin:
Status: FAIL
Server Response: Message: 404 Resource Not Found
candlepin_auth:
Status: FAIL
Server Response: Message: Katello::Resources::Candlepin::CandlepinPing: 404 Resource Not Found (GET /candlepin/status)

It cannot be that a Satellite with the (required) hardware resources and default settings supports just a couple of Content Hosts.
After setting JAVA_OPTS="-Xms6G -Xmx6G" in /etc/sysconfig/tomcat, it is working now.

--------------
Remark: this was a freshly installed Satellite 6.1.3, after the last one broke down in a similar manner.

--- Additional comment from Amit Upadhye on 2015-10-22 11:03:49 EDT ---

Hello,

I can see class-not-found errors (http://pastebin.test.redhat.com/321899) in the tomcat logs of case 01509094.

1) Error injecting constructor, java.lang.NoClassDefFoundError: org/apache/qpid/url/URLSyntaxException
at org.candlepin.guice.AMQPBusPubProvider.<init>(AMQPBusPubProvider.java:66)
while locating org.candlepin.guice.AMQPBusPubProvider
at org.candlepin.guice.CandlepinModule.configureAmqp(CandlepinModule.java:292)
while locating org.candlepin.audit.AMQPBusPublisher
_
Amit Upadhye.

--- Additional comment from Pavel Moravec on 2015-10-23 02:55:15 EDT ---

(In reply to Amit Upadhye from comment #7)

> Hello,
>
> I can see class-not-found errors (http://pastebin.test.redhat.com/321899) in the tomcat logs of case 01509094.
>
> 1) Error injecting constructor, java.lang.NoClassDefFoundError: org/apache/qpid/url/URLSyntaxException
> at org.candlepin.guice.AMQPBusPubProvider.<init>(AMQPBusPubProvider.java:66)
> while locating org.candlepin.guice.AMQPBusPubProvider
> at org.candlepin.guice.CandlepinModule.configureAmqp(CandlepinModule.java:292)
> while locating org.candlepin.audit.AMQPBusPublisher
> _
> Amit Upadhye.

This is a common "error" that appears on healthy deployments as well. Sat6 has more such red herrings :(

--- Additional comment from Sachin Ghai on 2015-10-23 03:11:55 EDT ---

(In reply to Pavel Moravec from comment #8)

> (In reply to Amit Upadhye from comment #7)
>
> > Hello,
> >
> > I can see class-not-found errors (http://pastebin.test.redhat.com/321899) in the tomcat logs of case 01509094.
> >
> > 1) Error injecting constructor, java.lang.NoClassDefFoundError: org/apache/qpid/url/URLSyntaxException
> > at org.candlepin.guice.AMQPBusPubProvider.<init>(AMQPBusPubProvider.java:66)
> > while locating org.candlepin.guice.AMQPBusPubProvider
> > at org.candlepin.guice.CandlepinModule.configureAmqp(CandlepinModule.java:292)
> > while locating org.candlepin.audit.AMQPBusPublisher
> > _
> > Amit Upadhye.
>
> This is a common "error" that appears on healthy deployments as well. Sat6 has more such red herrings :(

These errors appear when the user doesn't stop the tomcat service before "yum update" while following the upgrade steps.

--- Additional comment from Sachin Ghai on 2015-10-23 03:21:49 EDT ---

I would like to reproduce this locally to see what is causing these issues, especially the Java heap issue. Can we get some more info on this bz?

1) On which base OS was the sat6 server installed? RHEL 6.x or RHEL 7.x?
2) How many Red Hat subscriptions were enabled or repos synced? A rough estimate would help.

--- Additional comment from Nikola Stiasni on 2015-10-26 05:54:56 EDT ---

Sachin,

The system is:
Red Hat Enterprise Linux Server release 7.1 (Maipo)

Customer has around 390 repositories (RHEL5, RHEL6, RHEL7, both i386 and x86_64, EPEL, etc.) and around 107 content hosts, approximately 40 of which are hypervisors. VIRTWHO_INTERVAL is the default value (3600).

His Satellite keeps breaking down, and he increased
JAVA_OPTS="-Xms15G -Xmx15G" so that Candlepin can be deployed.

--- Additional comment from Sachin Ghai on 2015-10-26 06:50:53 EDT ---

Thanks for the info Nikola. I just upgraded a system from sat6.0.8 to sat6.1.3 and couldn't reproduce the reported issue.

I enabled rhel71 and rhel67 repos from the Red Hat manifest and kept 4-5 tasks in a pending state. However, the upgrade was successful.

[root@cloud-qe-7 ~]# katello-installer --upgrade
Upgrading...
Upgrade Step: stop_services...
Upgrade Step: start_mongo...
Upgrade Step: migrate_pulp...
Upgrade Step: start_httpd...
Upgrade Step: migrate_candlepin...
Upgrade Step: migrate_foreman...
Upgrade Step: Running installer...
Installing Done [100%] [..................................................................]
The full log is at /var/log/katello-installer/katello-installer.log
Upgrade Step: restart_services...
Upgrade Step: db_seed...
Upgrade Step: errata_import (this may take a while) ...
Upgrade Step: update_gpg_urls (this may take a while) ...
Upgrade Step: update_repository_metadata (this may take a while) ...
Katello upgrade completed!
[root@cloud-qe-7 ~]# hammer -u admin -p changeme ping
candlepin:
Status: ok
Server Response: Duration: 27ms
candlepin_auth:
Status: ok
Server Response: Duration: 18ms
pulp:
Status: ok
Server Response: Duration: 26ms
pulp_auth:
Status: ok
Server Response: Duration: 15ms
elasticsearch:
Status: ok
Server Response: Duration: 26ms
foreman_tasks:
Status: ok
Server Response: Duration: 1ms

I saw the following statements in the logs for the pending tasks, but didn't see any failure:

2015-10-26 06:22:42 [I] Connecting to database specified by database.yml
2015-10-26 06:22:49 [W] Creating scope :completer_scope. Overwriting existing method Location.completer_scope.
2015-10-26 06:22:49 [W] Creating scope :completer_scope. Overwriting existing method Organization.completer_scope.
2015-10-26 06:22:54 [E] Abnormal execution plans, process was probably killed.
Following ExecutionPlans will be set to paused,
it should be fixed manually by administrator.
ExecutionPlan state result
c767737a-63b1-4eec-9939-294395ecb4db running pending
2015-10-26 06:22:58 [I] init config for SecureHeaders::Configuration
2015-10-26 06:23:03 [I] shutting down Core ...
2015-10-26 06:23:03 [E] ... core terminated.

--- Additional comment from Sachin Ghai on 2015-10-26 06:54:11 EDT ---

Since the customer has enabled over 390 repositories, I guess that's what is causing the candlepin failures. I'll prepare another setup with multiple repositories and perform the upgrade again.

--- Additional comment from Nikola Stiasni on 2015-10-26 09:01:21 EDT ---

More information:

[root@ffurhelsat01 largemsgs]# du -hs /var/lib/candlepin/hornetq/journal
5.5G /var/lib/candlepin/hornetq/journal

That is not normal.

--- Additional comment from Alex Wood on 2015-10-28 12:59:37 EDT ---

I think we have a good idea of what is causing the OOM issues that
customers are seeing [1]. What we believe is happening is that during
the upgrade process, the keystore password in
/etc/gutterball/gutterball.conf is set to an incorrect value. When
Gutterball starts, it cannot connect to QPid to pull events down. The
events stack up in QPid and eventually QPid gets so bogged down that
the events stack up in HornetQ. At that point, HornetQ runs the JVM
out of memory and everything crashes. I set up this scenario in my
system and very quickly got my heap usage up to around 91%. I never
got the OOM, but I imagine with the additional load of Pulp on Qpid,
that would be enough to bring everything to a halt.

Several people hitting the issue indicate that they have two
gutterball.conf files, each with a different password for the keystore.
I am not sure why the password is getting changed during the upgrade process. They can check the password very easily like so:

% keytool -list -storepass PASSWORD_HERE -keystore /etc/gutterball/certs/amqp/gutterball.jks

(The keytool command is part of the Java JDK and I believe Katello may
not install that by default. If keytool isn't found, the user may
need to yum install java-1.7.0-openjdk-devel and execute keytool from
/usr/lib/jvm/JAVA_VERSION_HERE/jre/bin/)

Trying that keytool command with the passwords from both the
gutterball.conf and gutterball.conf.rpmsave files should be illuminating. A correct password will print some output; an incorrect password will print "keytool error: java.io.IOException: Keystore was tampered with, or password was incorrect".

I will wager that the keytool command will work with the password from
the rpmsave file. If that is the case, users can back up the current
gutterball.conf and just `mv gutterball.conf.rpmsave gutterball.conf`.
(I did not see any actual changes to the file contents beyond the
password).
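The password comparison described above can be sketched as follows. The config key name and file contents here are stubbed assumptions, not the real gutterball.conf format; on a live system you would extract the password from each of the two real files and feed it to the keytool command shown earlier.

```shell
# Stub the two config files so the extraction/comparison logic is visible.
tmpdir=$(mktemp -d)
printf 'keystore_password=newpass\n' > "$tmpdir/gutterball.conf"
printf 'keystore_password=oldpass\n' > "$tmpdir/gutterball.conf.rpmsave"

# Pull the password out of each file (the key name is an assumption).
live=$(sed -n 's/^keystore_password=//p' "$tmpdir/gutterball.conf")
saved=$(sed -n 's/^keystore_password=//p' "$tmpdir/gutterball.conf.rpmsave")

if [ "$live" != "$saved" ]; then
    echo "Passwords differ; verify each with keytool -list against gutterball.jks"
fi
rm -rf "$tmpdir"
```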

--- Additional comment from Nikola Stiasni on 2015-10-29 06:04:33 EDT ---

Unfortunately, this seems not to be the reason. The customer checked it and the command gave the expected output.

Moreover, he re-installed the Satellite from scratch, so it seems that this issue is not only affecting upgraded Satellites.

This issue is related to Bug 1274669.

--- Additional comment from RHEL Product and Program Management on 2015-11-19 10:25:07 EST ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been
set to + automatically for the next planned release.


Related issues

Related to Katello - Feature #15611: Create a preupgrade script to check systems before upgrades (Closed, 2016-07-07)

Associated revisions

Revision 2277165f (diff)
Added by David Davis almost 3 years ago

Refs #15062 - Adding documentation for upgrade script

Revision 3b8dc318
Added by David Davis almost 3 years ago

Merge pull request #270 from daviddavis/temp/20160608153045

Refs #15062 - Adding documentation for upgrade script

History

#1 Updated by The Foreman Bot about 3 years ago

  • Status changed from New to Ready For Testing
  • Pull request https://github.com/theforeman/foreman-tasks/pull/186 added

#2 Updated by Eric Helms about 3 years ago

  • Legacy Backlogs Release (now unused) set to 86

#3 Updated by Eric Helms about 3 years ago

  • Project changed from Katello to foreman-tasks
  • Category deleted (Upgrades)
  • Legacy Backlogs Release (now unused) deleted (86)

#4 Updated by The Foreman Bot about 3 years ago

  • Pull request https://github.com/Katello/katello-installer/pull/337 added

#5 Updated by David Davis almost 3 years ago

  • Pull request deleted (https://github.com/Katello/katello-installer/pull/337, https://github.com/theforeman/foreman-tasks/pull/186)

#6 Updated by The Foreman Bot almost 3 years ago

  • Pull request https://github.com/Katello/katello/pull/6097 added

#7 Updated by The Foreman Bot almost 3 years ago

  • Pull request https://github.com/Katello/katello.org/pull/270 added

#8 Updated by Eric Helms almost 3 years ago

  • Project changed from foreman-tasks to Katello

#9 Updated by Eric Helms almost 3 years ago

  • Legacy Backlogs Release (now unused) set to 163

#10 Updated by Eric Helms almost 3 years ago

  • Legacy Backlogs Release (now unused) changed from 163 to 170

#11 Updated by David Davis almost 3 years ago

  • Related to Feature #15611: Create a preupgrade script to check systems before upgrades added

#12 Updated by Eric Helms almost 3 years ago

  • Status changed from Ready For Testing to Closed
