Bug #23455
Upgrade and migration 1.15.6 to 1.16.1 (with Katello) - Foreman unstable and unusable
Status: Closed
Description
We are moving/migrating/upgrading our Foreman installation.
The backup/restore/upgrade went well (I hit https://bugzilla.redhat.com/show_bug.cgi?id=1556819, but the workaround offered there worked).
As I began attaching the smart-proxies back to the new Foreman, the first few went without issue. Eventually, though, I began to get flooded with messages like this in the log:
Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd32429: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.c0500867-d215-408e-a1ed-f8ed8b5e4075
Apr 30 12:41:40 foreman-01 qpidd32429: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:40 foreman-01 qpidd: 2018-04-30 12:41:40 [Protocol] error Error on attach: Node not found: pulp.agent.8278fc78-0b10-4ef5-b2de-b22b3e502007
Apr 30 12:41:44 foreman-01 qpidd32429: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:44 foreman-01 qpidd: 2018-04-30 12:41:44 [Protocol] error Error on attach: Node not found: pulp.agent.a3754ba6-6067-4c46-899a-248cabc4a2d8
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd32429: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.033fe747-c608-4207-9176-a17cdedbff49
Apr 30 12:41:45 foreman-01 qpidd32429: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc
Apr 30 12:41:45 foreman-01 qpidd: 2018-04-30 12:41:45 [Protocol] error Error on attach: Node not found: pulp.agent.af48ce1d-a58d-4b58-bf0f-c005e2c754fc
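(For reference: "Node not found" here means a consumer is trying to attach to a pulp.agent.<uuid> queue that no longer exists on the broker. The queues the broker does know about can be listed with something like the sketch below; the certificate path is the usual Katello default and may differ on a given install.)
# Sketch: list the broker's queues and look for the pulp.agent entries from the log
qpid-stat -q -b amqps://localhost:5671 \
  --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt | grep pulp.agent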
...and I noticed that the UI locks up and becomes unresponsive until a restart, after which it is only operable for a few minutes.
There are constantly 5 or 6 postgres tasks, each taking up a full core, that do not seem to resolve themselves even after being left to run for several hours.
33848 | -00:00:01.183384 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33963 | -00:00:03.247328 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33928 | -00:00:04.361622 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33889 | -00:00:04.430823 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33780 | -00:00:04.430876 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
33814 | -00:00:04.431819 | foreman | SELECT "sources".* FROM "sources" WHERE "sources"."digest" = $1 ORDER BY "sources"."id" ASC LIMIT 1
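(For reference, a listing like the one above can be produced from pg_stat_activity, for example with a sketch along these lines, run on the Foreman box:)
# Sketch: show active Foreman queries and how long they have been running
sudo -u postgres psql -c "SELECT pid, now() - query_start AS duration, datname, query
  FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"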
Updated by Josh Pavel over 6 years ago
I enabled sql logging, and it appeared that the postgres queries that were holding up everything were related to processing OpenSCAP. I turned off OpenSCAP and ran vacuum analyze, and things seem to be much better. I am still getting the qpidd/pulp.agent errors.
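(A sketch of the vacuum step described above, assuming the default Foreman database name of foreman:)
sudo -u postgres vacuumdb --dbname=foreman --analyze --verbose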
Updated by Andrew Kofink over 6 years ago
- Tracker changed from Support to Bug
Can you confirm that your backup/restore restored /var/lib/qpidd?
Updated by Josh Pavel over 6 years ago
I believe it was not:
[root@foreman-01 katello-backup-20180428152442]# ls -la
total 249345816
drwxrwx---. 2 root postgres 4096 Apr 28 20:29 .
drwxrwx---. 6 root postgres 12288 Apr 30 07:57 ..
-rw-r--r--. 1 root root 352057984 Apr 28 15:42 config_files.tar.gz
-rw-r--r--. 1 root root 25013972 Apr 28 15:42 .config.snar
-rw-r--r--. 1 root root 41008 Apr 28 15:24 metadata.yml
-rw-r--r--. 1 root root 14166466469 Apr 28 18:29 mongo_data.tar.gz
-rw-r--r--. 1 root root 126 Apr 28 18:29 .mongo.snar
-rw-r--r--. 1 root root 37676653124 Apr 28 18:08 pgsql_data.tar.gz
-rw-r--r--. 1 root root 44445 Apr 28 18:08 .postgres.snar
-rw-r--r--. 1 root root 202906142720 Apr 28 17:10 pulp_data.tar
-rw-r--r--. 1 root root 203443453 Apr 28 17:10 .pulp.snar
Updated by John Mitsch over 6 years ago
- Assignee set to Christine Fouant
The directory /var/lib/qpidd should be in config_files.tar.gz; you can look at the directories it contains with vim or with tar -tf config_files.tar.gz.
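(For example, roughly, run from the backup directory with the filenames from the listing above:)
# Sketch: check whether /var/lib/qpidd made it into the backup archive
tar -tzf config_files.tar.gz | grep 'var/lib/qpidd' \
  || echo "no /var/lib/qpidd entries found"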
Updated by Justin Sherrill over 6 years ago
- Status changed from New to Need more information
Josh, were you able to check that tar file?
Updated by Josh Pavel over 6 years ago
Yes - I extracted config_files.tar.gz, and the only directories it had under /var/lib were "candlepin" and "puppet" - nothing for qpidd.
The ongoing issue I have is that qdrouterd does not seem to be functional; even the Foreman server itself can't connect. I see this repeatedly in the log:
2018-05-21 20:57:29.586236 +0000 SERVER (info) Connection from <IP>:47878 (to 0.0.0.0:5646) failed: amqp:connection:framing-error SSL Failure: Unknown error
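(To separate a TLS handshake failure from an AMQP-level problem, the router listener can be exercised directly with something like the sketch below; the certificate paths are the usual Katello locations and may differ on a given install.)
# Sketch: test the qdrouterd SSL listener on 5646 directly
openssl s_client -connect $(hostname -f):5646 \
  -CAfile /etc/pki/katello/certs/katello-default-ca.crt \
  -cert /etc/pki/katello/qpid_router_client.crt \
  -key /etc/pki/katello/qpid_router_client.key </dev/null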
Updated by Andrew Kofink over 6 years ago
- Assignee changed from Christine Fouant to John Mitsch
Updated by John Mitsch over 6 years ago
Discussed this off-thread, where I suggested tarring up /var/lib/qpidd on the old machine and untarring it on the new machine. I'm not sure why qpidd was missed in the backup; from the code logic it should be included.
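(Roughly, the copy looks like the sketch below; old.example.com stands in for the old server's hostname, and the broker services should be stopped around the copy:)
systemctl stop qpidd qdrouterd
ssh root@old.example.com 'tar -C /var/lib -czf - qpidd' | tar -C /var/lib -xzf -
chown -R qpidd:qpidd /var/lib/qpidd
systemctl start qpidd qdrouterd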
Updated by Josh Pavel over 6 years ago
As John suggested, I manually copied the /var/lib/qpidd data from the old server to the new one. That is now in place, but I am still having issues, which as far as I can tell are related to qdrouterd functions.
As I see qdrouterd SSL errors, I looked at the config:
ssl-profile {
name: server
cert-db: /etc/pki/katello/certs/katello-default-ca.crt
cert-file: /etc/pki/katello/qpid_router_server.crt
key-file: /etc/pki/katello/qpid_router_server.key
}
If I look at /etc/pki/katello-certs-tools/certs/katello-default-ca.crt, I see the old hostname in there (and it has a date from 2017, before the server rename/migration).
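(The hostname baked into that CA can be checked against the machine's current FQDN with something like the following; the path is the one from the ssl-profile above:)
openssl x509 -in /etc/pki/katello/certs/katello-default-ca.crt -noout -subject -issuer -dates
hostname -f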
The errors in qdrouterd.log that I see are:
2018-05-31 15:04:46.277338 +0000 SERVER (info) Connection from <redacted IP>:58952 (to 0.0.0.0:5646) failed: amqp:connection:framing-error SSL Failure: Unknown error
and
2018-05-31 15:04:56.347872 +0000 SERVER (info) Connection from <redacted IP>:41064 (to 0.0.0.0:5646) failed: amqp:connection:framing-error No valid protocol header found
Those seem coupled with:
2018-05-31 15:04:46.277869 +0000 ROUTER_LS (info) Link to Neighbor Router Lost - link_tag=8
Updated by John Mitsch over 6 years ago
Josh,
Were you able to resolve this issue?
-John
Updated by Josh Pavel over 6 years ago
Unfortunately no. Qdrouterd is still full of errors, and all hosts report "unknown" status, even with katello-agent running. I believe the certs are mismatched with the hostname. How can I regenerate those?
Updated by Justin Sherrill over 6 years ago
You could try using katello-change-hostname to change to a new hostname (and then potentially change it back). That should force the regeneration of all the certs.
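(Roughly, with placeholder values; flags vary between Katello versions, so check katello-change-hostname --help first:)
katello-change-hostname new-foreman-01.example.com -u admin -p changeme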
Do you have multiple smart proxies running qdrouterd? There is currently a bug in proton that causes all qdrouterd instances to start erroring if any one smart proxy has an SSL issue: https://issues.apache.org/jira/browse/PROTON-1587
What made you think there is some cert hostname mismatch? Have you found any evidence of that?
Updated by Josh Pavel over 6 years ago
The reason I suspect it is that the ca issuer/subject/X509v3 Authority Key Identifier all contain the original hostname.
/etc/pki/katello/qpid_router_server.crt's Issuer is the old hostname; the subject is the new name, and the X509v3 Authority Key Identifier is the old name.
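(That mismatch can be double-checked by dumping the router certificate's subject/issuer and verifying it against the default CA, e.g.:)
openssl x509 -in /etc/pki/katello/qpid_router_server.crt -noout -subject -issuer
openssl verify -CAfile /etc/pki/katello/certs/katello-default-ca.crt \
  /etc/pki/katello/qpid_router_server.crt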
Updated by Jonathon Turel over 6 years ago
Did you observe that behavior after running katello-change-hostname? You may want to give it a try!
Updated by Ewoud Kohl van Wijngaarden over 1 year ago
- Status changed from Need more information to Rejected
Closing because of its age.