Project

General

Profile

Actions

Bug #25342

open

Salt run breaks host if foreman-node fails to get grains.

Added by Danny S over 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
Triaged:
No
Fixed in Releases:
Found in Releases:

Description

When provisioning multiple hosts at once, the foreman (server) host can become very busy for a short period of time. In that case it sometimes occurs that the foreman-node command fails with the error:

2018-10-25 15:28:40,855 [salt.loaded.int.pillar.puppet][CRITICAL][1637] YAML data from /usr/bin/foreman-node failed to parse
2018-10-25 15:28:40,859 [salt.loaded.int.module.cmdmod][ERROR   ][1639] Command '/usr/bin/foreman-node <host>' failed with return code: 1
2018-10-25 15:28:40,859 [salt.loaded.int.module.cmdmod][ERROR   ][1639] output: Could not get grains: undefined method `each_pair' for nil:NilClass

or:
2018-10-25 15:30:30,656 [salt.loaded.int.pillar.puppet][CRITICAL][1581] YAML data from /usr/bin/foreman-node failed to parse
2018-10-26 14:22:35,661 [salt.loaded.int.module.cmdmod][ERROR   ][1677] Command '/usr/bin/foreman-node <host>' failed with return code: 1
2018-10-26 14:22:35,662 [salt.loaded.int.module.cmdmod][ERROR   ][1677] output: Couldn't retrieve ENC data: execution expired

In that case the foreman-node command fails to provide the foreman parameters as pillar data to the salt run. However salt simply continues to run without that data, it does not fail!

If there are conditions in the salt states that do depend on the foreman parameters, the host is configured in a completely different state than expected.

Moreover, salt may just report that all states are successfully executed, so there is no indication that something went wrong. That shouldn't happen in an error situation.

More detailed analysis

The issue not that easy to reproduce, as it requires the foreman host to become very busy.

However analysis showed that whenever a salt run is triggered, one of the first things that occur is that the foreman-node command is called. This command then does two things:

  1. Translate salt grains into foreman facts and performs a HTTP POST to foreman uri path: /api/hosts/facts
  2. Retrieves the foreman ENC parameters using a HTTP GET to the foreman uri path: /salt/node/<host>?format=yml

When either the resources of the foreman host are limited or when the foreman host becomes very busy, it takes such an amount of time that timeouts start to occur. There is a configurable timeout of 10 seconds in the configuration (/etc/salt/foreman.yml) and if a HTTP request from the foreman-node command to the foreman core takes up more than the timeout, it aborts with no data. Furthermore, the salt run ignores the failure and simply continues with an unexpected outcome, will still reporting that everything finished without issues (which is very misleading).

See a part of the foreman production log, which shows the handling of the HTTP requests from the foreman-node command, both in a normally situation and a faulty situation:

Normal (two HTTP requests are shown and handled in time):

2018-10-26 10:10:45 7fa93d77 [app] [I] Started POST "/api/hosts/facts" for 192.168.1.2 at 2018-10-26 10:10:45 +0000
2018-10-26 10:10:45 7fa93d77 [app] [I] Processing by Api::V2::HostsController#facts as JSON
2018-10-26 10:10:45 7fa93d77 [app] [I]   Parameters: {"name"=>"myhost.mydomain", "facts"=>"[FILTERED]", "apiv"=>"v2", :host=>{"name"=>"myhost.mydomain"}}
2018-10-26 10:10:45 7fa93d77 [audit] [I] [myhost.mydomain] deleted 0 (9.9ms)
2018-10-26 10:10:46 7fa93d77 [audit] [I] [myhost.mydomain] updated 1 (1222.8ms)
2018-10-26 10:10:46 7fa93d77 [app] [I] Import facts for 'myhost.mydomain' completed. Added: 0, Updated: 1, Deleted 0 facts
2018-10-26 10:10:46 7fa93d77 [app] [I] Completed 201 Created in 1380ms (Views: 8.6ms | ActiveRecord: 116.7ms)
2018-10-26 10:10:47 c56b1614 [app] [I] Started GET "/salt/node/myhost.mydomain?format=yml" for 192.168.1.2 at 2018-10-26 10:10:47 +0000
2018-10-26 10:10:47 c56b1614 [app] [I] Processing by ForemanSalt::MinionsController#node as YML
2018-10-26 10:10:47 c56b1614 [app] [I]   Parameters: {"id"=>"myhost.mydomain"}
2018-10-26 10:10:47 c56b1614 [app] [I]   Rendered text template (0.0ms)
2018-10-26 10:10:47 c56b1614 [app] [I] Completed 200 OK in 117ms (Views: 0.7ms | ActiveRecord: 14.2ms)

Faulty (one HTTP request that takes too long):

2018-10-25 15:27:09 d25243ea [app] [I] Started POST "/api/hosts/facts" for 192.168.1.2 at 2018-10-25 15:27:09 +0000
2018-10-25 15:27:10 d25243ea [app] [I] Processing by Api::V2::HostsController#facts as JSON
2018-10-25 15:27:10 d25243ea [app] [I]   Parameters: {"name"=>"myhost.mydomain", "facts"=>"[FILTERED]", "apiv"=>"v2", :host=>{"name"=>"myhost.mydomain"}}
2018-10-25 15:27:11 d25243ea [audit] [I] [myhost.mydomain] deleted 0 (78.3ms)
2018-10-25 15:27:19 d25243ea [audit] [I] [myhost.mydomain] updated 1 (7820.5ms)
2018-10-25 15:27:19 d25243ea [app] [I] Import facts for 'myhost.mydomain' completed. Added: 0, Updated: 1, Deleted 0 facts

The faulty scenario shows that the first HTTP request takes about 10 seconds, which is too long for the foreman-node command. Therefore the foreman-node command exits and doesn't even attempt the second request anymore.

No data to display

Actions

Also available in: Atom PDF