Bug #9098
closedPuppetCA removes certificates for VMs after a few days
Description
I have foreman creating VMs on Openstack. After a few days of that VM being up, the PuppetCA will somehow lose or delete the certificate for the VM. This causes the VM to get out of sync. I can fix it by deleting the certificate on the VM and requesting a new one, then approving it. But it should not be removing it at all.
I can not see anything in the proxy log file or production log file as to why it is removing the cert.
Updated by Dominic Cleal almost 10 years ago
If you don't see anything in the proxy.log (specifically a DELETE /puppet/ca/certname) then I don't think I see how this is Foreman?
Updated by Brian Lee almost 10 years ago
What else could it be? I just went through the log files tracing a host I build on friday which is now out of sync. In the proxy log I can see it being created and the autosign being deleted on the 23rd, which is the last of the proxy log. In the production log I see it checking in and reports happening until the 25th at 03:25. After that there is nothing in the log file, until I open up foreman and went to the host details.
The certs are signed for 5 years. The bare metal hosts do not have this problem, and it looks like windows VMs don't as well. Could it be something with the puppet client?
Updated by Dominic Cleal almost 10 years ago
- Status changed from New to Feedback
I don't know what it could be. For the record, another place to look is at your sudoers logs (e.g. auth.log) as by default, foreman-proxy runs all puppet cert/ca commands through sudo, so it'll be logged there too.
Updated by Brian Lee almost 10 years ago
There is no auth.log file, sudo command should be logged to messages. Going through those logs I see the host checking again at 03:25 on the 25th. Then nothing after that. I do see the puppet master starting at 03:48 on the 25th with this error: puppet-master13256: Config file /etc/puppet/hiera.yaml not found, using Hiera defaults
After that there is no checkins from the VM, only bare metal hosts. Why is the puppet master restarting?
Updated by Dominic Cleal almost 10 years ago
Brian Lee wrote:
There is no auth.log file, sudo command should be logged to messages.
Well, replace with whatever it is on your platform. EL uses /var/log/secure.
Going through those logs I see the host checking again at 03:25 on the 25th. Then nothing after that. I do see the puppet master starting at 03:48 on the 25th with this error: puppet-master13256: Config file /etc/puppet/hiera.yaml not found, using Hiera defaults
After that there is no checkins from the VM, only bare metal hosts. Why is the puppet master restarting?
Daily log rotation perhaps? Or Passenger spawning a new instance due to increased load?
Updated by Brian Lee almost 10 years ago
We currently have a very light load on the environment, no activity was happened with the exception of me doing some work on the VM itself. The logs do role, but that wouldn't restart the puppet master, would it?
The /var/log/secure file has no activity on the 25th at all either. No sudo commands were run around the time that the host stops checking in.
Updated by Dominic Cleal almost 10 years ago
Brian Lee wrote:
We currently have a very light load on the environment, no activity was happened with the exception of me doing some work on the VM itself. The logs do role, but that wouldn't restart the puppet master, would it?
Check the logrotate configs, I'd expect them to send a signal. I don't know how that interacts with running the master under Passenger, it may not.
The /var/log/secure file has no activity on the 25th at all either. No sudo commands were run around the time that the host stops checking in.
Then I don't think it can be Foreman's smart proxy.
Updated by Brian Lee almost 10 years ago
Should the certs stay in the /var/lib/puppet/ca/signed/ directory after the VM is up and running?
After building a VM I am noticing, what I think is a extra delete request for the cert:
10.32.109.100 - - [26/Jan/2015 14:59:09] "POST / HTTP/1.1" 200 - 0.0623
10.32.109.100 - - [26/Jan/2015 14:59:09] "POST / HTTP/1.1" 200 - 0.0621
E, [2015-01-26T15:00:06.128140 #2341] ERROR -- : Attempt to remove nonexistant client certificate for 8f4a7790-a50 d-45a5-abff-56421d13c99d
10.32.109.100 - - [26/Jan/2015 15:00:06] "DELETE /8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 404 89 1.5832
10.32.109.100 - - [26/Jan/2015 15:00:06] "POST /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0.00 11
10.32.109.100 - - [26/Jan/2015 15:01:50] "DELETE /8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 1.5144
10.32.109.100 - - [26/Jan/2015 15:01:50] "POST /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0.00 10
10.32.109.100 - - [26/Jan/2015 15:01:59] "DELETE /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0. 0476
This one build was the only thing going on, and that cert is not in the signed directory. Maybe its cached in the puppet master service and when it restarts its lost and causes the sync issue.
Updated by Dominic Cleal almost 10 years ago
Brian Lee wrote:
Should the certs stay in the /var/lib/puppet/ca/signed/ directory after the VM is up and running?
Yes, it should be there. If it's been revoked then I think it will still be listed in /var/lib/puppet/ca/inventory.txt, but if you did an openssl crl -in /var/lib/puppet/ssl/ca/ca_crl.pem -noout -text
then you will see a list of "Revoked Certificates" with their serial numbers - compare that against the inventory.
One thought - if if Apache is restarted (I'm thinking logrotate again), it will re-read the CRL - so it could be just a delayed reaction.
After building a VM I am noticing, what I think is a extra delete request for the cert:
10.32.109.100 - - [26/Jan/2015 14:59:09] "POST / HTTP/1.1" 200 - 0.0623
10.32.109.100 - - [26/Jan/2015 14:59:09] "POST / HTTP/1.1" 200 - 0.0621
E, [2015-01-26T15:00:06.128140 #2341] ERROR -- : Attempt to remove nonexistant client certificate for 8f4a7790-a50 d-45a5-abff-56421d13c99d
10.32.109.100 - - [26/Jan/2015 15:00:06] "DELETE /8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 404 89 1.5832
10.32.109.100 - - [26/Jan/2015 15:00:06] "POST /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0.00 11
10.32.109.100 - - [26/Jan/2015 15:01:50] "DELETE /8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 1.5144
10.32.109.100 - - [26/Jan/2015 15:01:50] "POST /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0.00 10
10.32.109.100 - - [26/Jan/2015 15:01:59] "DELETE /autosign/8f4a7790-a50d-45a5-abff-56421d13c99d HTTP/1.1" 200 - 0. 0476
Yeah, that is an additional request and a likely suspect. Perhaps you can try and compare it against /var/log/foreman/production.log to figure out which request is triggering it, and against the client's own logs to figure out whether the second DELETE request is after the first Puppet run where it's signed?
Foreman should only call the pair of "DELETE, POST /autosign" requests when:
- the user data template is rendered for image provisioning (when user data is ticked on the image)
- before running the finish script over SSH for image provisioning (when user data is unticked)
- with network/PXE provisioning, the "provision" (preseed/kickstart) template is requested
When the "built" request (used in network provisioning or with user data) is made at the end of the provisioning process, the "DELETE /autosign" request is made.
Are you using user data or finish scripts with OpenStack? Maybe one of the scripts is also requesting another script from Foreman, which triggers the Puppet CA again?
Updated by Brian Lee almost 10 years ago
Here is a pastebin of the production log around the same time as the above. I dont see anything out of the ordinary. http://pastebin.com/VLTvu25r
I am using finish scripts and user-data with the VMs in openstack. At the end of the script I have to do this:
sudo /usr/bin/puppet agent --config /etc/puppet/puppet.conf --onetime --tags no_such_tag --server <%= @host.puppetmaster > --no-daemonize
sudo /usr/bin/wget --quiet --output-document=/dev/null --no-check-certificate <= foreman_url %>
sudo /usr/bin/puppet agent --test
Is the built checkin the problem? should it happen after puppet runs?
Updated by Dominic Cleal almost 10 years ago
Was that from a finish script or a user data file? If it's a finish script then you don't need the wget/built call as Foreman blocks on script execution and will mark it as built once it's finished.
Do you have anything before that in the logs? I see a report coming in there, which could be one of the two Puppet runs.
Updated by Brian Lee almost 10 years ago
That was from a finish script. I ran a test and removed that line and it seems the certs are now staying.
I am running the same command in a user-data for the windows VMs. Should that be included in it?
Updated by Dominic Cleal almost 10 years ago
Brian Lee wrote:
That was from a finish script. I ran a test and removed that line and it seems the certs are now staying.
Ah, excellent.
I am running the same command in a user-data for the windows VMs. Should that be included in it?
Yes, keep the command in the user-data script.
Updated by Brian Lee almost 10 years ago
Okay, everything looks good. This can be closed out do to user error.
Updated by Dominic Cleal almost 10 years ago
- Status changed from Feedback to Resolved
Glad it's all working, thanks Brian.