Project

General

Profile

Bug #26574

Segmentation Fault - unprooper file descritpor handling (Foreman, Katello, Dynflow or others)

Added by Rafał Ramocki 8 months ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Dynflow
Target version:
-
Difficulty:
Triaged:
No
Bugzilla link:
Pull request:
Fixed in Releases:
Found in Releases:

Description

First of all We (I with my colegue) are not sure if this problem should be assigned to foreman, dynflow, katello, foreman libvirt plugin, libvirt it self or gnutls.

Problem manifests its self by segmentation fault in dynflow during applying errata on systems that are running as QEMU-KVM/LibVirt guests. We are using TLS to secure communication channel between Foreman and Libvirt hypervisors. Segmenation fault is as follows:

> /var/log/foreman/dynflow_executor.output <
/opt/theforeman/tfm/root/usr/share/gems/gems/fog-libvirt-0.4.1/lib/fog/libvirt/compute.rb:102: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0190 p:---- s:0940 e:000939 CFUNC :open
c:0189 p:0146 s:0935 e:000934 METHOD /opt/theforeman/tfm/root/usr/share/gems/gems/fog-libvirt-0.4.1/lib/fog/libvirt/compute.rb:102 [FINISH]
c:0188 p:---- s:0929 e:000928 CFUNC :new

I'm not sure why this job tries to connect libvirt on hypervisor while applying errata on libvirt guests. In line 102 is jus oppening connection to libvirt:

@client = ::Libvirt::open(uri.uri)

I've generated core dump, and gdb shows that segmentation fault is in fact double free in gnutls when libvirt is oppening the connection:

#4 <signal handler called>
#5 GI_libc_free (mem=0xc0) at malloc.c:2941
#6 0x00007f89eec65cc0 in gnutls_certificate_free_credentials (sc=0x7f89ac6dcd90) at gnutls_cert.c:199
#7 0x00007f89f0c2439a in virNetTLSContextNew (cacert=0x7f89acc88940 "/etc/pki/CA/cacert.pem", cacrl=0x7f89ac801980 "/etc/pki/CA/cacrl.pem", cert=0x7f89acb50be0 "/etc/pki/libvirt/clientcert.pem",
key=0x7f89acb4e850 "/etc/pki/libvirt/private/clientkey.pem", x509dnWhitelist=x509dnWhitelist@entry=0x0, priority=priority@entry=0x0, sanityCheckCert=sanityCheckCert@entry=true, requireValidCert=requireValidCert@entry=true,
isServer=isServer@entry=false) at rpc/virnettlscontext.c:762
#8 0x00007f89f0c250d2 in virNetTLSContextNewPath (pkipath=<optimized out>, tryUserPkiPath=<optimized out>, x509dnWhitelist=x509dnWhitelist@entry=0x0, priority=priority@entry=0x0, sanityCheckCert=sanityCheckCert@entry=true,
requireValidCert=requireValidCert@entry=true, isServer=isServer@entry=false) at rpc/virnettlscontext.c:908
#9 0x00007f89f0c25f61 in virNetTLSContextNewClientPath (pkipath=<optimized out>, tryUserPkiPath=<optimized out>, priority=priority@entry=0x0, sanityCheckCert=sanityCheckCert@entry=true, requireValidCert=requireValidCert@entry=true)
at rpc/virnettlscontext.c:937
#10 0x00007f89f0cd48cb in doRemoteOpen (conn=conn@entry=0x7f89ac6b91c0, priv=0x7f89acad2450, driver_str=0x7f89ac2e2df0 "qemu", transport_str=<optimized out>, auth=auth@entry=0x0, conf=conf@entry=0x0, flags=0)
at remote/remote_driver.c:968
#11 0x00007f89f0cd6436 in remoteConnectOpen (conn=0x7f89ac6b91c0, auth=0x0, conf=0x0, flags=<optimized out>) at remote/remote_driver.c:1385
#12 0x00007f89f0ce6207 in virConnectOpenInternal (name=<optimized out>, name@entry=0x7f89acb2ee30 "qemu+tls://blade4s14.hv.internal/system", auth=auth@entry=0x0, flags=flags@entry=0) at libvirt.c:1040

After further investigation my colegue discovered that GNUTLS is unable to read /dev/urandom:

gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: common.c:1964
gnutls3: ASSERT: x509.c:3146
gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: common.c:1964
gnutls3: ASSERT: x509.c:3146
gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: x509_ext.c:2897
gnutls3: ASSERT: x509.c:3140
gnutls3: ASSERT: dn.c:990
gnutls3: ASSERT: common.c:1106
gnutls3: ASSERT: extensions.c:65
gnutls3: ASSERT: name_constraints.c:173
gnutls3: ASSERT: mpi.c:240
gnutls3: ASSERT: x509_ext.c:115
gnutls3: ASSERT: x509.c:1396
gnutls2: Failed to read /dev/urandom: Resource temporarily unavailable
gnutls3: ASSERT: rnd.c:142
gnutls3: ASSERT: rnd.c:343
gnutls3: ASSERT: gnutls_privkey.c:800
gnutls2: _gnutls_check_key_cert_match: failed signing

and further investigation shown that file descriptor (9) that should be pointing to /dev/urandom is in fact pointing to socket connected to database, and that is why gnutls is getting resource temporary unavailable. So we knew that something, at some point of time is closing this file descriptior. Next my colegue found that this FD is closed on startup procedure of dynflow:

[pid 22432] 0.000012 [00007f38280d9fed] write(2, "/opt/theforeman/tfm/root/usr/sha"..., 142/opt/theforeman/tfm/root/usr/share/gems/gems/daemons-1.2.3/lib/daemons/daemonize.rb:75: warning: conflicting chdir during another chdir block
[pid 22432] 0.000022 [00007f38280d9fed] <... write resumed> ) = 142 <0.000037>
[pid 22432] 0.000022 [00007f38280da7c7] chdir("/" <unfinished ...>
[pid 22432] 0.000050 [00007f38280da7c7] <... chdir resumed> ) = 0 <0.000060>
[pid 22432] 0.009264 [00007f3828b0375d] close(8) = 0 <0.000049>
[pid 22432] 0.000125 [00007f3828b0375d] close(7) = 0 <0.000016>
[pid 22432] 0.000076 [00007f3828b0375d] close(11) = 0 <0.000023>
[pid 22432] 0.000617 [00007f3828b03874] fcntl(7, F_GETFL) = -1 EBADF (Bad file descriptor) <0.000016>
[pid 22432] 0.000206 [00007f3828b03874] fcntl(8, F_GETFL) = -1 EBADF (Bad file descriptor) <0.000016>
[pid 22432] 0.000119 [00007f3828b03874] fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) <0.000016>
[pid 22432] 0.000056 [00007f38280df039] ioctl(9, TCGETS, 0x7fffa71989c0) = -1 EINVAL (Invalid argument) <0.000024>
[pid 22432] 0.000088 [00007f3828b0375d] close(9) = 0 <0.000016>
[pid 22432] 0.000058 [00007f3828b03874] fcntl(10, F_GETFL) = 0x8401 (flags O_WRONLY|O_APPEND|O_LARGEFILE) <0.000015>
[pid 22432] 0.000054 [00007f38280df039] ioctl(10, TCGETS, 0x7fffa71989c0) = -1 ENOTTY (Inappropriate ioctl for device) <0.000017>
[pid 22432] 0.000058 [00007f3828b0375d] close(10) = 0 <0.000016>

This leaded us to /opt/theforeman/tfm/root/usr/share/gems/gems/daemons-1.2.3/lib/daemons/daemonize.rb where there is function close_io:

121 def close_io
122 # Make sure all input/output streams are closed
123 # Part I: close all IO objects (except for STDIN/STDOUT/STDERR)
124 ObjectSpace.each_object(IO) do |io|
125 unless [STDIN, STDOUT, STDERR].include?(io)
126 io.close rescue nil
127 end
128 end
129
130 # Make sure all input/output streams are closed
131 # Part II: close all file decriptors (except for STDIN/STDOUT/STDERR)
132 3.upto(8192) do |i|
133 IO.for_fd(i).close rescue nil
134 end
135 end
136 module_function :close_io

This method is called from "call_as_daemon" and "deamonize" methods. When I've commented out both calls start Segmentation fault do not appear any more, and everything works as expected but I'm not sure if there will be any FD leakage. This gem have this login in it even in newest version.

So... to sum this up.
1. I don't know why action on katello results in try to connect to libvirt but if we be able to skip it - so we may have workaround.
2. If we can reinitialize GNUTLS library before each job - we may have a solution or workaround
3. If we can not close all file descriptors inclunding those that were not open by the program its self and that are required to libraries to do their actions, so we may have solution.
4. Others?

Problem is quite complex and I'm not sure where and how it should be solved.

PS: Installed pacckages:
foreman-1.21.2-1.el7.noarch
foreman-bootloaders-redhat-201801241201-2.el7.noarch
foreman-bootloaders-redhat-tftpboot-201801241201-2.el7.noarch
foreman-cli-1.21.2-1.el7.noarch
foreman-compute-1.21.2-1.el7.noarch
foreman-console-1.21.2-1.el7.noarch
foreman-debug-1.21.2-1.el7.noarch
foreman-installer-1.21.2-1.el7.noarch
foreman-installer-katello-1.21.2-1.el7.noarch
foreman.internal-apache-1.0-1.noarch
foreman.internal-foreman-client-1.0-1.noarch
foreman.internal-foreman-proxy-1.0-1.noarch
foreman.internal-foreman-proxy-client-1.0-1.noarch
foreman.internal-puppet-client-1.0-1.noarch
foreman.internal-qpid-broker-1.0-2.noarch
foreman.internal-qpid-client-cert-1.0-1.noarch
foreman.internal-qpid-router-client-1.0-1.noarch
foreman.internal-qpid-router-server-1.0-1.noarch
foreman.internal-tomcat-1.0-1.noarch
foreman-libvirt-1.21.2-1.el7.noarch
foreman-openstack-1.21.2-1.el7.noarch
foreman-postgresql-1.21.2-1.el7.noarch
foreman-proxy-1.21.2-1.el7.noarch
foreman-release-1.21.2-1.el7.noarch
foreman-release-scl-7-2.el7.noarch
foreman-selinux-1.21.2-1.el7.noarch
puppet-foreman_scap_client-0.3.19-1.el7.noarch
rubygem-foreman_maintain-0.3.0-1.el7.noarch
rubygem-foreman_scap_client-0.4.0-1.el7.noarch
tfm-rubygem-foreman_ansible-2.3.2-1.fm1_21.el7.noarch
tfm-rubygem-foreman_ansible_core-2.2.0-1.fm1_21.el7.noarch
tfm-rubygem-foreman_bootdisk-14.0.0-1.fm1_21.el7.noarch
tfm-rubygem-foreman_column_view-0.4.0-3.fm1_20.el7.noarch
tfm-rubygem-foreman_docker-4.1.0-2.fm1_20.el7.noarch
tfm-rubygem-foreman_openscap-0.12.2-1.fm1_21.el7.noarch
tfm-rubygem-foreman_remote_execution-1.7.0-1.fm1_21.el7.noarch
tfm-rubygem-foreman_remote_execution_core-1.1.5-1.el7.noarch
tfm-rubygem-foreman-tasks-0.14.5-1.fm1_21.el7.noarch
tfm-rubygem-foreman-tasks-core-0.2.6-1.fm1_21.el7.noarch
tfm-rubygem-hammer_cli_foreman-0.16.0-1.el7.noarch
tfm-rubygem-hammer_cli_foreman_admin-0.0.8-2.el7.noarch
tfm-rubygem-hammer_cli_foreman_bootdisk-0.1.3-7.el7.noarch
tfm-rubygem-hammer_cli_foreman_discovery-1.0.0-3.el7.noarch
tfm-rubygem-hammer_cli_foreman_docker-0.0.4-4.el7.noarch
tfm-rubygem-hammer_cli_foreman_openscap-0.1.7-1.fm1_21.el7.noarch
tfm-rubygem-hammer_cli_foreman_tasks-0.0.13-1.fm1_20.el7.noarch
tfm-rubygem-hammer_cli_foreman_virt_who_configure-0.0.3-3.el7.noarch
tfm-rubygem-puppetdb_foreman-4.0.0-2.fm1_20.el7.noarch

History

#1 Updated by Lukas Zapletal 8 months ago

  • Category set to Dynflow
  • Project changed from Foreman to foreman-tasks

Moving to dynflow project, can you try to use TCP instead QEMU (I assume it's local UNIX socket)?

#2 Updated by Rafał Ramocki 8 months ago

No it is not local socket. It is TCP with TLS (qemu+tls://_host_/system)

#3 Updated by Rafał Ramocki 8 months ago

You may be interested in bug repport that colegue of mine filled against libvirt. This is try to prevent segmentation fault and memory leak. Fixing it on libvirt side will prevent segmentation faults but will not prevent closing file descriptor of /dev/urandom and putting gnutls library in error state.

Libvirt BUG report: https://bugzilla.redhat.com/show_bug.cgi?id=1699062

PS: While investigating this issue after patching LibVIRT with proposed patch by use we've discovered that in some cases other FD's seams to be ocasionaly unexpectly closed/switched. ex:

On Dynflow side:

Error: PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
: SELECT "taxonomies".* FROM "taxonomies" WHERE "taxonomies"."type" IN ('Organization') AND "taxonomies"."id" = $1 ORDER BY "taxonomies"."title" ASC LIMIT $2

On: Postgres side:

2019-04-11 18:29:00.099 CEST [12425] foreman@foreman FATAL: invalid frontend message type 50

Also available in: Atom PDF