Bug #38510
Bug Report: Candlepin causes 'Idle in Transaction' lock leading to 504 Gateway Timeouts
Description
Foreman Version: 3.14
Katello Version: 4.16
Operating System: RHEL 9.6
Foreman Server FQDN: itmxvlpforaio01.it.cobra.group
Smart Proxy FQDN: itva01vlpfpxsmgt01.intra.cobra.it
Hello Support Team,
We are opening this ticket to report a critical and reproducible issue that causes new client registration to fail.
1. Problem Description
When onboarding a new host via a Smart Proxy, the process systematically fails with a 504 Gateway Timeout error. Our initial analysis confirmed that the Smart Proxy is operational, but its upstream requests to the main Foreman server are timing out.
2. Analysis and Diagnosis
An in-depth investigation on the main Foreman server (itmxvlpforaio01) has revealed that the entire platform is suffering from severe performance degradation caused by a PostgreSQL database lock.
Root Cause Identified:
The Candlepin service, upon every startup, opens a database transaction that is never closed. This connection remains perpetually in an idle in transaction state. This behavior leads to a progressive degradation of database performance (likely due to table bloat and VACUUM being unable to clean up dead tuples), eventually causing complex operations like host registration to time out.
We have confirmed that this is a systemic issue that reappears immediately after every restart of the tomcat service.
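The bloat hypothesis above could be checked directly. The following is a minimal sketch, not part of the evidence we collected: the helper names are hypothetical, and the database names `candlepin` and `foreman` in the usage comments are assumed defaults. As far as we understand PostgreSQL, an open transaction holds back VACUUM only while it has a transaction ID or a snapshot, which pg_stat_activity exposes as backend_xid and backend_xmin.

```shell
# Hypothetical helper: prints a query showing whether each idle-in-transaction
# session holds a transaction ID or snapshot (non-NULL backend_xid / backend_xmin).
vacuum_horizon_sql() {
  cat <<'SQL'
SELECT pid, usename, datname, xact_start, backend_xid, backend_xmin
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
SQL
}

# Hypothetical helper: prints a query listing the tables with the most dead
# tuples in the connected database, with the time of their last (auto)vacuum.
dead_tuples_sql() {
  cat <<'SQL'
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
SQL
}

# Usage (as the postgres user; database names are assumptions):
#   vacuum_horizon_sql | psql -x
#   dead_tuples_sql | psql -d candlepin
#   dead_tuples_sql | psql -d foreman
```

If backend_xid and backend_xmin are both NULL for the Candlepin session, the dead-tuple explanation would need to be revisited; large n_dead_tup values with an old last_autovacuum would support it.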
3. Technical Evidence Collected
We have isolated the exact query and gathered irrefutable evidence demonstrating the nature of this bug.
A. Database State after Candlepin Startup:
The output from foreman-maintain service status clearly shows the stuck connection. A direct psql query confirms this:
[postgres@itmxvlpforaio01 ~]$ psql -c "SELECT pid, usename, state FROM pg_stat_activity WHERE state = 'idle in transaction';"
  pid  |  usename  |        state
-------+-----------+---------------------
 18748 | candlepin | idle in transaction
(1 row)
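To show how long such a session has kept its transaction open, the same view can be queried with a few more columns. This is a minimal sketch under the assumption that it is run as the postgres user, as above; the helper name idle_in_txn_sql is hypothetical.

```shell
# Hypothetical helper: prints a query listing sessions that are idle in
# transaction, with the age of the open transaction, oldest first.
idle_in_txn_sql() {
  cat <<'SQL'
SELECT pid, usename, datname, application_name, client_addr,
       xact_start,
       now() - xact_start   AS xact_age,
       now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
SQL
}

# Usage (as the postgres user): idle_in_txn_sql | psql -x
```

An xact_age that keeps growing and roughly matches the time since the last tomcat restart would be consistent with a transaction opened at startup and never closed.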
B. Exact Offending Query Isolated:
By querying pg_stat_activity for the identified PID, we isolated the last command executed by the stuck connection. It is a standard Liquibase schema check query performed at startup.
[postgres@itmxvlpforaio01 ~]$ psql -c "SELECT query FROM pg_stat_activity WHERE pid = 18748;"
query
------------------------------------------------------------------------------------
SELECT * FROM public.databasechangelog ORDER BY DATEEXECUTED ASC, ORDEREXECUTED ASC
(1 row)
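Which locks the stuck connection actually holds, and whether any other session is waiting on it, could be inspected through pg_locks and pg_blocking_pids(). The sketch below is an assumption-laden illustration rather than output we captured: the helper name locks_for_pid_sql is hypothetical, and the query should be run while connected to the database that the PID is using (assumed here to be `candlepin`) so that relation names resolve.

```shell
# Hypothetical helper: prints two queries for a given backend PID:
#   1. the locks that backend holds or awaits,
#   2. the sessions currently blocked by that backend.
locks_for_pid_sql() {
  # Accept only a plain numeric PID, since it is interpolated into the SQL.
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: locks_for_pid_sql <numeric pid>" >&2
      return 1
      ;;
  esac
  cat <<SQL
SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = $1;

SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 60) AS query
FROM pg_stat_activity
WHERE $1 = ANY (pg_blocking_pids(pid));
SQL
}

# Usage (as the postgres user; database name is an assumption):
#   locks_for_pid_sql 18748 | psql -d candlepin
```

If the second query returns no rows, no session is waiting directly on this connection, and the slowdown would have to come from an indirect effect rather than from lock contention.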
This proves the bug is triggered during Candlepin's initialization phase.
C. Supporting Evidence from Candlepin/Tomcat Logs:
The tomcat logs generated during Candlepin's startup contain multiple warnings that point to a bug in the transaction manager (JpaLocalTxnInterceptor), which corroborates our findings.
Jun 19 17:06:19 itmxvlpforaio01.it.cobra.group server2026: 19-Jun-2025 17:06:19.858 WARNING [main] com.google.inject.internal.ProxyFactory.<init> Method [public void org.candlepin.model.ConsumerCurator.delete(org.candlepin.model.Persisted)] is synthetic and is being intercepted by [com.google.inject.persist.jpa.JpaLocalTxnInterceptor@8b79b]. This could indicate a bug. The method may be intercepted twice, or may not be intercepted at all.
4. Conclusion
The collected evidence demonstrates that there is a bug in the Candlepin startup routine. The transaction used for the Liquibase schema validation is not being closed correctly, causing a persistent lock that degrades system-wide performance until the platform becomes unusable. The issue persists even after updating all system packages to the latest available versions and rebooting.
We request your intervention for a permanent resolution via a software patch.
We are available to provide any further logs or to perform additional tests as needed.
Best regards.