Bug #8565
Closed
report::expire is running very slowly
Added by Chuck Schweizer over 10 years ago.
Updated almost 7 years ago.
Description
The fix included in #1592 has made report::expire very slow.
Expire 5-day-old reports (runs every 30 minutes)
Pre-fix: 2.5 minutes
Post-fix: 21 minutes
Expire 1-day-old non-eventful reports (runs every 30 minutes)
Pre-fix: 2.5 minutes
Post-fix: 21 minutes
Most of the time seems to be spent in Postgres deleting the records in the post-fix report.rb. I will be working on getting more data added to this ticket.
4,000 nodes running Puppet once per hour.
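For reference, the two jobs above presumably correspond to cron entries invoking the standard rake task shown later in this ticket (the status=0 filter on the second job, for eventless reports, is an assumption based on the task's documented options):

foreman-rake reports:expire days=5 RAILS_ENV=production
foreman-rake reports:expire days=1 status=0 RAILS_ENV=production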
- Related to Bug #1592: report::expire errors with millions of records added
- Category set to Reporting
- Release set to 29
- Related to Bug #8316: Removing a long-standing host creates huge transaction added
- Release deleted (29)
Here is where the report expiry is slow:
foreman | foreman | 11783 | | f | 2014-12-10 13:52:56.117167+00 | DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ))
foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ));
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on messages (cost=38070.64..323418103.91 rows=12289 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=38070.64..60375.40 rows=1603876 width=4)
-> Seq Scan on logs (cost=0.00..30200.76 rows=1603876 width=4)
(5 rows)
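To put that plan in perspective: the SubPlan is not hashed, so the materialized ~1.6M-row list of message_ids from logs is rescanned for every candidate row in messages; that per-row rescan is what pushes the estimated cost to ~3.2e8 and the wall clock to ~21 minutes.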
EXPLAIN with DISTINCT:
foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on messages (cost=34343.70..36100.92 rows=12289 width=6)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=34314.32..34337.82 rows=2350 width=4)
-> Seq Scan on logs (cost=0.00..30292.46 rows=1608746 width=4)
(5 rows)
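With DISTINCT the subquery result is deduplicated up front, so the planner switches to a hashed SubPlan and each messages row becomes a cheap hash probe instead of a rescan of the materialized list. As a side note (an equivalent rewrite, not the fix applied here), the same cleanup can be expressed as a NOT EXISTS anti-join, which PostgreSQL can plan without relying on the subplan-hashing heuristic and which also sidesteps NOT IN's NULL semantics, where a single NULL message_id prevents any rows from being deleted:

DELETE FROM "messages" WHERE NOT EXISTS
    (SELECT 1 FROM "logs" WHERE "logs".message_id = "messages".id);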
- Status changed from New to Assigned
- Assignee set to Dominic Cleal
- Status changed from Assigned to Ready For Testing
- Pull request https://github.com/theforeman/foreman/pull/2008 added
- Pull request deleted ()
Much better with your fix. Down to 12 seconds.
foreman=# explain analyze DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Seq Scan on messages (cost=34423.41..36180.64 rows=12289 width=6) (actual time=699.123..710.044 rows=19 loops=1)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=34394.04..34417.54 rows=2350 width=4) (actual time=670.766..680.408 rows=28760 loops=1)
-> Seq Scan on logs (cost=0.00..30362.83 rows=1612483 width=4) (actual time=0.055..312.759 rows=1613655 loops=1)
Total runtime: 710.908 ms
(6 rows)
$ time ./foreman-rake reports:expire days=5 RAILS_ENV='production'
real 0m12.144s
user 0m8.497s
sys 0m0.924s
- Related to Refactor #8798: Reports.expire uses an unnecessary join on logs added
Applying the patch attached to the PR gets our report expiry for ~750,000 reports into the 30-second range.
- Status changed from Ready For Testing to Closed
- % Done changed from 0 to 100
- Release set to 32
Running 1.7.4, expiring reports is still quite slow (PostgreSQL 9.3.x). It runs for longer than a day, so we run into situations where multiple expire jobs are running at once, hogging the CPU at 100%. The plan looks like this:
foreman=> select count(*) from reports;
count
-------
1951
(1 row)
foreman=> select count(*) from messages;
count
--------
704462
(1 row)
foreman=> select count(*) from logs;
count
--------
802815
(1 row)
foreman=> explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id FROM "logs" ));
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Delete on messages (cost=0.42..9806323789.43 rows=352231 width=6)
-> Seq Scan on messages (cost=0.42..9806323789.43 rows=352231 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.42..27142.52 rows=279366 width=4)
-> Unique (cost=0.42..24653.69 rows=279366 width=4)
-> Index Only Scan using index_logs_on_message_id on logs (cost=0.42..22646.65 rows=802815 width=4)
(7 rows)
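One observation on that plan (a guess, not a confirmed diagnosis): it falls back to a plain Materialize SubPlan instead of the hashed SubPlan from the earlier EXPLAINs, and PostgreSQL only hashes a NOT IN subquery when the estimated hash table fits in work_mem. It may be worth checking whether raising it for the session restores the hashed plan:

foreman=> SET work_mem = '64MB';
foreman=> EXPLAIN DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id FROM "logs" ));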
There's this blog post, which outlines a possible alternative. Another would be to use foreign keys together with ON DELETE CASCADE.
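A minimal sketch of the foreign-key idea, assuming the stock schema in which logs references reports through a report_id column (the constraint name is made up). Expired reports would then take their log rows with them in a single statement, though orphaned messages would still need a separate cleanup pass:

ALTER TABLE logs ADD CONSTRAINT logs_report_id_fk
    FOREIGN KEY (report_id) REFERENCES reports (id) ON DELETE CASCADE;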
It's probably best if you can file a new issue so we can identify the cause there, Dirk. Is it definitely that query that's taking the time? EXPLAIN ANALYZE might work better to check.
There are a few, and a different one was identified recently in #10228 where a missing index was slowing down MySQL (at least).
Yes, it is that query. I found out by running
SELECT * FROM pg_stat_activity;
while connected to the foreman database. And no, I can't run EXPLAIN ANALYZE, as it actually executes the query, which would get me into trouble again.
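For what it's worth, EXPLAIN ANALYZE on a DELETE can be wrapped in a transaction and rolled back, so the deletions are never committed (the delete really does run inside the transaction, so a quiet window is advisable):

foreman=> BEGIN;
foreman=> EXPLAIN ANALYZE DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id FROM "logs" ));
foreman=> ROLLBACK;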
Regarding #10228: I wonder how an index on source_id would help accelerate a query on message_id.
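(One plausible reading, an assumption from the pattern above rather than anything confirmed here: the expiry runs an analogous orphan cleanup against each deduplicated table, e.g. DELETE FROM "sources" WHERE (id not IN (SELECT DISTINCT source_id FROM "logs" )), and on MySQL that subquery can be re-evaluated per row, so a missing index on logs.source_id hurts there.)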
Anyway, created #10329 as requested.