Bug #8565

report::expire is running very slowly

Added by Chuck Schweizer over 4 years ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Category: Reporting

Description

The fix included in #1592 has made report::expire very slow.

Expire 5 day old reports (runs every 30 minutes)
Pre fix: 2.5 minutes
Post fix: 21 minutes

Expire 1 day old non eventful reports (runs every 30 minutes)
Pre fix: 2.5 minutes
Post fix: 21 minutes

Most of the time seems to be spent in Postgres deleting the records with the post-fix report.rb. I will work on adding more data to this ticket.


Related issues

Related to Foreman - Bug #1592: report::expire errors with millions of records (Closed, 2012-04-30)
Related to Foreman - Bug #8316: Removing a long-standing host creates huge transaction (Closed, 2014-11-07)
Related to Foreman - Refactor #8798: Reports.expire uses an unnecessary join on logs (New, 2014-12-28)

Associated revisions

Revision 9fde85ad (diff)
Added by Dominic Cleal over 4 years ago

fixes #8565 - only get unique, unused message/source IDs to speed up expiry

Revision e79ea15d (diff)
Added by Dominic Cleal over 4 years ago

fixes #8565 - only get unique, unused message/source IDs to speed up expiry

(cherry picked from commit 9fde85ad830031d38a6786d30a48519344dad446)

History

#1 Updated by Chuck Schweizer over 4 years ago

4,000 nodes running puppet once per hour.

#2 Updated by Dominic Cleal over 4 years ago

  • Related to Bug #1592: report::expire errors with millions of records added

#3 Updated by Dominic Cleal over 4 years ago

  • Category set to Reporting
  • Legacy Backlogs Release (now unused) set to 29

#4 Updated by Dominic Cleal over 4 years ago

The most useful info would be debug logs while running the rake task: http://projects.theforeman.org/projects/foreman/wiki/Troubleshooting#How-do-I-enable-debugging

The SQL logs would then show the time taken for each query, and possibly even EXPLAIN data if they appear to take long enough. If not, then we could try re-running some queries with EXPLAIN.

#5 Updated by Dominic Cleal over 4 years ago

  • Related to Bug #8316: Removing a long-standing host creates huge transaction added

#6 Updated by Dominic Cleal over 4 years ago

  • Legacy Backlogs Release (now unused) deleted (29)

#7 Updated by Chuck Schweizer over 4 years ago

Here is where the report expiry is slow:

 foreman | foreman  |   11783 |             | f       | 2014-12-10 13:52:56.117167+00 | DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ))

 foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT message_id FROM "logs" ));
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on messages  (cost=38070.64..323418103.91 rows=12289 width=6)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=38070.64..60375.40 rows=1603876 width=4)
           ->  Seq Scan on logs  (cost=0.00..30200.76 rows=1603876 width=4)
(5 rows)

#8 Updated by Chuck Schweizer over 4 years ago

EXPLAIN with DISTINCT:

foreman=# explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on messages  (cost=34343.70..36100.92 rows=12289 width=6)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  HashAggregate  (cost=34314.32..34337.82 rows=2350 width=4)
           ->  Seq Scan on logs  (cost=0.00..30292.46 rows=1608746 width=4)
(5 rows)
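The two queries above are semantically equivalent; adding DISTINCT only changes the plan (a hashed subplan instead of a materialized sequential scan per row), not the result set. A minimal sqlite3 sketch with hypothetical data (not the Foreman schema) showing which rows the predicate deletes:

```python
import sqlite3

# Minimal stand-in for the messages/logs tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE logs (id INTEGER PRIMARY KEY, message_id INTEGER);
INSERT INTO messages (id, value) VALUES (1, 'used'), (2, 'orphan'), (3, 'used');
-- message 1 is referenced twice, message 3 once, message 2 not at all
INSERT INTO logs (message_id) VALUES (1), (1), (3);
""")

# Same shape as the expiry query: a message survives iff some log row
# references it. DISTINCT shrinks the subquery result but cannot change it.
conn.execute(
    "DELETE FROM messages WHERE id NOT IN (SELECT DISTINCT message_id FROM logs)"
)
remaining = [row[0] for row in conn.execute("SELECT id FROM messages ORDER BY id")]
print(remaining)  # only the orphan (id 2) is deleted -> [1, 3]
```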

#9 Updated by Dominic Cleal over 4 years ago

  • Status changed from New to Assigned
  • Assignee set to Dominic Cleal

#10 Updated by The Foreman Bot over 4 years ago

  • Status changed from Assigned to Ready For Testing
  • Pull request https://github.com/theforeman/foreman/pull/2008 added
  • Pull request deleted ()

#11 Updated by Chuck Schweizer over 4 years ago

Much better with your fix. Down to 12 seconds.

foreman=# explain analyze DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id from "logs" ));
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on messages  (cost=34423.41..36180.64 rows=12289 width=6) (actual time=699.123..710.044 rows=19 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  HashAggregate  (cost=34394.04..34417.54 rows=2350 width=4) (actual time=670.766..680.408 rows=28760 loops=1)
           ->  Seq Scan on logs  (cost=0.00..30362.83 rows=1612483 width=4) (actual time=0.055..312.759 rows=1613655 loops=1)
 Total runtime: 710.908 ms
(6 rows)
$ time ./foreman-rake reports:expire days=5 RAILS_ENV='production'

real    0m12.144s
user    0m8.497s
sys    0m0.924s

#12 Updated by Dominic Cleal over 4 years ago

  • Related to Refactor #8798: Reports.expire uses an unnecessary join on logs added

#13 Updated by Martin Jackson over 4 years ago

Applying the patch attached to the PR gets our report expiry for ~750,000 reports into the 30-second range.

#14 Updated by Dominic Cleal over 4 years ago

  • Status changed from Ready For Testing to Closed
  • % Done changed from 0 to 100

#15 Updated by Dominic Cleal over 4 years ago

  • Legacy Backlogs Release (now unused) set to 32

#16 Updated by Dirk Heinrichs about 4 years ago

Running 1.7.4, expiring reports is still quite slow (PostgreSQL 9.3.x). It runs for longer than a day, so we end up with multiple expire jobs running at once, hogging the CPU at 100%. The plan looks like this:

foreman=> select count(*) from reports;
 count
-------
  1951
(1 row)

foreman=> select count(*) from messages;
 count
--------
 704462
(1 row)

foreman=> select count(*) from logs;
 count
--------
 802815
(1 row)

foreman=> explain DELETE FROM "messages" WHERE (id not IN (SELECT DISTINCT message_id FROM "logs" ));
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Delete on messages  (cost=0.42..9806323789.43 rows=352231 width=6)
   ->  Seq Scan on messages  (cost=0.42..9806323789.43 rows=352231 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.42..27142.52 rows=279366 width=4)
                 ->  Unique  (cost=0.42..24653.69 rows=279366 width=4)
                       ->  Index Only Scan using index_logs_on_message_id on logs  (cost=0.42..22646.65 rows=802815 width=4)
(7 rows)

There's this blog post, which outlines a possible alternative. Another one would be to use foreign keys together with "on delete cascade".
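The "foreign keys with ON DELETE CASCADE" idea means the child rows go away in the same statement that deletes the parent, rather than via a separate orphan-cleanup DELETE. A minimal sqlite3 sketch of the mechanism (hypothetical schema, not Foreman's actual migrations):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE reports (id INTEGER PRIMARY KEY);
CREATE TABLE logs (
    id INTEGER PRIMARY KEY,
    report_id INTEGER REFERENCES reports(id) ON DELETE CASCADE
);
INSERT INTO reports (id) VALUES (1), (2);
INSERT INTO logs (report_id) VALUES (1), (1), (2);
""")

# Expiring report 1 removes its two log rows automatically via the cascade;
# no separate "delete logs whose report is gone" pass is needed.
conn.execute("DELETE FROM reports WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # -> 1
```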

#17 Updated by Dominic Cleal about 4 years ago

It's probably best if you can file a new issue so we can identify the cause there, Dirk. Is it definitely that query that's taking the time? EXPLAIN ANALYZE might work better to check.

There are a few, and a different one was identified recently in #10228 where a missing index was slowing down MySQL (at least).

#18 Updated by Dirk Heinrichs about 4 years ago

Yes, it is that query. I found this out by running

SELECT * FROM pg_stat_activity;

while connected to the foreman database. And no, I can't run EXPLAIN ANALYZE, as it actually executes the query, which would get me in trouble again.
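On PostgreSQL the usual workaround is to wrap the statement in a transaction (`BEGIN; EXPLAIN ANALYZE DELETE ...; ROLLBACK;`): the DELETE really executes, but the rollback restores the rows (locks are still held for the duration, so it is not free on a busy table). A minimal sqlite3 sketch of the rollback behaviour that workaround relies on (hypothetical table, not the Foreman schema):

```python
import sqlite3

# isolation_level=None gives manual transaction control, so BEGIN/ROLLBACK
# below are issued exactly as written rather than managed implicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO messages (id) VALUES (?)", [(1,), (2,), (3,)])

conn.execute("BEGIN")
conn.execute("DELETE FROM messages")  # destructive statement under test
conn.execute("ROLLBACK")              # undo it; the rows come back

print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])  # -> 3
```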

Regarding #10228: I wonder how an index on source_id would help accelerate a query on message_id.

Anyway, I created #10329 as requested.
