Project

General

Profile

Bug #12193

Deadlock occuring when creating host

Added by Stefan Julin over 4 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Database
Target version:
Difficulty:
Triaged:
Bugzilla link:
Fixed in Releases:
Found in Releases:

Description

The following happens quite often when trying to insert to nics table when creating a new host:

2015-10-15 11:55:40 [app] [W] Action failed
 | ActiveRecord::StatementInvalid: PGError: ERROR:  deadlock detected
 | DETAIL:  Process 15662 waits for ExclusiveLock on tuple (8,16) of relation 16
559 of database 16384; blocked by process 20700.
 | Process 20700 waits for ShareLock on transaction 299626203; blocked by proces
s 15662.
 | HINT:  See server log for query details.
 | : UPDATE "domains" SET "hosts_count" = COALESCE("hosts_count", 0) + 1 WHERE " 
domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2
 ORDER BY domains.name)
 | /opt/rh/ruby193/root/usr/share/gems/gems/activerecord-3.2.8/lib/active_record
/connection_adapters/postgresql_adapter.rb:1158:in `async_exec'
..
/associations/builder/belongs_to.rb:30:in `block in add_counter_cache_callbacks'
 | /usr/share/foreman/app/models/nic/base.rb:61:in `belongs_to_counter_cache_after_create_for_domain'
... 
 | /usr/share/foreman/app/models/concerns/foreman/sti.rb:29:in `save_with_type'
 | /usr/share/foreman/app/controllers/hosts_controller.rb:84:in `create'
...
 | /usr/share/foreman/app/controllers/concerns/application_shared.rb:13:in `set_
timezone'
[2015-10-14 07:43:19.739 CEST] [12290]: LOG:  process 12290 still waiting for ShareLock on transaction 299488065 after 1000.100 ms
[2015-10-14 07:43:19.739 CEST] [12290]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."operatingsystems" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
...
[2015-10-14 07:46:06.028 CEST] [10305]: LOG:  process 10305 still waiting for ShareLock on transaction 299488065 after 1000.123 ms
[2015-10-14 07:46:06.028 CEST] [10305]: STATEMENT:  UPDATE "domains" SET "hosts_count" = COALESCE("hosts_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)
...
[2015-10-14 07:50:21.695 CEST] [12248]: DETAIL:  Process 12248 waits for ShareLock on transaction 299488070; blocked by process 10305.
        Process 10305 waits for ShareLock on transaction 299488101; blocked by process 12248.
        Process 12248: INSERT INTO "nics" ("attached_devices", "attached_to", "attrs", "bond_options", "compute_attributes", "created_at", "domain_id", "host_id", "identifier", "ip", "link", "mac", "managed", "mode", "name", "password", "primary", "provider", "provision", "subnet_id", "tag", "type", "updated_at", "username", "virtual") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) RETURNING "id" 
        Process 10305: UPDATE "architectures" SET "hosts_count" = COALESCE("hosts_count", 0) - 1 WHERE "architectures"."id" = 1
[2015-10-14 07:50:21.695 CEST] [12248]: HINT:  See server log for query details.
[2015-10-14 07:50:21.695 CEST] [12248]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."domains" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-14 07:50:21.695 CEST] [12248]: STATEMENT:  INSERT INTO "nics" ("attached_devices", "attached_to", "attrs", "bond_options", "compute_attributes", "created_at", "domain_id", "host_id", "identifier", "ip", "link", "mac", "managed", "mode", "name", "password", "primary", "provider", "provision", "subnet_id", "tag", "type", "updated_at", "username", "virtual") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) RETURNING "id" 
[2015-10-14 07:50:22.640 CEST] [12256]: LOG:  process 12256 still waiting for ShareLock on transaction 299488094

We are on postgresql version 8.4.20.

Tomer Brisker told me this:

Looks like ufortunatley rails doesn't allow an after_commit callback, which would update in a different transaction and prevent deadlocks

I applied the patch in http://projects.theforeman.org/issues/5990 but it did not help. We were previously on foreman 1.6.3 and there we did not see any share locks and hanging queries against the database.


Related issues

Related to Foreman - Bug #11444: Domains page shows negative number of hostsClosed2015-08-21
Related to Foreman - Refactor #12484: Remove useless host group countersResolved2015-11-15
Related to Foreman - Bug #11741: Config group host counter not incremented on host creation with associationResolved2015-09-09
Related to Foreman - Bug #16622: ActiveRecord::StatementInvalid exception when accessing Hosts->Operating systems with restricted accessResolved2016-09-20
Related to Foreman - Bug #17666: Upgrade fails during db:migrationClosed2016-12-13
Related to Foreman - Bug #18260: Choose different org will result in the same org's host number change in organizations pageClosed2017-01-26

Associated revisions

Revision c773c0ef (diff)
Added by Tomer Brisker almost 4 years ago

Fixes #12193 - Stop using counter_cache for host[group] counts

This is aimed to prevent deadlocks caused by Rails' implementation of
counter_cache that is hit when multiple hosts are created or updated at
the same time. An added benefit is that the counters will now display
the number of hosts scoped to the current taxonomy and authorized for
the user to see. It also removes a lot of code used to workaround
various counter_cache issues.
The downsides are that sorting and filtering based on host counts will
no longer work, and that page load times increase slightly for pages
containing host counters, such as enviroments index.

Revision fef6cff7 (diff)
Added by Tomer Brisker almost 4 years ago

Refs #12193 - Remove remaining cached counters

History

#1 Updated by Stefan Julin over 4 years ago

Another example when trying to create two hosts at the same time:

[2015-10-16 13:05:21.931 CEST] [26005]: LOG:  process 26005 still waiting for ShareLock on transaction 299685807 after 1000.079 ms
[2015-10-16 13:05:21.931 CEST] [26005]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."architectures" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-16 13:05:21.931 CEST] [26005]: STATEMENT:  INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 
[2015-10-16 13:07:03.397 CEST] [26005]: LOG:  process 26005 acquired ShareLock on transaction 299685807 after 102465.235 ms
[2015-10-16 13:07:03.397 CEST] [26005]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."architectures" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-16 13:07:03.397 CEST] [26005]: STATEMENT:  INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 
[2015-10-16 13:07:03.398 CEST] [26005]: LOG:  duration: 102467.738 ms  execute a12: INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 

#2 Updated by Tomer Brisker over 4 years ago

  • Related to Bug #11444: Domains page shows negative number of hosts added

#3 Updated by Tomer Brisker over 4 years ago

  • Category set to Database
  • Priority changed from High to Normal

Deadlocks on domain table should be resolved, or at least reduced, by fix to #11444.
It seems like these deadlocks originate from how rails handles counter_cache, which is somewhat buggy in Rails 3.x (hopefully it will improve with the upgrade to Rails 4.x) in combination with postgresql 8.4 which does a bad job handling locks sometimes (see here for more info).
I'm not sure how much we can improve this on the foreman side other then completely remove cached counters or replace them with a different, concurrence-safe mechanism (such as Counter Culture, which sadly seems abandoned which seems to be maintained sparsly).

#4 Updated by Stefan Julin over 4 years ago

We have now upgraded to postgres 9.4 and get locks related to this (tested with sql query:

SELECT bl.pid                 AS blocked_pid,
         a.usename              AS blocked_user,
         ka.query               AS blocking_statement,
         now() - ka.query_start AS blocking_duration,
         kl.pid                 AS blocking_pid,
         ka.usename             AS blocking_user,
         a.query                AS blocked_statement,
         now() - a.query_start  AS blocked_duration
  FROM  pg_catalog.pg_locks         bl
   JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
   JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
   JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
  WHERE NOT bl.GRANTED;

):

There are two records in the result:

Query 1:

DELETE FROM "taxable_taxonomies" WHERE "taxable_taxonomies"."id" = $1

is waiting for this query:

UPDATE "domains" SET "hostgroups_count" = COALESCE("hostgroups_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)

Query 2:

UPDATE "domains" SET "hostgroups_count" = COALESCE("hostgroups_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)

is waiting for this query:

SELECT "hosts".id FROM "hosts"  WHERE "hosts"."type" IN ('Host::Managed') AND "hosts"."hostgroup_id" = 4292

So it looks like the :hostgroups_count logic in the Domain class (mapping to table domains) is making queries wait also in postgresql 9.4.

#5 Updated by Tomer Brisker about 4 years ago

  • Bugzilla link set to 1341338

#6 Updated by Chris Duryee about 4 years ago

Would it be ok to just remove the counter cache? I tested a couple of count(*) queries for 5.5K hosts and they were all pretty fast (less than 3msec).

#7 Updated by Tomer Brisker about 4 years ago

  • Status changed from New to Assigned

Chris - I have started to work in that direction already, once the code is ready I will run some performance testing against a heavy DB.
Removing the cache will give us the advantage of getting taxonomy scoped and authorized counts, at the price of not being able to sort by count, increasing load times and making searching by counts more difficult.

#8 Updated by Lukas Zapletal about 4 years ago

I share the opinion, one big warning - while it might work just fine on PostgreSQL, the other database MySQL can be the issue. If this turns out to be it, we should go ahead and reopen discussion of dropping MySQL from supported databases.

#9 Updated by The Foreman Bot about 4 years ago

  • Status changed from Assigned to Ready For Testing
  • Pull request https://github.com/theforeman/foreman/pull/3596 added

#10 Updated by Anonymous almost 4 years ago

  • Status changed from Ready For Testing to Closed
  • % Done changed from 0 to 100

#11 Updated by Dominic Cleal almost 4 years ago

  • Legacy Backlogs Release (now unused) set to 160

#12 Updated by Tomer Brisker almost 4 years ago

#13 Updated by Tomer Brisker almost 4 years ago

  • Related to Bug #11741: Config group host counter not incremented on host creation with association added

#14 Updated by Dominic Cleal almost 4 years ago

  • Related to Bug #16622: ActiveRecord::StatementInvalid exception when accessing Hosts->Operating systems with restricted access added

#15 Updated by Tomer Brisker over 3 years ago

  • Related to Bug #17666: Upgrade fails during db:migration added

#16 Updated by Dominic Cleal over 3 years ago

  • Related to Bug #18260: Choose different org will result in the same org's host number change in organizations page added

Also available in: Atom PDF