Project

General

Profile

Actions

Bug #12193

closed

Deadlock occuring when creating host

Added by Stefan Julin about 9 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Database
Target version:
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Description

The following happens quite often when trying to insert to nics table when creating a new host:

2015-10-15 11:55:40 [app] [W] Action failed
 | ActiveRecord::StatementInvalid: PGError: ERROR:  deadlock detected
 | DETAIL:  Process 15662 waits for ExclusiveLock on tuple (8,16) of relation 16
559 of database 16384; blocked by process 20700.
 | Process 20700 waits for ShareLock on transaction 299626203; blocked by proces
s 15662.
 | HINT:  See server log for query details.
 | : UPDATE "domains" SET "hosts_count" = COALESCE("hosts_count", 0) + 1 WHERE " 
domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2
 ORDER BY domains.name)
 | /opt/rh/ruby193/root/usr/share/gems/gems/activerecord-3.2.8/lib/active_record
/connection_adapters/postgresql_adapter.rb:1158:in `async_exec'
..
/associations/builder/belongs_to.rb:30:in `block in add_counter_cache_callbacks'
 | /usr/share/foreman/app/models/nic/base.rb:61:in `belongs_to_counter_cache_after_create_for_domain'
... 
 | /usr/share/foreman/app/models/concerns/foreman/sti.rb:29:in `save_with_type'
 | /usr/share/foreman/app/controllers/hosts_controller.rb:84:in `create'
...
 | /usr/share/foreman/app/controllers/concerns/application_shared.rb:13:in `set_
timezone'
[2015-10-14 07:43:19.739 CEST] [12290]: LOG:  process 12290 still waiting for ShareLock on transaction 299488065 after 1000.100 ms
[2015-10-14 07:43:19.739 CEST] [12290]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."operatingsystems" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
...
[2015-10-14 07:46:06.028 CEST] [10305]: LOG:  process 10305 still waiting for ShareLock on transaction 299488065 after 1000.123 ms
[2015-10-14 07:46:06.028 CEST] [10305]: STATEMENT:  UPDATE "domains" SET "hosts_count" = COALESCE("hosts_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)
...
[2015-10-14 07:50:21.695 CEST] [12248]: DETAIL:  Process 12248 waits for ShareLock on transaction 299488070; blocked by process 10305.
        Process 10305 waits for ShareLock on transaction 299488101; blocked by process 12248.
        Process 12248: INSERT INTO "nics" ("attached_devices", "attached_to", "attrs", "bond_options", "compute_attributes", "created_at", "domain_id", "host_id", "identifier", "ip", "link", "mac", "managed", "mode", "name", "password", "primary", "provider", "provision", "subnet_id", "tag", "type", "updated_at", "username", "virtual") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) RETURNING "id" 
        Process 10305: UPDATE "architectures" SET "hosts_count" = COALESCE("hosts_count", 0) - 1 WHERE "architectures"."id" = 1
[2015-10-14 07:50:21.695 CEST] [12248]: HINT:  See server log for query details.
[2015-10-14 07:50:21.695 CEST] [12248]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."domains" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-14 07:50:21.695 CEST] [12248]: STATEMENT:  INSERT INTO "nics" ("attached_devices", "attached_to", "attrs", "bond_options", "compute_attributes", "created_at", "domain_id", "host_id", "identifier", "ip", "link", "mac", "managed", "mode", "name", "password", "primary", "provider", "provision", "subnet_id", "tag", "type", "updated_at", "username", "virtual") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) RETURNING "id" 
[2015-10-14 07:50:22.640 CEST] [12256]: LOG:  process 12256 still waiting for ShareLock on transaction 299488094

We are on postgresql version 8.4.20.

Tomer Brisker told me this:

Looks like ufortunatley rails doesn't allow an after_commit callback, which would update in a different transaction and prevent deadlocks

I applied the patch in http://projects.theforeman.org/issues/5990 but it did not help. We were previously on foreman 1.6.3 and there we did not see any share locks and hanging queries against the database.


Related issues 6 (0 open6 closed)

Related to Foreman - Bug #11444: Domains page shows negative number of hostsClosedTomer Brisker08/21/2015Actions
Related to Foreman - Refactor #12484: Remove useless host group countersResolvedTomer Brisker11/15/2015Actions
Related to Foreman - Bug #11741: Config group host counter not incremented on host creation with associationResolved09/09/2015Actions
Related to Foreman - Bug #16622: ActiveRecord::StatementInvalid exception when accessing Hosts->Operating systems with restricted accessResolvedBrandon Weeks09/20/2016Actions
Related to Foreman - Bug #17666: Upgrade fails during db:migrationClosedTomer Brisker12/13/2016Actions
Related to Foreman - Bug #18260: Choose different org will result in the same org's host number change in organizations pageClosedDaniel Lobato Garcia01/26/2017Actions
Actions #1

Updated by Stefan Julin about 9 years ago

Another example when trying to create two hosts at the same time:

[2015-10-16 13:05:21.931 CEST] [26005]: LOG:  process 26005 still waiting for ShareLock on transaction 299685807 after 1000.079 ms
[2015-10-16 13:05:21.931 CEST] [26005]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."architectures" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-16 13:05:21.931 CEST] [26005]: STATEMENT:  INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 
[2015-10-16 13:07:03.397 CEST] [26005]: LOG:  process 26005 acquired ShareLock on transaction 299685807 after 102465.235 ms
[2015-10-16 13:07:03.397 CEST] [26005]: CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."architectures" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x" 
[2015-10-16 13:07:03.397 CEST] [26005]: STATEMENT:  INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 
[2015-10-16 13:07:03.398 CEST] [26005]: LOG:  duration: 102467.738 ms  execute a12: INSERT INTO "hosts" ("architecture_id", "build", "certname", "comment", "compute_profile_id", "compute_resource_id", "created_at", "disk", "enabled", "environment_id", "grub_pass", "hostgroup_id", "image_file", "image_id", "installed_at", "last_compile", "last_report", "location_id", "managed", "medium_id", "model_id", "name", "operatingsystem_id", "organization_id", "otp", "owner_id", "owner_type", "provision_method", "ptable_id", "puppet_ca_proxy_id", "puppet_proxy_id", "puppet_status", "realm_id", "root_pass", "type", "updated_at", "use_image", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38) RETURNING "id" 
Actions #2

Updated by Tomer Brisker about 9 years ago

  • Related to Bug #11444: Domains page shows negative number of hosts added
Actions #3

Updated by Tomer Brisker about 9 years ago

  • Category set to Database
  • Priority changed from High to Normal

Deadlocks on domain table should be resolved, or at least reduced, by fix to #11444.
It seems like these deadlocks originate from how rails handles counter_cache, which is somewhat buggy in Rails 3.x (hopefully it will improve with the upgrade to Rails 4.x) in combination with postgresql 8.4 which does a bad job handling locks sometimes (see here for more info).
I'm not sure how much we can improve this on the foreman side other then completely remove cached counters or replace them with a different, concurrence-safe mechanism (such as Counter Culture, which sadly seems abandoned which seems to be maintained sparsly).

Actions #4

Updated by Stefan Julin about 9 years ago

We have now upgraded to postgres 9.4 and get locks related to this (tested with sql query:

SELECT bl.pid                 AS blocked_pid,
         a.usename              AS blocked_user,
         ka.query               AS blocking_statement,
         now() - ka.query_start AS blocking_duration,
         kl.pid                 AS blocking_pid,
         ka.usename             AS blocking_user,
         a.query                AS blocked_statement,
         now() - a.query_start  AS blocked_duration
  FROM  pg_catalog.pg_locks         bl
   JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
   JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
   JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
  WHERE NOT bl.GRANTED;

):

There are two records in the result:

Query 1:

DELETE FROM "taxable_taxonomies" WHERE "taxable_taxonomies"."id" = $1

is waiting for this query:

UPDATE "domains" SET "hostgroups_count" = COALESCE("hostgroups_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)

Query 2:

UPDATE "domains" SET "hostgroups_count" = COALESCE("hostgroups_count", 0) - 1 WHERE "domains"."id" IN (SELECT "domains"."id" FROM "domains"  WHERE "domains"."id" = 2 ORDER BY domains.name)

is waiting for this query:

SELECT "hosts".id FROM "hosts"  WHERE "hosts"."type" IN ('Host::Managed') AND "hosts"."hostgroup_id" = 4292

So it looks like the :hostgroups_count logic in the Domain class (mapping to table domains) is making queries wait also in postgresql 9.4.

Actions #5

Updated by Tomer Brisker over 8 years ago

  • Bugzilla link set to 1341338
Actions #6

Updated by Chris Duryee over 8 years ago

Would it be ok to just remove the counter cache? I tested a couple of count(*) queries for 5.5K hosts and they were all pretty fast (less than 3msec).

Actions #7

Updated by Tomer Brisker over 8 years ago

  • Status changed from New to Assigned

Chris - I have started to work in that direction already, once the code is ready I will run some performance testing against a heavy DB.
Removing the cache will give us the advantage of getting taxonomy scoped and authorized counts, at the price of not being able to sort by count, increasing load times and making searching by counts more difficult.

Actions #8

Updated by Lukas Zapletal over 8 years ago

I share the opinion, one big warning - while it might work just fine on PostgreSQL, the other database MySQL can be the issue. If this turns out to be it, we should go ahead and reopen discussion of dropping MySQL from supported databases.

Actions #9

Updated by The Foreman Bot over 8 years ago

  • Status changed from Assigned to Ready For Testing
  • Pull request https://github.com/theforeman/foreman/pull/3596 added
Actions #10

Updated by Anonymous over 8 years ago

  • Status changed from Ready For Testing to Closed
  • % Done changed from 0 to 100
Actions #11

Updated by Dominic Cleal over 8 years ago

  • Translation missing: en.field_release set to 160
Actions #12

Updated by Tomer Brisker about 8 years ago

Actions #13

Updated by Tomer Brisker about 8 years ago

  • Related to Bug #11741: Config group host counter not incremented on host creation with association added
Actions #14

Updated by Dominic Cleal about 8 years ago

  • Related to Bug #16622: ActiveRecord::StatementInvalid exception when accessing Hosts->Operating systems with restricted access added
Actions #15

Updated by Tomer Brisker almost 8 years ago

  • Related to Bug #17666: Upgrade fails during db:migration added
Actions #16

Updated by Dominic Cleal almost 8 years ago

  • Related to Bug #18260: Choose different org will result in the same org's host number change in organizations page added
Actions

Also available in: Atom PDF