Bug #10579


Unresponsiveness during vmware or rhevm image clone

Added by Ivan Necas about 9 years ago. Updated about 9 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Orchestration
Target version:
-
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Description

Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1208654
Description of problem:

When creating a new VM/host from an image via the REST API (it does not matter whether VMware or RHEV, as it happens on both), Satellite 6 drops into a non-responsive state and rejects all requests until the vmdk or img clone is done. Once the disk is released, Satellite 6 is responsive again.

This results in failed registrations for any hosts trying to register, which is quite critical. It also prevents the user from creating multiple hosts from images at the same time (bulk creation of hosts via the API), meaning the user can only provision sequentially and has to wait for each host until it has at least uploaded one report.

Version-Release number of selected component (if applicable): 6.0.8

How reproducible:

Create a VM host via the API (image-based, not PXE) and watch the logs (the host becomes non-responsive).

Steps to Reproduce:
1. API call for VMware:

if provider == "vmware"
  post_json(url + "hosts", JSON.generate({"host" => {
    "name"=>hostname,
    "organization_id"=>organization_id,
    "location_id"=>location_id,
    "hostgroup_id"=>hostgroup_id,
    "compute_resource_id"=>compute_resource_id,
    "environment_id"=>environment_id,
    "content_source_id"=>"1",
    "managed"=>"true",
    "type"=>"Host::Managed",
    "compute_attributes"=>{"cpus"=>cpus, "corespersocket"=>corespersocket, "memory_mb"=>memory, "cluster"=>vmcluster, "path"=>"/Datacenters/#{datacenter}/vm", "guest_id"=>"otherGuest64", "interfaces_attributes"=>{"new_interfaces"=>{"type"=>"VirtualE1000", "network"=>network, "_delete"=>""}, "0"=>{"type"=>"VirtualE1000", "network"=>network, "_delete"=>""}}, "volumes_attributes"=>{"new_volumes"=>{"datastore"=>datastore, "name"=>"Hard disk", "size_gb"=>disksize, "thin"=>"true", "_delete"=>""}, "0"=>{"datastore"=>datastore, "name"=>"Hard disk", "size_gb"=>disksize, "thin"=>"true", "_delete"=>""}}, "scsi_controller_type"=>"VirtualLsiLogicController", "start"=>"1", "image_id"=>"templates/#{template}"},
    "domain_id"=>domain_id,
    "realm_id"=>"",
    "mac"=>"",
    "subnet_id"=>subnet_id,
    "ip"=>ipaddr,
    "interfaces_attributes"=>{"new_interfaces"=>{"_destroy"=>"false", "type"=>"Nic::Managed", "mac"=>"", "name"=>"", "domain_id"=>"", "ip"=>"", "provider"=>"IPMI"}},
    "architecture_id"=>architecture_id,
    "operatingsystem_id"=>operatingsystem_id,
    "provision_method"=>"image",
    "build"=>"1",
    "root_pass"=>rootpw,
    "medium_id"=>"",
    "disk"=>"",
    "enabled"=>"1",
    "model_id"=>"",
    "comment"=>"",
    "overwrite"=>"false"}}))
end

if provider == "RedhatVirt"
memsize = memory.to_i * 1024 * 1024
number1 = rand.to_s[2..14]
number2 = rand.to_s[2..14]
post_json(url+"hosts", JSON.generate({"host" => {
"name"=>hostname,
"organization_id"=>organization_id,
"location_id"=>location_id,
"hostgroup_id"=>hostgroup_id,
"compute_resource_id"=>compute_resource_id,
"environment_id"=>environment_id,
"content_source_id"=>"1",
"managed"=>"true",
"type"=>"Host::Managed",
"compute_attributes"=>{"cpus"=>cpus,"cores"=>corespersocket, "memory"=>memsize, "cluster"=>cluster, "interfaces_attributes"=>{"new_interfaces"=>{"name"=>"", "network"=>network, "_delete"=>""}, "new_#{number1}"=>{"name"=>"eth0", "network"=>network, "_delete"=>""}}, "volumes_attributes"=>{"new_volumes"=>{"size_gb"=>"", "storage_domain"=>datastore, "_delete"=>"", "id"=>""}, "new_#{number2}"=>{"size_gb"=>disksize, "storage_domain"=>datastore, "_delete"=>"", "id"=>""}}, "start"=>"1", "image_id"=>template},
"domain_id"=>domain_id,
"realm_id"=>"",
"mac"=>"",
"subnet_id"=>subnet_id,
"ip"=>ipaddr,
"interfaces_attributes"=>{"new_interfaces"=>{"_destroy"=>"false", "type"=>"Nic::Managed", "mac"=>"", "name"=>"", "domain_id"=>"", "ip"=>"", "provider"=>"IPMI"}},
"architecture_id"=>architecture_id,
"operatingsystem_id"=>operatingsystem_id,
"provision_method"=>"image",
"build"=>"1",
"disk"=>"",
"root_pass"=>rootpw,
"enabled"=>"1",
"model_id"=>"",
"comment"=>"",
"overwrite"=>"false"}}))
end
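The `post_json` helper used in both snippets is not shown in the report; a minimal sketch, assuming a JSON REST endpoint with HTTP basic auth (the helper signature and credentials here are assumptions, not from the report):

```ruby
require 'net/http'
require 'uri'

# Build a JSON POST request; separated out so it can be inspected
# without making a network call.
def build_json_post(url, json_body, user: 'admin', password: 'changeme')
  request = Net::HTTP::Post.new(URI(url))
  request['Content-Type'] = 'application/json'
  request['Accept'] = 'application/json'
  request.basic_auth(user, password)
  request.body = json_body
  request
end

# Hypothetical post_json matching the calls in the reproduction steps.
def post_json(url, json_body)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_json_post(url, json_body))
  end
end
```

Note that each call blocks until the server responds, which is exactly why the unresponsiveness below breaks bulk creation.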

2. Wait until it breaks.

Actual results:
Satellite 6 is not responsive during an image clone.

Expected results:
Being able to communicate with Satellite 6 during an image clone.

Additional info:

This was tested on a BL465 G6 with an SB40 (6x 15k disks), so it is not a hardware issue.

Actions #1

Updated by Ivan Necas about 9 years ago

  • Category set to Orchestration

The problem is that when the provisioning starts, it takes the one process available in Passenger and blocks it while waiting for the provisioning to finish (the finish script). When a second request comes in, a new Passenger process starts, but it takes a while for the environment to be ready, leading Foreman to act unresponsive and causing timeouts.

The long-term solution should be to move the provisioning into asynchronous mode (using foreman-tasks).

The temporary workaround should be to introduce installer parameters for setting the minimum number of Passenger processes, so that when one provisioning request comes to Foreman, other processes are still ready to handle further requests. However, even in this case, the user is responsible for keeping the number of such requests low. Also, the time spent in the Foreman process during provisioning can be lowered by preferring cloud-init over finish scripts.
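With the Apache variant of Passenger, the workaround above can be sketched as keeping spare application processes warm (directive names are from Passenger's Apache module; the values and hostname are illustrative, not recommendations from this report):

```apache
# Keep several Foreman processes alive so one long-running
# provisioning request cannot monopolize the only available worker.
PassengerMinInstances 6
# Spawn processes at web server start instead of on the first request,
# avoiding the slow environment startup described above.
PassengerPreStart https://satellite.example.com/
```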

Actions #2

Updated by Dominic Cleal about 9 years ago

  • Subject changed from Satellite6 is unresponsive during vmware or rhevm image clone to Unresponsiveness during vmware or rhevm image clone

Also reduce the number of plugins and things installed and startup time will be quick, as it is in a default Foreman installation.
