Feature #31995

open

Implement maximum NICs per host safety measure

Added by Florian Rosenegger almost 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
Triaged:
No
Fixed in Releases:
Found in Releases:

Description

Hello,

We are running a vanilla Foreman (currently Foreman 2.3.3) on Debian 10 Buster.
The server currently has 4 cores and 24 GB of memory. Before the upgrade from Foreman 2.2.x everything was working fine.
We started with a plain apache+passenger setup, but I migrated to apache+mod_proxy+puma today to test whether the problem is still the same.

At the moment Puma is running with:
Environment=MALLOC_ARENA_MAX=2
Environment=FOREMAN_PUMA_THREADS_MIN=2
Environment=FOREMAN_PUMA_THREADS_MAX=4
Environment=FOREMAN_PUMA_WORKERS=2

I have tried several variations of the thread MIN/MAX and WORKERS values, so far without success.

Attached are screenshots of this server's memory consumption during the last 25 days and the last 4 hours, as well as its CPU consumption during the last 25 days.
As you can see, since the upgrade we need a lot more memory and we max out all the CPUs the server has (keep in mind that this was still running Passenger until today).
Here is a little copy of the top lines of top ;):
%Cpu(s): 56.3 us, 3.2 sy, 0.0 ni, 38.6 id, 0.7 wa, 0.0 hi, 1.2 si, 0.0 st
MiB Mem : 24102.6 total, 8381.9 free, 14516.3 used, 1204.4 buff/cache
MiB Swap: 192.0 total, 41.0 free, 151.0 used. 8599.9 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
23151 foreman 20 0 7970396 7.0g 9328 S 100.0 29.8 27:58.10 foreman-ruby
23144 foreman 20 0 6860460 6.0g 9256 S 101.0 25.7 24:27.77 foreman-ruby

I also did an analysis of production.log to help you guys figure this out:
Tue 02 Mar 2021 03:42:05 PM UTC: summarizing stats..

there were 15897 requests taking 30739314 ms (i.e. 8.54 hours, i.e. 0.36 days) in summary

type count min max avg mean sum percentage
--------------------------------------------------------------------------------------------------------------------
AboutController#index 1 838 838 838 838 830.00 %
ConfigReportsController#create 9989 25 175400 1113 380 11123334 36.19 %
FactValuesController#index 8 376 11111 3218 1062 25746 0.08 %
HostsController#facts 4089 22 1778503 4518 1566 18475198 60.10 %
HostsController#power_status 642 23 20147 358 107 230143 0.75 %
ComputeResourcesController#index 1 85 85 85 85 850.00 %
ComputeResourcesController#ping 3 892 6217 3005 1907 9016 0.03 %
ComputeResourcesController#show 1 521 521 521 521 520.00 %
ComputeResourcesVmsController#index 1 4381 4381 4381 4381 4381 0.01 %
ConfigReportsController#index 4 290 12323 3310 308 13242 0.04 %
ConfigReportsController#show 5 109 353 211 194 1056 0.00 %
DashboardController#destroy 3 55 75 65 65 190.00 %
DashboardController#index 647 14 10783 165 52 107193 0.35 %
DashboardController#show 130 45 29988 2627 491 341577 1.11 %
FactValuesController#auto_complete_search 9 14 40948 4577 29 41196 0.13 %
FactValuesController#index 6 499 1617 1127 1273 6762 0.02 %
HostgroupsController#index 1 7382 7382 7382 7382 7382 0.02 %
HostsController#auto_complete_search 3 29 2219 812 190 2438 0.01 %
HostsController#cancelBuild 1 5808 5808 5808 5808 5808 0.02 %
HostsController#destroy 3 320 13146 6539 6152 19618 0.06 %
HostsController#externalNodes 1 37 37 37 37 370.00 %
HostsController#index 22 3 32591 3430 719 75469 0.25 %
HostsController#multiple_destroy 1 3762 3762 3762 3762 3762 0.01 %
HostsController#nics 3 121 201 156 148 470.00 %
HostsController#overview 3 71 362 255 333 760.00 %
HostsController#resources 3 28 54 45 53 130.00 %
HostsController#review_before_build 1 5072 5072 5072 5072 5072 0.02 %
HostsController#runtime 3 50 123 95 113 280.00 %
HostsController#setBuild 1 9558 9558 9558 9558 9558 0.03 %
HostsController#show 3 183 487 297 222 890.00 %
HostsController#submit_multiple_destroy 1 45767 45767 45767 45767 45767 0.15 %
HostsController#templates 3 526 616 563 549 1691 0.01 %
HostsController#vm 3 1100 2373 1658 1501 4974 0.02 %
ImagesController#index 1 78 78 78 78 780.00 %
NotificationRecipientsController#index 277 9 18938 404 22 111987 0.36 %
SettingsController#index 4 3 6215 1747 236 6991 0.02 %
SmartProxiesController#index 1 14054 14054 14054 14054 14054 0.05 %
SmartProxiesController#ping 8 400 13573 4373 1327 34989 0.11 %
UnattendedController#built 1 3158 3158 3158 3158 3158 0.01 %
UnattendedController#host_template 2 527 729 628 527 1256 0.00 %
UsersController#edit 1 300 300 300 300 300.00 %
UsersController#login 7 17 806 270 232 1893 0.01 %

concurrent requests:
- MAX: 106 when processing request with ID 'fe7860d6'
- AVG: 55
- MEAN: 59
- 90%PERCENTILE: 88

I'm not sure what else I can do to improve the situation - I would be glad for any hints.
Thanks!


Files

last 4 hours.jpg 35.4 KB Florian Rosenegger, 03/02/2021 03:39 PM
last_25_days.jpg 39.1 KB Florian Rosenegger, 03/02/2021 03:39 PM
last_25_days_cpu.jpg 47.8 KB Florian Rosenegger, 03/02/2021 03:43 PM
Actions #1

Updated by Florian Rosenegger almost 4 years ago

After some more investigation - something is definitely broken in the fact handling. As soon as I disabled fact pushing from the Puppet servers (I use the node.rb script with --push-facts via cronjob), the CPU spikes and memory leaks disappeared.

I circled in closer and it seems Foreman is trying to update a managed NIC on a server and continues with the job even after apache2 was stopped and there were no further updates.
A quick look at the facts showed an interface combination with some VLANs and a bridge.

Yes, the host has 80 interfaces (tap interfaces, as it's a virtualization host), but a quick look into the database showed a slightly different picture:

select count(*), host_id, identifier from nics group by host_id, identifier order by count(*) asc;
    10 |     988 | vmbr0v82.87
    16 |     988 | vmbr0v50.71
    17 |     988 | vmbr0v33.33
    18 |     988 | eno2.92
    21 |     988 | vmbr0v34.71
    24 |     988 | vmbr0v50.86
    25 |     988 | vmbr0v34.79
    26 |     988 | vmbr0v50.85
    41 |     988 | vmbr0v33.34
    47 |     988 | vmbr0v50.34
    64 |     988 | vmbr0.96
    75 |     988 | vmbr0.95
    82 |     988 | vmbr0v34.50
   101 |     988 | vmbr0.94
   107 |     988 | vmbr0v50.33
   118 |     988 | vmbr0.93
   133 |     988 | vmbr0.92
   136 |     988 | eno2.71
   141 |     988 | vmbr0v33.71
   167 |     988 | eno2.91
   190 |     988 | eno2.90
   197 |     988 | eno2.87
   197 |     988 | eno2.86
   208 |     988 | eno2.85
   238 |     988 | vmbr0v33.50
   250 |     988 | vmbr0v34.33
   251 |     988 | eno2.84
   263 |     988 | vmbr0v34.34
   268 |     988 | eno2.79
   271 |     988 | eno2.83
   273 |     988 | eno2.82

As soon as I deleted the interfaces (by SQL - deletion of the host in Foreman was not possible => same high CPU spike...), Foreman started to behave normally.
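
For reference, a minimal sketch of the kind of cleanup I mean - only a sketch, assuming the nics table has a plain integer id primary key, and run only with the services stopped and a database backup at hand:

-- Sketch only: keep the lowest id per (host_id, identifier) for the affected host
-- and delete the remaining duplicated rows. Verify with a SELECT before deleting.
DELETE FROM nics
WHERE host_id = 988
  AND id NOT IN (
    SELECT MIN(id)
    FROM nics
    GROUP BY host_id, identifier
  );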

At the moment I'm not sure what triggered the duplication of interfaces in Foreman, and I can't reproduce the issue.

So I guess you can close this bug; at least I hope this could help someone else identify similar OOM issues.

Greetings,
Florian

Actions #2

Updated by Lukas Zapletal almost 4 years ago

Hello, first off - well described. I wish all bug reports looked like yours.

Your numbers confirm what we see - most of the load comes from facts and reports. In fact, we are rewriting reports from scratch at the moment and facts should be up next. The most important change will be that instead of consuming all facts we will be cherry-picking just the core facts. Instead of a block list, users will have an allow list.

Now, to your problem. Can you confirm that the upgrade from 2.2 to 2.3 caused this fact regression? I can then take a look in the codebase to see what has changed.

Generally there are two problems with facts.

A) Foreman can update some database fields from facts, e.g. operating system, domain and, more importantly, NICs. Very often this NIC updating is useless and it takes a lot of resources (CPU, SQL). You can turn off parsing of NICs in settings, and you can also exclude some interface identifiers from being processed.

B) Too many facts. This happens when you have disk arrays with thousands of disks, many VMs/containers (NICs), etc. There is also a filter available in settings. This should also solve (A) if you filter out network interfaces. In recent versions, when a structured fact has many nodes, Foreman drops subtrees with more than 100 nodes. (A query to see which fact names dominate is sketched below.)

Let me know if filtering helps.
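
For (B), if you want to see which fact names actually dominate the database, something along these lines can help - a sketch only, assuming the stock fact_names/fact_values tables on PostgreSQL:

-- Sketch: list the fact names with the most stored values across all hosts.
-- Names with huge counts are good candidates for the exclusion filter.
SELECT fn.name, COUNT(fv.id) AS stored_values
FROM fact_values fv
JOIN fact_names fn ON fn.id = fv.fact_name_id
GROUP BY fn.name
ORDER BY stored_values DESC
LIMIT 20;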

Edit: After you apply the filters you need to delete the offending NICs (see your SQL select above) and also facts. There is a rake task for that.

https://theforeman.org/2018/03/foreman-117-fact-filtering.html

I think in your case you want to set a filter for vmbr*.

Actions #3

Updated by Florian Rosenegger almost 4 years ago

Hi Lukas,

I tried my best to sum everything up, as it helped me too to understand the issue.

At the moment I can't tell for sure whether the problem came with the update or with some Foreman VM crashes during fact import, as we had some issues with our virtualisation.

I cleaned up the interfaces and extended the filters you mentioned, and the load is now totally OK.

But what is important:
- There is a code path somewhere that tries to update managed NICs and that added loads and loads of duplicated interfaces for one host,
which leads to:
- If there are too many interfaces for one host, it ends in high CPU/high memory consumption until the service is restarted or the process is killed by the OOM killer (the latter happened to us).

So maybe there is a possibility to implement some safety mechanism in case of duplicate interfaces (same name for the same host?).
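
As an illustration of what I mean - a sketch only, assuming the duplicates have already been cleaned up and that an identifier is never legitimately repeated per host (the real fix probably belongs in the application layer):

-- Sketch: enforce one NIC row per (host_id, identifier) at the database level.
-- Creation fails while duplicates still exist; rows with a blank identifier
-- may need a partial index or an application-level validation instead.
CREATE UNIQUE INDEX index_nics_on_host_id_and_identifier
    ON nics (host_id, identifier);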

Greetings,
Florian

Actions #4

Updated by Lukas Zapletal almost 4 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from Possible Memory Leak / Out of Memory Issue to Implement maximum NICs per host safety measure

We do have a maximum-facts-per-subtree safety measure, but it looks like NICs can still sneak into the database by constantly adding more and more interfaces. The default maximum number of facts per subtree is 100, but facter reports them in a random order, or they can also be short-lived interfaces. This can make the list of NICs grow.

Updating NICs (and hosts) in Foreman involves a lot of code and it is slow when called 100 times per request. Therefore I am turning this ticket into a feature request to implement a maximum number of NICs per host. When the limit is exceeded, the transaction must fail with an error, causing the fact upload to fail too, so problems are detected early while there is still time to correct things.
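
Until such a limit exists in the code, a query along these lines (a sketch; the threshold of 100 is arbitrary) can at least flag hosts that have already accumulated a suspicious number of NICs:

-- Sketch: list hosts whose stored NIC count exceeds an arbitrary threshold of 100.
SELECT host_id, COUNT(*) AS nic_count
FROM nics
GROUP BY host_id
HAVING COUNT(*) > 100
ORDER BY nic_count DESC;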
