Bug #17554

closed

ISC dhcpd provider leases monitor doesn't keep up with lease file updates in a large network

Added by Anonymous over 7 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: DHCP
Target version:
Difficulty:
Triaged:
Fixed in Releases:
Found in Releases:

Related issues 1 (1 open, 0 closed)

Related to Smart Proxy - Bug #17373: ISC dhcp provider is unable to handle very big networks (New)
Actions #1

Updated by The Foreman Bot over 7 years ago

  • Status changed from New to Ready For Testing
  • Pull request https://github.com/theforeman/smart-proxy/pull/479 added
Actions #2

Updated by Anonymous over 7 years ago

  • Related to Bug #17373: ISC dhcp provider is unable to handle very big networks added
Actions #3

Updated by Anonymous over 7 years ago

Konstantin, what ruby version are you using? I think the PR will resolve the problem you are seeing, but I couldn't verify it, as I could not get the inotify queue to overflow no matter how hard I tried.

Actions #4

Updated by Konstantin Orekhov over 7 years ago

I run ruby 2.0 included with CentOS 7:

ruby 2.0.0p598 (2014-11-13) [x86_64-linux]

I'll test your PR and let you know.

Thanks for staying on top of this!

Actions #5

Updated by Konstantin Orekhov over 7 years ago

Alas, I don't see any improvement in a large deployment: the problem shows up as soon as inotify events start happening (I'm rsync'ing dhcpd.leases from a prod DHCP server into the Foreman smart-proxy VM using syncd [1]).

Please let me know if you want to see a debug log or something...

[1] - https://github.com/drunomics/syncd

Actions #6

Updated by Konstantin Orekhov over 7 years ago

Well, to follow up on what I mentioned above - there's a difference in inotify events on a system that actually runs ISC DHCP (small environment):

D, [2016-12-01T16:11:38.475486 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:12:36.409427 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:05.835620 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:17.342263 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:45.384206 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:48.839135 #1227] DEBUG -- : caught :modify event on /dhcp/conf/dhcpd.leases.

And the systems I replicate the change to with syncd (aka rsync triggered by inotify events):

D, [2016-12-01T16:12:39.147068 #30776] DEBUG -- : caught :moved_to event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:08.279437 #30776] DEBUG -- : caught :moved_to event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:20.252106 #30776] DEBUG -- : caught :moved_to event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:47.816966 #30776] DEBUG -- : caught :moved_to event on /dhcp/conf/dhcpd.leases.
D, [2016-12-01T16:13:50.247491 #30776] DEBUG -- : caught :moved_to event on /dhcp/conf/dhcpd.leases.

What I noticed is that in the first case only the changes are loaded, but in the second case the whole file is reloaded, which takes ~30 seconds each time. So this definitely contributes to the issue I'm observing.
I'll try to figure out a way to test your PR on an actual prod DHCP server and see how it performs over there.
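
To illustrate why a ~30 second full reload per event cannot keep up, here is a rough sketch that replays the :moved_to timestamps from the log above (truncated to whole seconds). It assumes, purely for illustration, that events are handled one at a time and that every reload takes exactly 30 seconds; the function name and the constants are hypothetical and not taken from smart-proxy.

```python
from datetime import datetime

# :moved_to timestamps from the log above, truncated to whole seconds.
EVENT_TIMES = ["16:12:39", "16:13:08", "16:13:20", "16:13:47", "16:13:50"]
RELOAD_SECONDS = 30  # assumption: each full reload takes exactly ~30 s


def completion_lags(event_times, reload_seconds):
    """Return, per event, the seconds between its arrival and the end of its reload.

    Assumes a single worker that processes events strictly in order.
    """
    first = datetime.strptime(event_times[0], "%H:%M:%S")
    free_at = 0.0  # moment the worker finishes its current reload
    lags = []
    for stamp in event_times:
        arrival = (datetime.strptime(stamp, "%H:%M:%S") - first).total_seconds()
        start = max(arrival, free_at)  # wait until the previous reload is done
        free_at = start + reload_seconds
        lags.append(int(free_at - arrival))
    return lags


if __name__ == "__main__":
    print(completion_lags(EVENT_TIMES, RELOAD_SECONDS))
```

Under these assumptions the lag grows with every event, because events arrive faster than a full reload completes.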

Meanwhile, I have a question for you, Dmitri - is it possible to handle :moved_to event the same way as :modify and load only the differences?

Actions #7

Updated by Anonymous over 7 years ago

Meanwhile, I have a question for you, Dmitri - is it possible to handle :moved_to event the same way as :modify and load only the differences?

moved_to is meant to catch dhcpd recreating the leases file, which happens every hour. Incremental updates won't work here: the file descriptor we are holding from before the re-creation isn't valid anymore (to be more specific, it no longer points to the correct leases file).
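
The difference between the two events can be sketched as follows. This is a minimal illustration, not smart-proxy's actual code: the class `LeaseFileFollower` and its method names are hypothetical, and the events are simulated by calling the handlers directly instead of using inotify.

```python
import os
import tempfile


class LeaseFileFollower:
    """Follows a leases file through a held file handle (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")
        self.handle.read()  # initial full load; the offset is now at end of file

    def on_modify(self):
        # Data was appended: continue reading from the offset of the held handle.
        return self.handle.read()

    def on_moved_to(self):
        # The path now names a new file; the held handle still refers to the
        # replaced one, so reopen the path and load everything from the start.
        self.handle.close()
        self.handle = open(self.path, "rb")
        return self.handle.read()


def demo():
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "dhcpd.leases")
    with open(path, "wb") as f:
        f.write(b"lease A\n")
    follower = LeaseFileFollower(path)

    # Append in place, as dhcpd does between re-creations.
    with open(path, "ab") as f:
        f.write(b"lease B\n")
    appended = follower.on_modify()

    # Replace the file by renaming a new one over it.
    replacement = os.path.join(directory, "dhcpd.leases.new")
    with open(replacement, "wb") as f:
        f.write(b"lease A\nlease B\nlease C\n")
    os.replace(replacement, path)

    stale = follower.on_modify()       # old handle: sees nothing new
    reloaded = follower.on_moved_to()  # reopened: sees the whole new file
    return appended, stale, reloaded
```

After the rename, an incremental read through the old handle returns nothing, which is why a full reload is the only option on moved_to.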

Actions #8

Updated by Anonymous over 7 years ago

You might be able to get syncd to work with smart-proxy, but this will require changes to syncd: on modify event it will need to use "--inplace --append" rsync options. moved_to event handler doesn't appear to need any changes.

Actions #9

Updated by Konstantin Orekhov over 7 years ago

Dmitri Dolguikh wrote:

You might be able to get syncd to work with smart-proxy, but this will require changes to syncd: on modify event it will need to use "--inplace --append" rsync options. moved_to event handler doesn't appear to need any changes.

Yes, that trick worked - thanks a lot! However, syncd/rsync seem not to be stable enough in a large environment (dozens of changes per second). I noticed that after several hours of rsync'ing dhcpd.leases to the 2nd and 3rd Foreman nodes, something gets a "lock" (for lack of a better term) on dhcpd.leases on a destination host, so updates stop coming in. I can't figure out what locks that file: lsof does not show anything, but I know that to get out of that condition I have to remove the file so that updates start coming in again. But that's not all: I also have to stop/start smart-proxy (restart does not help), as the smart-proxy log shows that :modify events are detected, but it does not load the changes. If you have more ideas on this - I'm all ears.

The other thought I'm having is to go back to a shared FS (NFS or GlusterFS maybe) to avoid having to rsync that often. The concern you had with NFS (or any shared FS, I assume) is that changes to dhcpd.leases are detected by inotify only on the machine that actually makes the change (the node that runs the dhcpd daemon). So, if that's the only concern, I wonder if there's any other way to poke smart-proxy on the 2nd/3rd node to reload dhcpd.leases just as if it had detected a change through inotify?
If that's a possibility, syncd may be used to run that command on the 2nd and 3rd nodes instead of starting an rsync to them. Thoughts?

Actions #10

Updated by Anonymous over 7 years ago

I noticed that after several hours of rsync'ing dhcpd.leases to 2nd and 3rd Foreman nodes, something gets a "lock" (for the lack of better term) on dhcpd.leases on a destination host so updates stop coming in. And I can't figure out what locks that file - lsof does not show anything, but I know to get out of that condition I have to remove a file so updates start coming in.

This isn't enough information to diagnose the problem. Are incremental updates being used for moved_to events too? If so, this might be the problem: dhcpd recreates the leases file once every hour and appends changes to the leases file at all other times. It's imperative that the script performing rsync uses incremental updates on modify events (otherwise the whole leases file is parsed on every update) and completely overwrites the file on moved_to events (otherwise there will be issues with the read offset in the already open file).
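
The rule above could be sketched as a small dispatcher in the sync script. This is only an illustration under assumptions: the function name and the source/destination arguments are hypothetical, and only the "--inplace --append" options come from this thread. The moved_to branch relies on rsync's default behaviour of writing a temporary file and renaming it into place.

```python
def rsync_command(event, source, destination):
    """Build the rsync argument list for an inotify event seen on the dhcpd node.

    event is "modify" or "moved_to"; source and destination are whatever
    rsync accepts (the values used in the example below are made up).
    """
    if event == "modify":
        # dhcpd appended to the leases file: ship only the new tail and write
        # it into the existing destination file, so the receiver sees :modify.
        return ["rsync", "--inplace", "--append", source, destination]
    if event == "moved_to":
        # dhcpd recreated the leases file: replace the destination file as a
        # whole, so the receiver sees :moved_to and reloads from scratch.
        return ["rsync", source, destination]
    raise ValueError("unsupported event: %r" % (event,))


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    print(rsync_command("modify", "/dhcp/conf/dhcpd.leases", "node2:/dhcp/conf/"))
```

The returned list can be passed to `subprocess.run` by the caller; keeping command construction separate makes the per-event choice easy to check.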

So, if that's the only concern, I wonder if there's any other way to poke smart-proxy on 2nd/3rd node to reload dhcpd.leases just as if it detected a change through inotify?

If you can tolerate a ~30sec? delay (that's roughly the time it takes smart-proxy to load in your network?) during failover, you could keep the leases file synchronized by whatever means you deem suitable, and start/restart smart-proxy when the main node fails (the leases file must be on the local filesystem before smart-proxy is launched, though).

Actions #11

Updated by Anonymous over 7 years ago

  • Status changed from Ready For Testing to Closed
  • % Done changed from 0 to 100
Actions #12

Updated by Dominic Cleal over 7 years ago

  • translation missing: en.field_release set to 210
Actions #13

Updated by Konstantin Orekhov over 7 years ago

  • translation missing: en.field_release deleted (210)

This isn't enough information to diagnose the problem. Are incremental updates being used for moved_to events too? If so, this might be the problem: dhcpd recreates the leases file once every hour and appends changes to the leases file at all other times. It's imperative that the script performing rsync uses incremental updates on modify events (otherwise the whole leases file is parsed on every update) and completely overwrites the file on moved_to events (otherwise there will be issues with the read offset in the already open file).

As per your advice, I've modified the rsync options to append, which results in :modify events on the receiving side. Let me separate the rsync calls and use different options depending on the event type detected on the originating node.

If you can tolerate a ~30sec? delay (that's roughly the time it takes smart-proxy to load in your network?) during failover, you could keep the leases file synchronized by whatever means you deem suitable, and start/restart smart-proxy when the main node fails (the leases file must be on the local filesystem before smart-proxy is launched, though).

Well, each of my Foreman installations is an active/active 3-node "cluster" behind an LB. Only one node runs dhcpd, but all 3 serve as Foreman and as TFTP, BMC, puppet, etc. smart-proxies. I've been trying to do the same for the DHCP smart-proxy. Both DHCP and TFTP data are on an NFS-mounted FS at this point. With your latest performance-related changes, things improved a lot for me in this setup, but losing NFS support is a problem. It looks like 3-way syncd for TFTP is working rather well, so now I just need to stabilize the DHCP portion. The way I'm trying to make it work is that only the active dhcpd node pushes changes to the other 2 nodes. When/if a failover happens and another node picks up dhcpd, it becomes the active one and starts shipping changes to the other 2 nodes, hopefully even without restarts of the smart-proxy processes. But the latter is not so important, as it could be done as part of the failover procedure by pacemaker.

Let me continue experiments here. Also, I'd like to say one more time: I very much appreciate your attention to this matter. Most people do not see these kinds of problems because of the scale of their environments. I do, and since I like Foreman, I'd like to make it better and more scalable, and thus usable by a larger audience.

Thanks!

Actions #14

Updated by Dominic Cleal over 7 years ago

  • translation missing: en.field_release set to 210