Bug #2687 (Closed)
Performance issues with large ISC dataset (DHCP smart proxy)
Description
Hello all!
While working on support for more standard DHCP options in host reservations (pull request https://github.com/theforeman/smart-proxy/pull/97), which includes a global search by MAC, IP or hostname (the /dhcp/find/<record> calls you'll see below), I did a lot of testing on a production dataset.
As I've already reported in other threads, I'm running into serious performance issues with the DHCP smart-proxy using the ISC DHCP backend.
Below is the data I collected while executing various (local) API calls against the DHCP proxy running on the following hardware:
$ facter | egrep "proc|mem"
memoryfree => 124.06 GB
memorysize => 125.76 GB
memorytotal => 125.76 GB
physicalprocessorcount => 2
processor0 => Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
<32 procs>
processorcount => 32
ISC DHCP dataset: 7656 subnets with 50848 leases
I've tried both WEBrick and Apache/Passenger, but that made no difference in the API response times, so I'm listing the details from the WEBrick exercise only. As you will see below, some calls (major ones from a functionality point of view) could not even complete within 10 minutes, so they were interrupted with ^C:
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp
real 0m37.618s
user 0m0.015s
sys 0m0.016s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/169.254.1.0
{"reservations":[],"leases":[]}
real 1m8.808s
user 0m0.012s
sys 0m0.008s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/169.254.1.0/00:50:56:39:ac:40
Record 169.254.1.0/00:50:56:39:ac:40 not found
real 1m8.572s
user 0m0.020s
sys 0m0.000s
$ time curl -3 -H "Accept:application/json" -k -X POST https://localhost:8443/dhcp/169.254.1.0 -d 'mac=00:50:56:39:ac:40' -d 'ip=169.254.1.203' -d 'hostname=blah'
^C
real 10m24.368s
user 0m0.012s
sys 0m0.024s
<had to create the above record through omshell>
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/169.254.1.0/00:50:56:39:ac:40
{"ip":"169.254.1.203","hostname":"blah","mac":"00:50:56:39:ac:40","subnet":"169.254.1.0/255.255.255.0"}
real 1m8.628s
user 0m0.016s
sys 0m0.008s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/find/00:50:56:39:ac:40
^C
real 10m39.027s
user 0m0.020s
sys 0m0.016s
$ time curl -3 -H "Accept:application/json" -k -X DELETE https://localhost:8443/dhcp/169.254.1.0/00:50:56:39:ac:40
real 1m9.113s
user 0m0.012s
sys 0m0.012s
As you can see, even the successful calls take an unacceptably long time on large datasets, and IMHO some refactoring has to be done.
While browsing the code, it became clear to me that building the subnet/lease maps on each request is the major contributor to this problem. Also, some of the extra validation before creating records is really unnecessary, as omshell will do that validation much faster. I believe that at least for the following operations the smart-proxy should provide a much thinner layer (almost a pass-through) to omshell: create, search and delete host reservation.
Not all of the operations are available through omshell (such as getting the list of subnets, or the leases/hosts in a particular subnet), so that parsing obviously has to stay, though maybe it can be improved in some ways to speed things up. Another example of something that cannot be done through omshell is getting all the options for a particular record, but building a full map of subnets/leases and then searching through it is not the most efficient way to do that: only dhcpd.leases needs to be parsed, which would be much faster. A sketch of the pass-through idea follows.
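To make the pass-through concrete, here is a minimal sketch of what I mean (my assumption of how it could look, not actual smart-proxy code; the server, port and helper names are made up, and an authenticated server would also need an OMAPI key):

require 'open3'

# Run a batch of omshell commands and return its raw output.
# This helper and its connection details are illustrative only.
def omshell(body)
  script = "server 127.0.0.1\nport 7911\nconnect\n#{body}\n"
  out, _err, _status = Open3.capture3('omshell', :stdin_data => script)
  out
end

# One omshell round-trip replaces parsing dhcpd.conf plus dhcpd.leases:
def find_host_by_mac(mac)
  omshell("new host\nset hardware-address = #{mac}\nopen")
end

puts find_host_by_mac('00:50:56:39:ac:40')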
These are just my thoughts on some of the points, and I'd like to hear yours.
Thanks!
Updated by Ohad Levy over 11 years ago
- Status changed from New to Feedback
First... do you really have a single huge DHCP infrastructure (e.g. one service/server), or is it just a collection of all your DHCP configs?
There are multiple ways to solve the problem. The main reason the proxy loads the entire configs on each request is simply that we then don't need to know when to invalidate a cache (it's pretty easy to store the @server or @subnet objects in memory/cache).
I'm pretty sure that, given some free time, it would be possible to understand what the slowest part is now and potentially remove that limitation, either as you said (lazy loading, not loading at all, etc.) or simply by figuring out what's slow (using a profiler, debugger, logging, etc.) and trying to address that.
Eventually, if that's not enough, we could/should look into caching the data and expiring it after some time.
What are your thoughts? I personally don't have access to such a large dataset; can you share yours?
Updated by Konstantin Orekhov over 11 years ago
Yes, this is a single dataset from one of our datacenters, and we have several of those. We have our reasons not to split it into smaller pieces, and that is a separate topic on its own. At this point my goal is to make the Foreman smart-proxy efficient enough to cope with any size of dataset thrown at it.
I'm checking internally whether I can share that with you under the NDA we have with Red Hat, but it may be faster to find a way to generate such a set on your own. I'll look into masking our data with fakes for you.
In terms of the cache: I'm not sure it is a good idea in places where direct interaction with omshell is possible. For example, for record validation or a global search by hostname, IP or MAC, the following omshell session works like a charm; no cache/map is needed:
obj: <null>
> new host
obj: host
> set name = "some"
obj: host
name = "some"
> open
obj: host
name = "some"
ip-address = 0a:6d:56:c7
hardware-address = ff:ff:ff:ff:ff:ff
hardware-type = 00:00:00:01
>
Or, during the host create operation, why not rely solely on omshell to do the validation, like this?
obj: <null>
> new host
obj: host
> set hardware-address = ff:ff:ff:ff:ff:ff
obj: host
hardware-address = ff:ff:ff:ff:ff:ff
> set ip-address = 1.1.1.1
obj: host
hardware-address = ff:ff:ff:ff:ff:ff
ip-address = 01:01:01:01
> set name = "blah"
obj: host
hardware-address = ff:ff:ff:ff:ff:ff
ip-address = 01:01:01:01
name = "blah"
> create
can't open object: key conflict
obj: host
hardware-address = ff:ff:ff:ff:ff:ff
ip-address = 01:01:01:01
name = "blah"
>
One thing is still unclear to me: what is the significance of the subnet information for a DHCP record in Foreman? Why does the user have to provide a subnet and an IP for a new record in the API call, when in the background the subnet info is dropped and not used by omshell at all? Is the subnet info returned by the sample call below used by Foreman anywhere later on?
# curl -k -X GET https://dhcp.vip:8443/dhcp/10.109.86.0/ff:ff:ff:ff:ff:ff
{"subnet":"10.109.86.0/255.255.254.0","ip":"10.109.86.199","hostname":"some","mac":"ff:ff:ff:ff:ff:ff"}
If there's no special meaning to it and no other flows currently rely on the subnet information, I'd propose removing it, as that would greatly simplify the smart-proxy's operations. In the example above, the following happens in the background (from my understanding of the code, which could be wrong/incomplete):
1. dhcpd.conf is read for subnet info
2. dhcpd.leases is read for lease info
3. a map of subnets is created and populated with the corresponding leases/reservations
4. the specified MAC is searched for in that map
5. the found record is returned
If there is no meaning to the subnet info returned with the found record, then all five of these steps could be replaced with just one (my first omshell example at the beginning), which takes far less time.
Does this make sense to you?
Updated by Konstantin Orekhov over 11 years ago
- File large-dhcp.tgz added
I have generated a smaller dataset for your troubleshooting efforts, Ohad. It has 1582 subnets and ~41K leases.
Even though there's a noticeable improvement on it and the majority of API calls have dropped their execution time by ~50% (I'm now even able to do a global search, which failed completely before), "create" is not one of them. Out of curiosity I left it running overnight and it still had not finished!
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp > /dev/null
real 0m2.008s
user 0m0.020s
sys 0m0.000s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/169.254.1.0
{"reservations":[{"ip":"169.254.1.203","mac":"11:22:33:39:ac:40","hostname":"blah"}],"leases":[]}
real 0m29.741s
user 0m0.020s
sys 0m0.000s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/169.254.1.0/11:22:33:39:ac:40
{"ip":"169.254.1.203","mac":"11:22:33:39:ac:40","hostname":"blah","subnet":"169.254.1.0/255.255.255.0"}
real 0m29.665s
user 0m0.012s
sys 0m0.012s
$ time curl -3 -H "Accept:application/json" -k -X GET https://localhost:8443/dhcp/find/11:22:33:39:ac:40
{"ip":"169.254.1.203","mac":"11:22:33:39:ac:40","hostname":"blah","subnet":"169.254.1.0/255.255.255.0"}
real 0m31.368s
user 0m0.008s
sys 0m0.012s
$ time curl -3 -H "Accept:application/json" -k -X DELETE https://localhost:8443/dhcp/169.254.1.0/11:22:33:39:ac:40
real 0m28.294s
user 0m0.012s
sys 0m0.008s
$ time curl -3 -H "Accept:application/json" -k -X POST https://localhost:8443/dhcp/169.254.1.0 -d 'mac=11:22:33:39:ac:40' -d 'ip=169.254.1.203' -d 'hostname=blah'
^C
real 612m30.877s
user 0m0.420s
sys 0m0.664s
What this tells me is that there is some kind of loop somewhere that prevents things from operating normally, and I really hope this dataset will help you identify it. Please keep me posted on progress, and let me know if you need more fake data generated for you; I can bump up the number of subnets if needed.
Thanks!
Updated by Konstantin Orekhov about 11 years ago
Any thoughts on this, folks? I was hoping that providing a sample with a large dataset would trigger some updates, but there's been nothing for 19 days now...
Thanks!
Updated by Lukas Zapletal about 11 years ago
Konstantin,
this is great feedback, thanks for all the measurements. Out of curiosity, what Ruby version are you running the smart-proxy on?
The config/lease loading definitely needs more love; the parsing code is a bit ugly and all in-memory. There is also the possibility of creating a Ruby "pickled" (marshalled) version of both files and checking the timestamps on every request, which would avoid re-parsing when nothing has changed; a sketch follows. We are also discussing the possibility of putting a database into the smart proxy, which could open up other possibilities.
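A minimal sketch of the marshalling idea, assuming a hypothetical parse_config helper standing in for the existing parser (the paths and names are illustrative only, not actual smart-proxy code):

CONF  = '/etc/dhcp/dhcpd.conf'
CACHE = '/var/cache/foreman-proxy/dhcpd.conf.marshal'

# Re-parse only when dhcpd.conf is newer than the cached, Marshal-dumped map.
def load_subnets
  if File.exist?(CACHE) && File.mtime(CACHE) >= File.mtime(CONF)
    Marshal.load(File.binread(CACHE))            # cache hit: skip parsing entirely
  else
    subnets = parse_config(CONF)                 # hypothetical stand-in for the slow parser
    File.binwrite(CACHE, Marshal.dump(subnets))  # "pickle" the parsed structure
    subnets
  end
end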
Updated by Konstantin Orekhov about 11 years ago
I use ruby 1.8.7.
Thanks for considering the changes. What priority do you think this issue should have?
Updated by Ohad Levy about 11 years ago
- Description updated (diff)
Another thought for improving the performance is to use inotify: keep the structure in memory until the lease/conf file changes, and only re-parse it then.
This way we won't need to re-parse it on every request; a sketch follows.
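A minimal sketch of that idea, assuming the rb-inotify gem and a hypothetical parse_leases helper (the eventual implementation may look quite different):

require 'rb-inotify'

LEASES = '/var/lib/dhcpd/dhcpd.leases'
leases = parse_leases(LEASES)   # hypothetical stand-in for the existing parser

# Ask the kernel to tell us when the file changes, instead of re-parsing on
# every request. A production version would also handle dhcpd's periodic
# rewrite of the lease file (a rename, not a modify).
notifier = INotify::Notifier.new
notifier.watch(LEASES, :modify) { leases = parse_leases(LEASES) }
Thread.new { notifier.run }     # consume inotify events in the background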
Updated by Konstantin Orekhov about 11 years ago
Yes, I completely agree that parsing both the config and lease files on every request is the first thing that has to go. I agree on timestamp checking for the config files as well, but the lease file could still change too rapidly (timestamp-wise, not necessarily new-data-wise) in large environments.
I really think you should consider using omshell over re-reading or keeping state of the lease file. Since omshell already provides a lot of the same checks implemented by the smart-proxy (like duplicate entries, etc.), removing that duplication from your Ruby code and relying more on omshell (since it is much faster at dealing with the lease file) is the way to go here.
Thanks!
Updated by Ohad Levy almost 10 years ago
- Related to Feature #8210: Implemented caching for smart-proxy puppet classes added
Updated by Ohad Levy almost 10 years ago
- Related to deleted (Feature #8210: Implemented caching for smart-proxy puppet classes)
Updated by Andrew Cooper almost 9 years ago
I realize this is an old ticket, but I figured I would start here. We are seeing much higher load on our DHCP server / Foreman smart-proxy after upgrading to 1.9.2 from 1.8. It appears that every puppet run now calls into the DHCP proxy with something similar to "/dhcp/subnet/IP". Looking at the smart-proxy's debug logging, it enumerates all IP addresses in all subnets before returning results, and we are seeing calls take 15 to 20 seconds each. Should it be enumerating all leases, or is the subnet-based lookup not coded to look only in the specific subnet?
Our DHCP infrastructure is not overly large (8 /24 subnets in Foreman, approximately 70% used).
I can provide additional information or start a new ticket if needed.
Thanks,
Andrew
Updated by Anonymous almost 9 years ago
- Related to Bug #12392: 100% cpu usage on foreman-proxy DHCP calls added
Updated by Konstantin Orekhov almost 9 years ago
Dmitri, do the changes from http://projects.theforeman.org/issues/11866 help here at all? Or are they even supposed to?
Updated by Anonymous almost 9 years ago
It's an improvement, albeit a small one. The biggest improvement will come when we stop parsing the lease file on every request, something I'm planning to work on once https://github.com/theforeman/smart-proxy/pull/312 gets merged.
Updated by Konstantin Orekhov almost 9 years ago
Great! One thing, though: if you read through my notes from two years ago, parsing the lease file is not the only problem (a); parsing all of the config files is another issue IMHO (b).
To fix problem (a), I believe relying more on the features of OMAPI itself is the way to go, as the Foreman smart-proxy does not really need to duplicate the safeguards (like checking for duplicate records) already implemented by OMAPI. In fact, OMAPI does these checks much, much faster than the smart-proxy, so I think this is where the biggest benefit will come from.
However, OMAPI will not solve problem (b), as there is no way to get subnet info through OMAPI :(
To get the list of subnets, the config files would still have to be parsed for ISC, but that parsing should be made more intelligent: cache the result and reuse it unless changes are detected.
Does this make sense?
Updated by Anonymous almost 9 years ago
Yeah, I was thinking I could use OMAPI for all calls except the 'unused_ip' one, which will still require parsing the lease file. The dhcpd config file doesn't change nearly as often as the lease file, so using a cache there will work.
Updated by Konstantin Orekhov almost 9 years ago
What would prevent you from using OMAPI, instead of parsing the whole lease file, to make sure an IP has not already been given away?
But regardless of the implementation, I'm SO glad this is finally getting some traction! Thank you! Any idea on the timeframe? :)
Updated by Anonymous almost 9 years ago
I need to find an IP that might be usable first; the choices I have at the moment are parsing the lease file or pinging the whole subnet. I can start once https://github.com/theforeman/smart-proxy/pull/312 is merged. Any help with reviewing and/or testing that PR would be greatly appreciated!
Updated by Konstantin Orekhov almost 9 years ago
I see. Well, I think it is always a good idea to ping an IP before suggesting it, to make sure it is not live on the network outside of DHCP's knowledge. This will help prevent possible duplicates; something like the sketch below would do.
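A minimal sketch of that safety check, using Linux iputils ping flags (purely illustrative; nothing like this exists in the proxy today):

# Treat an IP as taken if anything answers a single ping probe.
# -c 1: one probe; -W 1: wait at most one second for a reply.
def in_use?(ip)
  system('ping', '-c', '1', '-W', '1', ip, :out => File::NULL, :err => File::NULL)
end

candidate = '10.109.86.199'
puts "#{candidate} is already live on the network" if in_use?(candidate)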
As for the review: I'm afraid I personally can't help you with that, as I'm not a reviewer or even a proper developer, for that matter :) I just run Foreman at scale, so I sometimes run into issues that most people simply don't see.
Updated by The Foreman Bot over 8 years ago
- Status changed from New to Ready For Testing
- Assignee set to Anonymous
- Pull request https://github.com/theforeman/smart-proxy/pull/409 added
Updated by Anonymous over 8 years ago
- Related to Bug #1090: When editing DHCP records, ISC backend times out if the number of subnets is large. added
Updated by Konstantin Orekhov over 8 years ago
Dmitri, what do I need to do to try this out in my environment as well?
I'm running 1.11.1 and 1.11.2 now; is your PR applicable to these versions?
Thanks!
Updated by Anonymous over 8 years ago
The PR isn't going to work with the 1.11 branch. The easiest thing to do at the moment is a source install from the "develop" branch. Alternatively, you can install a nightly foreman-proxy build (get it here: http://yum.theforeman.org/nightly) and apply this PR to it.
Updated by Dominic Cleal about 8 years ago
- Release set to 160
Updated by The Foreman Bot about 8 years ago
- Pull request https://github.com/theforeman/foreman-packaging/pull/1250 added
Updated by Anonymous about 8 years ago
- Status changed from Ready For Testing to Closed
- % Done changed from 0 to 100
Applied in changeset 7bd71b5efd38b609f5acf80bf3b5c899b3bd7e1c.
Updated by Dominic Cleal about 8 years ago
- Related to Bug #16021: inotify queue overflow halts ISC lease file monitoring added
Updated by Dominic Cleal almost 8 years ago
- Related to Bug #17301: ISC DHCP known reservations/leases not updated over NFS added
Updated by Konstantin Orekhov almost 8 years ago
Hello, guys!
Thanks for working on this, but I still have an issue even after upgrading to 1.13.1. After the upgrade and the modifications to dhcp.yml, a restart of the smart-proxy took 4 hours (based on the debug messages in the log), then it went quiet for another 4 hours and finally self-disabled the dhcp_isc module. Please see the proxy.log here: https://gist.github.com/korekhov/4630630292626d25f192e03d13d33651
I do have a large dataset, and there are a lot of networks I'd love to exclude from the smart-proxy, as it would never manage them anyway; /30s and /31s, for example. Just to give you an idea of how many subnets of each size I have in one of the locations (and there are several like this):
746 255.255.252.0
817 255.255.254.0
909 255.255.255.0
101 255.255.255.128
168 255.255.255.192
5454 255.255.255.252
8328 255.255.255.254
As you can see, if I could blacklist */255.255.255.254 and */255.255.255.252 (or whitelist everything but those), the smart-proxy would have to deal with a dramatically smaller number of networks. A sketch of what I have in mind follows.
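A minimal sketch of such a filter (purely hypothetical; no such option exists in dhcp.yml today, and all_subnets stands in for the parsed subnet list):

# Drop unmanageable subnets by netmask before building the in-memory map,
# so the /31s and /30s never enter the data structures at all.
BLACKLISTED_MASKS = ['255.255.255.254', '255.255.255.252']

def manageable_subnets(all_subnets)
  all_subnets.reject { |s| BLACKLISTED_MASKS.include?(s.netmask) }
end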
In either case, can we re-open this issue?
Thanks!
Updated by Dominic Cleal almost 8 years ago
Please open a new issue for any further problems; this one is linked to changes that have already shipped, so it won't be re-opened.
Updated by Anonymous almost 8 years ago
- Related to Bug #17373: ISC dhcp provider is unable to handle very big networks added