I had a question last week, and it had me going for a while. The question was if “das.perHostConcurrentFailoversLimit” could be used to lower the hit on storage during a boot storm. By default this advanced option is set to 32. Meaning that a max of 32 VMs will be restarted by HA on a single host. The question was if lowering this value to for instance 16 would help reducing the stress on storage when multiple hosts would fail, or for instance in a blade environment when a chassis would fail.
At first you would probably say “Yes of course it will”. Having only 16 restarts concurrently vs 32 should cut the stress in half… Well not exactly. The point here is that this setting is:
- A per host setting and not cluster wide
- Addressing power on attempts
So what is the problem with that exactly? Well in the case of the per host setting, if you have a 32 node cluster and 8 would fail, there would still be a max of 384 VMs power on attempts concurrently. (32 – 8 failed host) * 16 VMs max restart per host. Yes it is a lot better than 768, but still a lot of VMs hitting your storage.
But more importantly, we are talking power-on attempts here! A power-on attempt does not equal the boot process of the virtual machine! It is just the initial process that flips the switch of the VM from “off” to “on”, check vCenter when you power on a VM, you will see the task as completed during the boot process of your VM. Reducing this number will reduce the stress hostd, but that is about it. In other words, if you lower it to 16 you will have less power-on attempts concurrently, but they will be handled fast by HOSTD and before you know it 16 new power-on attempts will be done, and near simultaneous!
The only way you can really limit the hit on storage and virtual machines sharing this storage would be by enabling Storage IO Control. SIOC will ensure that all VMs who are in need of storage resources will get it in a fair manner. The other option is to ensure that you are not overloading your datastores with a massive amount of VMs and not the IOPS to back the boot storm process up. I guess there is no real need to be overly concerned here though… How often does it happen that 50% of your environment fails? If it does, are you worried about that 15 minute performance hit, or worried about those 50% of the VMs being down?