I had a question last week, and it had me going for a while. The question was whether “das.perHostConcurrentFailoversLimit” could be used to lower the hit on storage during a boot storm. By default this advanced option is set to 32, meaning that a maximum of 32 VMs will be restarted concurrently by HA on a single host. The question was whether lowering this value to, for instance, 16 would help reduce the stress on storage when multiple hosts fail, for example when a chassis fails in a blade environment.
At first you would probably say “Yes, of course it will”. Having only 16 concurrent restarts instead of 32 should cut the stress in half… Well, not exactly. The point here is that this setting is:
- A per-host setting and not cluster-wide
- Addressing power-on attempts
So what exactly is the problem with that? Well, because it is a per-host setting, if you have a 32-node cluster and 8 hosts fail, there can still be a maximum of 384 concurrent power-on attempts: (32 hosts – 8 failed hosts) * 16 VMs per host. Yes, that is a lot better than 768, but it is still a lot of VMs hitting your storage.
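To make the worst-case math explicit, here is a minimal sketch using the numbers from the example above; plug in your own cluster size, failure count and limit:

```python
# Worst-case concurrent HA power-on attempts; figures are the example from this post.
cluster_size = 32      # hosts in the cluster
failed_hosts = 8       # hosts that failed, e.g. a blade chassis
per_host_limit = 16    # das.perHostConcurrentFailoversLimit (default is 32)

surviving_hosts = cluster_size - failed_hosts
max_concurrent_power_ons = surviving_hosts * per_host_limit
print(max_concurrent_power_ons)   # 384 with a limit of 16, 768 with the default of 32
```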
But more importantly, we are talking about power-on attempts here! A power-on attempt does not equal the boot process of the virtual machine! It is just the initial process that flips the switch of the VM from “off” to “on”. Check vCenter when you power on a VM: you will see the task marked as completed while your VM is still booting. Reducing this number will reduce the stress on hostd, but that is about it. In other words, if you lower it to 16 you will have fewer concurrent power-on attempts, but they will be handled quickly by hostd and before you know it the next 16 power-on attempts will be issued, near simultaneously!
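To illustrate the point, here is a toy simulation. This is not a model of HA or hostd internals; the 5-second task time and 180-second boot time are made-up assumptions. It simply shows that because the power-on task finishes long before the guest has booted, the number of VMs booting (and hammering storage) at the same time barely changes when you lower the limit:

```python
# Toy simulation: throttling power-on *tasks* barely throttles the boot storm itself.
# Assumptions (made up, not VMware numbers): a power-on task occupies its slot for
# ~5 seconds, while the guest OS boot hits storage for ~180 seconds.
POWER_ON_TASK_SEC = 5
GUEST_BOOT_SEC = 180
VMS_TO_RESTART = 384

def peak_booting(concurrency_limit: int) -> int:
    """Peak number of VMs booting (i.e. hitting storage) at the same moment."""
    start_times = []
    t = 0
    started = 0
    while started < VMS_TO_RESTART:
        # One batch of power-on tasks runs, then the next batch is issued.
        batch = min(concurrency_limit, VMS_TO_RESTART - started)
        start_times.extend([t] * batch)
        started += batch
        t += POWER_ON_TASK_SEC
    peak = 0
    for now in range(t + GUEST_BOOT_SEC):
        booting = sum(1 for s in start_times if s <= now < s + GUEST_BOOT_SEC)
        peak = max(peak, booting)
    return peak

for limit in (32, 16, 8):
    print(f"limit {limit:>2}: up to {peak_booting(limit)} VMs booting at once")
# With these assumptions, limits of 32 and 16 both peak at 384 VMs booting
# simultaneously, and even a limit of 8 still peaks at roughly 288.
```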
The only way to really limit the hit on storage, and on the virtual machines sharing that storage, is by enabling Storage IO Control. SIOC will ensure that all VMs that need storage resources get them in a fair manner. The other option is to make sure you are not overloading your datastores with a massive number of VMs without the IOPS to back a boot storm up. I guess there is no real need to be overly concerned here though… How often does it happen that 50% of your environment fails? And if it does, are you more worried about that 15-minute performance hit, or about 50% of your VMs being down?
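For readers who want to see the fairness idea in isolation, the sketch below shows the general proportional-shares concept, not SIOC’s actual implementation: during congestion each VM’s slice of the device queue is sized according to its disk shares, so a single VM cannot monopolize the datastore. The VM names, share values and queue depth are made up for illustration.

```python
# Proportional-shares sketch (conceptual only, not SIOC's real algorithm):
# split a congested datastore's device queue according to per-VM disk shares.
def queue_slots(shares: dict, queue_depth: int) -> dict:
    total = sum(shares.values())
    return {vm: max(1, round(queue_depth * s / total)) for vm, s in shares.items()}

# Hypothetical VMs and share values on a datastore with a queue depth of 64.
print(queue_slots({"tier1-db": 2000, "web01": 1000, "test-vm": 500}, queue_depth=64))
# -> {'tier1-db': 37, 'web01': 18, 'test-vm': 9}
```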
Michael Webster says
Also, no matter what the concurrent power-on limit is, it won’t stop the VMs that are powering on from using the entire available device queue when booting. I’d much rather keep the default settings and have the VMs powered back on as quickly as possible and live with a brief performance impact. I would also limit the number of VMs per datastore to limit the impact on a per-datastore basis. As you suggested, I would also have SIOC turned on to even out the load distribution and improve fairness.
Josh Odgers says
I tend to agree with Michael. The majority of my customers would opt for an impact on storage performance as a trade-off to get the VMs back online ASAP. I would suggest any VMware design, and more importantly any storage design, would be incomplete without taking this and similar scenarios into consideration. Good topic, Duncan.
Marko says
Duncan, IMHO your guess is wrong for some installations. Maybe we can call it a product lifecycle issue. There are more and more VMware systems out there which run industrial applications. Not all of the admins watch carefully whether their storage still has enough resources (I/O!) and so on. If some hosts fail, the complete system crashes, and a 5-minute outage of the servers causes a 50-minute outage of production.
Duncan Epping says
@Marko: So how do you mitigate that? I suspect that when 50% of an environment is down, chances are your whole production stops anyway. But I’m not talking about corner-case scenarios here of course; I am talking about general deployments.
Vaughn Stewart says
Duncan,
Your insight and knowledge are spot on. The IO load placed on a storage array by simultaneously booting a large number of VMs is well understood with VMware View (virtual desktops); however, many don’t apply this understanding to virtual server events such as an HA restart or SRM failover.
This is an example where intelligent caching in the storage array can really make a difference. The dedupe intelligence in a NetApp array allows it to cache the binaries of an operating system and application once, regardless of how many VMs are being started. This eliminates the redundancy in the IO requests, significantly reducing the I/O on the disk subsystem and giving the array a greater ability to support the ‘boot storm’ without impacting overall performance.
This capability is unique to NetApp. For more information, your readers can read the following two-part post:
http://virtualstorageguy.com/2010/03/16/transparent-storage-cache-sharing-part-1-an-introduction/
http://virtualstorageguy.com/2010/03/22/transparent-storage-cache-sharing-part-2-more-use-cases/
Thanks for addressing the topic,
@vStewed
Mihai says
Well, as others said, you can use Storage IO Control to give super priority to the really important VMs.
Also, as others pointed out, this would be poor design anyway: if industrial production is so important, why mix VMs on the same datastore/disks? Other bad things besides an HA event could happen too.
Vaughn Stewart says
Mihai,
You asked, “Why mix VMs on the same datastore/disks? Other bad things could happen too besides an HA event.”
Shared pools of storage allow VI admins to provision virtual disks to VMs without having to engage a storage admin. This model reduces the operational overhead required to provide storage services to a VMware environment.
As for the HA boot storm issue: whether one deploys VMs at a 1:1 VM-to-LUN ratio or, as most do, in a MANY:1 design, the IO load on the shared storage array (SAN or NAS) does not change. So one must deploy an array that can handle the load, or limit the IO load at the hypervisor via SIOC.
Mihai says
@Stewart Yes, you have a good point about management. But I wasn’t necessarily suggesting a 1:1 VM-to-LUN ratio; rather, a separate datastore dedicated to Tier 1 VMs on separate array disks or even controllers (depending on budget and the performance guarantee needed; you could even get away with just a separate LUN that has higher priority at the controller level) if you really want to be absolutely sure of their performance.
Markus says
@VAUGHN STEWART
The cache thing is a very interesting point, but I think this is something very specific and unique to NetApp.
Dedup aware cache rocks!
forbsy says
NetApp does a nice job with dedupe-aware cache, but it’s on the storage side and it’s only a read cache. I’m not knocking Flashcache or other storage caches like FAST, but I’m starting to position my customers toward cache solutions closer to the server, if not within the server itself. By addressing cache at this layer you’re taking out extra hops to the storage, which means less latency. These caching solutions also optimize both read and write I/O, which is important for addressing not only boot/login storms but also write operations, which in most cases make up the majority of I/O.
Yug Krishna says
Hi Duncan,
I am leaving this question here as I did a lot of googling and couldn’t find a definitive answer.
The question is as follows:
“What is the background process involved when a Virtual Machine is powered on on an ESX host?”
Does it assign the virtual hardware and then POST follows as normal, or is there anything VMware-specific that happens in the background?
Please help me get this answered.
Thanks and Regards,
-Yug Krishna