I had a question on my HA Deepdive which I thought was worth answering in an article:
How does the active primary node decide where to restart failed VMs? Does it use a round-robin algorithm for selecting a host to start the VMs in restart priority order? What happens if the remaining nodes are imbalanced, especially without DRS enabled; are the nodes that have no spare capacity skipped? Or, does the active primary node restart VMs on the least busy host first, then the next busy host, etc?
Also, if VMs have no reservation for CPU or memory set, how does HA decide the number of VMs to restart on any one node? Is it possible that HA will restart too many VMs on one node so that performance is extremely poor until DRS moves some VMs to other nodes?
In the past (pre 4.1), HA would consider the utilization of the hosts and go through a check for every VM that needed to fail over. It would fail the VM over to the host with the most available resources. From a “latency” perspective that is not the best approach, as you can imagine, with latency meaning the time it takes to restart the VMs and the delay caused by hostd. What type of delay can be caused by hostd? Well, let’s assume you have one host which is not doing a lot; this host would be the one selected for most failovers. Having 10 VMs (or more) starting in parallel will beat up hostd severely.
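To illustrate why "most available resources" piles restarts onto one host, here is a minimal Python sketch. It is not HA's actual code: the host names, the free-capacity numbers, the per-VM demand, and the function `place_on_least_loaded` are all made up for illustration, and it assumes each placement reduces the target host's free capacity by the VM's demand.

```python
from collections import Counter


def place_on_least_loaded(vms, free_capacity):
    """Greedy placement: every VM goes to the host with the most free capacity.

    vms: list of (vm_name, demand) tuples (hypothetical demand, e.g. MHz).
    free_capacity: dict of host name -> free capacity (hypothetical numbers).
    """
    free = dict(free_capacity)  # work on a copy, leave the input untouched
    placement = {}
    for vm, demand in vms:
        host = max(free, key=free.get)  # host with the most free capacity
        placement[vm] = host
        free[host] -= demand  # assumption: the VM consumes its demand at once
    return placement


# Two busy hosts and one that is "not doing a lot".
hosts = {"esx01": 2000, "esx02": 2500, "esx03": 12000}
vms = [(f"vm{i:02d}", 500) for i in range(1, 11)]

placement = place_on_least_loaded(vms, hosts)
per_host = Counter(placement.values())
print(per_host)
```

With these made-up numbers all ten restarts land on the single idle host, which is exactly the situation where its hostd has to handle every power-on in parallel.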
So does HA use DRS to select which host to use for the restart? No, it doesn’t. DRS happens at the vCenter level and HA happens at the host level. But more things have changed. As of vSphere 4.1, virtual machines will be evenly distributed across hosts to lighten the load on the hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load later if required. This improvement results in faster restarts of the virtual machines and less stress on the ESX hosts.
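As a rough picture of what "evenly distributed" means for the number of power-ons per host, the sketch below spreads restarts round-robin over the surviving hosts. Round-robin is my assumption here purely for illustration; the exact mechanism HA uses is not spelled out above, and the host names, VM names, and `place_evenly` are hypothetical.

```python
from collections import Counter
from itertools import cycle


def place_evenly(vms, hosts):
    """Spread restarts over hosts in turn, ignoring current utilization.

    vms: list of VM names, already sorted in restart priority order.
    hosts: list of surviving host names.
    """
    targets = cycle(hosts)  # esx01, esx02, esx03, esx01, ...
    return {vm: next(targets) for vm in vms}


hosts = ["esx01", "esx02", "esx03"]
vms = [f"vm{i:02d}" for i in range(1, 11)]

placement = place_evenly(vms, hosts)
per_host = Counter(placement.values())
print(per_host)
```

No host has to power on more than one VM above any other, so the load on each hostd stays small. The trade-off is that utilization is ignored at restart time, which is why rebalancing afterwards matters.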
So what if you are not using DRS? To put it bluntly, make sure you manually balance your environment to ensure HA doesn’t “overload” a single host. That is the only thing you can do for now. (By the way, all of this is included in the HA and DRS tech deepdive :-))
