I received a question today about HA restarting VMs that have lost access to their storage. The question was whether HA would try to restart the VM and give up after 5 attempts, with the follow-up question whether HA would try again when the storage returned for duty.
By default HA will try to restart a VM up to 5 times in roughly 30 minutes. If the master does not succeed within those attempts it will stop trying. On top of that, HA maintains a “compatibility list”. This list contains the details of which VMs can be restarted and where; in other words, which hosts have access to the datastores and network portgroups required for a given VM to power on successfully. Now if for whatever reason there are no compatible hosts available for the restart, HA will not try to restart the VM at all.
But what if the problem is resolved? As soon as the problem is resolved, and reported as such, the compatibility list will be updated. When the list is updated, HA will resume the restart attempts.
It might also be good to know that if for whatever reason the master fails, a new master will take over and continue trying to restart the VM. It will start with 5 fresh attempts and will not take into account the number of restart attempts made by the previous master.
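To make that logic a bit more tangible, here is a minimal Python sketch of the behavior described above. It is purely illustrative: the names (restart_vm, get_compatible_hosts, and so on) are mine and do not come from the actual HA code.

```python
MAX_ATTEMPTS = 5  # default: up to 5 restart attempts in roughly 30 minutes

def restart_vm(vm, get_compatible_hosts, power_on, wait_for_list_update):
    """Toy model of a single master's restart loop for one VM."""
    attempts = 0
    while attempts < MAX_ATTEMPTS:
        hosts = get_compatible_hosts(vm)
        if not hosts:
            # No host has access to the VM's datastores/portgroups:
            # HA makes no restart attempt (and burns no attempt) until
            # the compatibility list is updated.
            wait_for_list_update()
            continue
        attempts += 1
        if power_on(vm, hosts[0]):
            print(f"{vm} restarted on {hosts[0]} (attempt {attempts})")
            return True
    print(f"giving up on {vm} after {MAX_ATTEMPTS} attempts")
    return False

# Tiny simulation: the storage (and with it a compatible host) returns
# after the second check of the compatibility list.
if __name__ == "__main__":
    checks = iter([[], [], ["esx02"]])
    restart_vm(
        "vm01",
        get_compatible_hosts=lambda vm: next(checks),
        power_on=lambda vm, host: True,
        wait_for_list_update=lambda: None,
    )
    # A newly elected master would simply call restart_vm again from
    # scratch, which is why it starts with 5 fresh attempts.
```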
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
Mike says
Hi Duncan,
I was recently testing this in my lab. I have two hosts using software iSCSI. On one host, I pulled the network cable from the physical NIC that was assigned to my storage. I only had 1 VM powered on at that time, and it never restarted. I’m not sure what I did wrong.
NiTRo says
Hi Duncan, that sounds right, but in NFS environments it often happens to me that when HA tries to restart the VMs while the storage is out, the VM state switches to orphaned or even invalid, and then the chaos begins… It also happens with VM monitoring (still with NFS).
What do you think about that?
Duncan Epping says
Is that with 4 or 5? I have never seen that behavior and will try to reproduce it in my lab next week if I have the time.
NiTRo says
It was 4.0 and 4.1, and we stopped using VM monitoring because of that. We planned to script the VM monitoring state depending on the storage type to work around that issue.
By the way, we also saw strange behavior when connectivity problems occurred on NFS: sometimes the VMX file would end up empty. I asked VMware support why the VMX would be rewritten during a power cycle, but I got a nice “Nexenta is not supported, goodbye”, so I never figured it out.
Harry says
(EDITED) There’s absolutely no need for name calling. This is not the place for it! (EDITED)
Joe says
I’ve been disappointed with the lack of storage connectivity consideration when it comes to an HA event. We’ve had some switch and HBA issues in the past that caused storage to go down for only one host, but HA never restarted the VMs on the other hosts that were fine. I made numerous calls to VMware about this and was told that having storage disappear isn’t considered a failure; the host has to go down completely. This is with both 4 and 5.