Although this is a corner-case scenario, I did want to discuss it to make sure people are aware of this change. Prior to vSphere 5.0 Update 1, a virtual machine would be restarted by HA when the master detected that the state of the virtual machine had changed compared to the “protectedlist” file. In other words, a master filters the VMs it thinks have failed before trying to restart any. Prior to Update 1, a master used only the protection state it read from the protectedlist file. If the master did not know the on-disk protection state for a VM, it did not try to restart it. Keep in mind that only one master can open the protectedlist file in exclusive mode.
In Update 1 this logic has changed slightly. HA can now retrieve the state information from either the protectedlist stored on the datastore or from vCenter Server. So now multiple masters could try to restart a VM. If one of those restarts fails, for instance because a “partition” does not have sufficient resources, the master in the other partition might be able to restart it. Although these scenarios are highly unlikely, this behavior change was introduced as a safety net!
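To make the difference concrete, here is a minimal sketch of the filter a master applies before attempting a restart. This is not HA's actual code: the function names, the "protected" value, and the use of None for "state unknown to this master" are all assumptions made for illustration. It also assumes that in Update 1 either source reporting the VM as protected is enough to make it a restart candidate.

```python
from typing import Optional

# Hypothetical marker for a VM that HA is expected to protect.
PROTECTED = "protected"


def restart_pre_update1(on_disk_state: Optional[str]) -> bool:
    """Before Update 1: only the on-disk protectedlist is consulted.

    None models a master that does not know the on-disk state, for
    example because another master holds the file in exclusive mode.
    """
    return on_disk_state == PROTECTED


def restart_update1(on_disk_state: Optional[str],
                    vcenter_state: Optional[str]) -> bool:
    """Update 1: state may come from the datastore or from vCenter Server."""
    return PROTECTED in (on_disk_state, vcenter_state)


# A master in a second partition that cannot read the protectedlist:
print(restart_pre_update1(None))         # False: the VM is filtered out
print(restart_update1(None, PROTECTED))  # True: this master may also try
```

In this model the second master becomes an additional restart candidate, not a replacement for the first one, which matches the safety-net idea described above.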
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because they are currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
Sketch says
Time to update the book!
Andreas Peetz says
Hi Duncan,
thanks for pointing this out. The title of your post reminded me of another “change in the restart behavior with 5.0 Update 1”. Not related to HA, but to the VM autostart functionality for stand-alone hosts. It looks like this is broken when using the free license:
http://v-front.blogspot.de/2012/03/esxi-50-update-1-breaks-vm-autostarts.html
Do you know if VMware is aware of this bug (I hope it is only a bug and not by intention?!) and working on a solution?
– Andreas
Duncan Epping says
Yes VMware is aware and working on it: http://bit.ly/GT2Gbn
SteveB says
Hello Duncan, great site!
Sorry if this is off topic, but can you please pass the word to VMware, and whoever publishes the recent Technical White Paper documentation, to stop using the 2-inch left margin of white space in their PDF files? When reading them on any e-reader you have to zoom in more often to read the papers. What is the reason for the big margin on the left side?
Keep up the great work. Steve B.
Martin says
Hi,
In case of a partition scenario, would it be possible for both masters to successfully start up the VM?
Duncan says
No, as soon as the first master powers on the VM the files will be locked.
xzearik says
Hi,
I would like to know the exact, vSphere-defined response time of VMware HA in the following scenario.
Two hosts in an HA cluster running two VMs with default restart priority. When the power cord is removed from one server, what should the standard response time be?
– How many seconds does HA need to conclude that the host is down?
– How many seconds will both VMs take to move from the lost host to the live host?
– What is the VMware-guaranteed total time for a VM to actually be powered on at the surviving host on the first attempt? (ONLY the VM powered on, not including the OS loading.)
xzearik
Duncan Epping says
That is not an easy question to answer as it depends on whether the host is a master or a slave. Even then it will depend on various variables, so there is no hard guaranteed time. When a slave fails, HA takes around 18-20 seconds to conclude it has failed and queue the power-on attempts.
For a master this takes slightly longer: about 35-40 seconds, as a new master will need to be elected etc.
Hope that helps.
xzearik says
Thanks for the clarification. However, I would like to know if there is any parameter available in the HA config which defines the time, and whether it can be changed to an admin-defined value.
The purpose of this inquiry is to determine the minimum and maximum time required to have a VM powered on in a node failure situation.
I would appreciate it if you considered both cases while answering:
Total time to wait before a VM is moved and powered on at the surviving node if a slave failed?
Total time to wait before a VM is moved and powered on at the surviving node if a master failed?
Duncan Epping says
Not that I can share, unfortunately. If 40 seconds before restarts is too long, I think your best option is either vSphere FT or an application-level clustering service.
xzearik says
Alright.. Thanks for the useful tip.
tushar says
Does HA restart (proper reboot) or reset (unexpected reboot) a VM during a host failure? If it performs a proper reboot, then in which scenarios is an unexpected reboot detected by the guest OS during HA?