Looking from the outside, the way HA behaves in 5.0 might not seem any different, but it is. I will call out some changes in how VM restarts are handled, and I would like to refer you to our book for more in-depth details. These are the things I want to point out:
- Restart priority changes
- Restart retry changes
- Isolation response and detection changes
Restart priority changes
First thing I want to point out is a change in the way the VMs are prioritized for restarts. I have listed the full order in which virtual machines will be restarted below:
- Agent virtual machines
- FT secondary virtual machines
- Virtual Machines configured with a high restart priority
- Virtual Machines configured with a medium restart priority
- Virtual Machines configured with a low restart priority
So what are these Agent VMs? These are VMs that provide a service such as virus scanning, or edge services like those vShield can provide. FT secondary virtual machines make sense, I guess, and so does the rest of the list. Keep in mind, though, that if the restart of one of them fails, HA will continue restarting the remaining virtual machines.
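The ordering above can be sketched as a simple sort. This is only an illustration of the list, not how HA is implemented: the `VM` class, the category labels, and the `try_restart` callback are all hypothetical names made up for the example.

```python
from dataclasses import dataclass

# Restart order as listed above; the label strings are hypothetical.
RESTART_ORDER = ["agent", "ft_secondary", "high", "medium", "low"]


@dataclass
class VM:
    name: str
    category: str  # one of RESTART_ORDER


def restart_queue(vms):
    """Return the VMs sorted into the restart order listed above."""
    rank = {category: i for i, category in enumerate(RESTART_ORDER)}
    # sorted() is stable, so VMs within one category keep their input order.
    return sorted(vms, key=lambda vm: rank[vm.category])


def restart_all(vms, try_restart):
    """Attempt every VM in order; a failed restart does not stop the rest.

    try_restart is a hypothetical callback returning True on success.
    Returns the names of the VMs whose restart failed.
    """
    failed = []
    for vm in restart_queue(vms):
        if not try_restart(vm):
            failed.append(vm.name)
    return failed
```

Note that `restart_all` keeps going after a failure, which mirrors the point that HA continues with the remaining virtual machines.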
Restart retry changes
I explained in the past how restart retries worked for 4.1: the total number of restart attempts would be 6 by default, 1 initial restart plus 5 retries as defined by “das.maxvmrestartcount”. With 5.0 this behavior has changed and the maximum number of restart attempts is 5 in total. Although it might seem like a minor change, it is important to realize. The timeline has also changed slightly, and this is what it looks like with 5.0:
- T0 – Initial Restart
- T2m – Restart retry 1
- T6m – Restart retry 2
- T14m – Restart retry 3
- T30m – Restart retry 4
The “m” stands for minutes, and it should be noted that the next retry will happen “X” after the master has detected that the restart has failed. So in the case of T0 and T2m, the retry could actually happen after 2 minutes and 10 seconds.
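To make the arithmetic concrete, here is a minimal sketch of the timeline. It assumes the gaps between attempts are 2, 4, 8 and 16 minutes, which is simply what the listed timestamps imply, and that each gap starts counting once the failure of the previous attempt has been detected. The function names and the per-attempt detection lag are hypothetical, and nothing here says what happens beyond the fifth attempt.

```python
from itertools import accumulate

# Gaps in minutes between consecutive attempts, derived from the
# T0 / T2m / T6m / T14m / T30m list above (an inference, not a documented setting).
RETRY_GAPS_MIN = [2, 4, 8, 16]


def nominal_timeline_minutes():
    """Attempt times in minutes, ignoring failure-detection time."""
    return [0] + list(accumulate(RETRY_GAPS_MIN))


def timeline_with_detection_lag(lag_seconds):
    """Attempt times in seconds when each failure takes some time to detect.

    lag_seconds[i] is the (hypothetical) time the master needs to notice
    that attempt i failed; the retry gap only starts after that.
    """
    t = 0
    times = [t]
    for gap_min, lag in zip(RETRY_GAPS_MIN, lag_seconds):
        t += lag + gap_min * 60
        times.append(t)
    return times
```

With a 10 second detection lag on the initial restart, `timeline_with_detection_lag([10, 0, 0, 0])` puts the first retry at 130 seconds, the "2 minutes and 10 seconds" case mentioned above.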
Isolation response and detection changes
Another major change is in the Isolation Response and Isolation Detection mechanism. Again, from the outside it looks like not much has changed, but actually a lot has, and I will try to keep it simple and explain what has changed and why this is important to realize.
First is the deprecation of “das.failuredetectiontime”. I know many of you used this advanced setting to tweak when the host would trigger the isolation response; that is no longer possible and, to be honest, no longer needed. If you’ve closely read my other articles you hopefully picked up on the datastore heartbeating part already, which is one reason for not needing this anymore. The other reason is that before the isolation response is triggered, the host will actually validate whether virtual machines can be restarted and that it isn’t an all-out network outage. Most of us have been there at some point: a network admin decides to upgrade the switches and all hosts trigger the isolation response at the same time… well, that won’t happen anymore!
One thing that has changed because of that is the time it takes before a restart will be initiated. I have listed the timelines for the isolation of a slave and the isolation of a master below:
Isolation of a slave
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
Isolation of a master
- T0 – Isolation of the host (master)
- T0 – Master pings “isolation addresses”
- T5s – Master declares itself isolated and “triggers” isolation response
After the completion of this sequence, the (new) master will learn the host was isolated and will restart virtual machines based on the information provided by the slave.
As shown, there is a clear difference. The reason is that when the master isolates there is no need to trigger an election process, whereas a slave needs one to detect whether it is isolated or partitioned. Once again, before the isolation response is triggered the host will validate whether a host will be capable of restarting the virtual machines… no need to incur downtime when it is unnecessary.
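The difference between the two timelines can be worked out with a small sketch. The timestamps are copied from the two lists above; the constant and function names are hypothetical and only serve the illustration.

```python
# (seconds after isolation, event), copied from the timelines above.
SLAVE_TIMELINE = [
    (0, "Isolation of the host (slave)"),
    (10, "Slave enters election state"),
    (25, "Slave elects itself as master"),
    (25, "Slave pings isolation addresses"),
    (30, "Slave declares itself isolated and triggers isolation response"),
]

MASTER_TIMELINE = [
    (0, "Isolation of the host (master)"),
    (0, "Master pings isolation addresses"),
    (5, "Master declares itself isolated and triggers isolation response"),
]


def seconds_to_isolation_response(timeline):
    """Time from isolation until the isolation response is triggered."""
    return timeline[-1][0]


def seconds_from_ping_to_response(timeline):
    """Time between pinging the isolation addresses and declaring isolation."""
    ping_time = next(t for t, event in timeline if "pings" in event)
    return timeline[-1][0] - ping_time
```

In both cases the host declares itself isolated 5 seconds after pinging the isolation addresses; the extra 25 seconds for a slave is the time spent on the election before that ping.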
I would suggest reading this article twice to fully absorb all the detailed changes. The book contains more details than this, so if you are interested, pick it up.
Andrey Vakhitov says
Hello. In vSphere 4.1, if two hosts fail, first all virtual machines from the first host are restarted according to their priorities, and thereafter the virtual machines from the second host are restarted.
Did this change in 5.0?
Duncan Epping says
The master will wait for 10 seconds to aggregate all failures and then determine restarts based on that.
Greg says
I assume the protectedlist file is simply a hardcoded list of all the virtual machines that HA thinks it is protecting at that time?
Greg says
Andrey, I am not sure about that but I would think it would be determined by the restart priorities of the virtual machines and if there were suitable hosts left to power them on. I seem to recall reading that there is a maximum of 32 concurrent power on operations for HA. Duncan may have more info though on that.
Duncan Epping says
Keep in mind: 32 max concurrent PER host 🙂
Bilal Hashmi says
Are appliances and other services like vMA also considered Agent VMs?
SERGIO says
Congratulations, great series of articles about vSphere 5 HA. Regarding this: “a network admin decides to upgrade the switches and all hosts trigger the isolation response at the same time… well that won’t happen anymore!”
If you are using iSCSI storage, this scenario can still occur, because you lose connectivity between hosts and datastore heartbeating too.
Am I right?
I know that in this case it is normal to trigger the isolation response, as you have neither network nor storage, but I want to confirm it. Thanks.
AB says
How would a host know that this is an agent server and that it should be restarted first?
Duncan Epping says
A vendor would need to tag them as an agent VM as explained in this PDF: http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-ext-solutions-50.pdf
Jim Moyle says
Duncan,
If you extend the number of retries beyond five, how many minutes are there between the retries after the 30 minutes?