I’m going to start with a quote from Mike’s article “What I learned today…”:
Split brain is an HA situation where an ESX host becomes “orphaned” from the rest of the cluster because its primary service console network has failed. As you might know the COS network is used in the process of checking if an ESX host has suffered an untimely demise. If you fail to protect the COS network by giving vSwitch0 two NICs or by adding a 2nd COS network to say your VMotion switch, undesired consequences can occur. Anyway, the time for detecting split brain used to be 15 seconds, for some reason this has changed to 12 seconds. I’m not 100% sure why, or if in fact the underlying value has changed – or that VMware has merely corrected its own documentation. You see it’s possible to get split brain in Vi3.5 happening if the network goes down for more than 12 seconds, but comes back up on the 13th, 14th or 15th second. I guess I will have to do some research on this one. Of course, the duration can be changed – and split brain is a trivial matter if you take the necessary network redundancy steps…
I thought this issue was common knowledge, but if Mike doesn’t know about it, my guess is that most of you don’t either. Before we dive into Mike’s article: technically this is not a split brain. It is an “orphaned VM” situation, not a scenario where the disk files and the in-memory VM are split between hosts.
Before we start, this setting is key in Mike’s example:
das.failuredetectiontime = the period a host waits, after it has stopped receiving heartbeats from another host, before declaring that host dead.
The default value is 15 seconds. In other words the host will be declared dead on the fifteenth second and a restart will be initiated by one of the primary hosts.
For now, let’s assume the isolation response is “power off”. The VMs can only be restarted on another host once the running instances have been powered off. Here’s the clue: the isolated host initiates the “power off” (the isolation response) 2 seconds before das.failuredetectiontime expires.
Does this mean that you can end up with your VMs being down and HA not restarting them?
Yes. When the heartbeat returns between the 13th and 15th second, the shutdown could already have been initiated. The restart, however, will not be initiated, because the returning heartbeat indicates that the host is not isolated.
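To make the timing concrete, here is a minimal sketch of the three possible outcomes, using only the numbers from this post: a 15 second das.failuredetectiontime and an isolation response that fires 2 seconds earlier. The function name, the Outcome enum and the exact boundary handling (half-open intervals) are my own assumptions for illustration. This is not how HA is implemented.

```python
from enum import Enum


class Outcome(Enum):
    NO_ACTION = "heartbeat back in time, nothing happens"
    VMS_DOWN_NOT_RESTARTED = "VMs powered off, no restart initiated"
    FAILOVER = "VMs powered off and restarted on another host"


def classify_outage(outage_s, failure_detection_s=15, isolation_lead_s=2):
    """Classify a heartbeat outage of outage_s seconds (hypothetical model).

    failure_detection_s: das.failuredetectiontime, in seconds as used in this post.
    isolation_lead_s: how long before that the isolation response fires.
    """
    isolation_s = failure_detection_s - isolation_lead_s  # 13 by default
    if outage_s < isolation_s:
        # Heartbeat returned before the isolated host acted.
        return Outcome.NO_ACTION
    if outage_s < failure_detection_s:
        # "Power off" already fired, but the host was never declared dead.
        return Outcome.VMS_DOWN_NOT_RESTARTED
    # Host declared dead, a primary host initiates the restart.
    return Outcome.FAILOVER


for seconds in (10, 14, 20):
    print(seconds, classify_outage(seconds).value)
```

Only the middle case, an outage that ends between the 13th and 15th second, leaves the VMs down without a restart.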
How can you avoid this?
Pick “Leave VM powered on” as an isolation response. Increasing the das.failuredetectiontime will also decrease the chances of running into issues like these.
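Why does a larger das.failuredetectiontime help? A rough sketch, assuming the same 2 second lead for the isolation response as described above. The helper name and the sample outage durations are made up for illustration.

```python
def harmful_outages(failure_detection_s, outages_s, isolation_lead_s=2):
    """Return the outage durations (seconds) that end after the isolation
    response has fired but before the host is declared dead."""
    isolation_s = failure_detection_s - isolation_lead_s
    return [o for o in outages_s if isolation_s <= o < failure_detection_s]


sample_outages = [5, 13, 14, 30]  # hypothetical outage durations in seconds

print(harmful_outages(15, sample_outages))  # default of 15 seconds
print(harmful_outages(60, sample_outages))  # increased to 60 seconds
```

With the default, the 13 and 14 second outages fall in the problematic window. With 60 seconds none of the sample outages do. Note that the window itself does not disappear, it only moves to the 58th to 60th second, so this reduces the chance of hitting it rather than removing it.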
Did this change?
No, it has been like this since it was introduced.
Mike Laverick says
Yep, I’m with you on that… What was new to me was the fact some of the VMware Courseware now states 12 seconds not 15. Does that mean the das.failuredetectiontime has changed OR that the VMware courseware is trying to be more technically accurate – to take into account the 13th, 14th and 15th second. To be honest I find this a largely academic debate – because one should always have the redundancy in place to avoid split brain occurring in the first instance…
Duncan Epping says
The courseware has been changed indeed. And I verified it with an HA developer and he told me it was the 13th second indeed.