Last week I did an article about Datastore Heartbeating and the prevention of the Isolation Response being triggered. Apparently this was an eye-opener for some and I received a whole bunch of follow-up questions through Twitter and email. I figured it might be good to write up my recommendations around the Isolation Response. Now I would like to stress that these are my recommendations based on my understanding of the product, not based on my understanding of your environment or SLA. When applying these recommendations always validate them against your requirements and constraints. Another thing I want to point out is that most of these details are part of our book, pick it up… the e-book is cheap.
First of all, I want to explain Isolation Response…
Isolation Response is the action HA triggers, per VM, when a host is network isolated from the rest of your cluster. Note the “per VM”: a host will trigger the configured isolation response per VM, which could be either “power off” or “shutdown”. However, before it triggers the isolation response, and this is new in 5.0, the host will first validate whether a master owns the datastore on which the VM’s configuration files are stored. If that is not the case then the host will not trigger the isolation response.
Now let’s assume for a second that the host has been network isolated but a master doesn’t own the datastore on which the VMs’ config files are stored. What happens? Nothing happens. The isolation response will not be triggered as the host knows that there is no master which can restart these VMs; in other words, there is no point in powering down a VM when it cannot be powered on again. The host will of course periodically check if the datastore is claimed by a master.
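The check described above can be sketched in a few lines of Python. This is only a simplified model of the behavior as explained in this post, not HA’s actual implementation; the names `HostState`, `master_owns_datastore` and `isolation_action` are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical snapshot of what an isolated host knows about one VM."""
    network_isolated: bool
    # Does a master own the datastore holding this VM's config files?
    master_owns_datastore: bool
    # One of "leave_powered_on", "shutdown" or "power_off".
    configured_response: str


def isolation_action(state: HostState) -> Optional[str]:
    """Return the action to take for one VM, or None when nothing happens."""
    if not state.network_isolated:
        return None
    if state.configured_response == "leave_powered_on":
        return None
    if not state.master_owns_datastore:
        # No master can restart the VM, so powering it down is pointless.
        # The host would re-run this check periodically.
        return None
    return state.configured_response
```

Note that the function is evaluated per VM, which matches the “per VM” behavior: two VMs on the same isolated host can end up with different outcomes if their config files live on different datastores.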
There’s also a scenario where the complete datastore could be unavailable, in the case of a full network isolation and NFS / iSCSI backed storage for instance. In this scenario the host will power off the VM when it has detected another VM has acquired the lock on the VMDK. It will do this to prevent a so-called split brain scenario, as you don’t want to end up with two instances of your VM running in your environment. Keep in mind that in order to detect this lock the “isolation” on the storage layer needs to be resolved. It can only detect this when it has access to the datastore.
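To make the ordering of events explicit, here is a toy Python model of the lock check. It is a sketch under simplifying assumptions: `VmdkLock` and `on_storage_access_check` are hypothetical names, and the real on-disk locking is far more involved than a single owner field.

```python
class VmdkLock:
    """Toy model of an on-disk lock: exactly one host owns it at a time."""

    def __init__(self, owner: str):
        self.owner = owner

    def try_refresh(self, host: str) -> bool:
        # Refreshing succeeds only if the caller still owns the lock.
        return self.owner == host


def on_storage_access_check(host: str, storage_reachable: bool,
                            lock: VmdkLock) -> str:
    """What the isolated host does with its VM on each check (simplified)."""
    if not storage_reachable:
        # Without datastore access the host cannot see who holds the lock,
        # so it cannot detect that the VM was restarted elsewhere.
        return "keep_running"
    if lock.try_refresh(host):
        return "keep_running"
    # The lock is held elsewhere: the VM was restarted on another host.
    # Power off the local instance to avoid a split brain.
    return "power_off"
```

The first branch is the important one: as long as the datastore is unreachable the function can never return `"power_off"`, which is why the power off only happens once storage access has returned.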
I guess there’s at least a couple of you thinking: but what about the scenario where a master is network isolated? Well, in that case the master will drop responsibility for those VMs and this will allow the newly elected master to claim them and take action if required.
I hope this clarifies things.
Now let’s talk configuration settings. As part of the Isolation Response mechanism there are three ways HA could respond to a network isolation:
- Leave Powered On – no response at all, leave the VMs powered on when there’s a network isolation
- Shutdown VM – guest initiated shutdown, clean shutdown
- Power Off VM – hard stop, equivalent to power cord being pulled out
When to use “Leave Powered On”
This is the default option and more than likely the one that fits your organization best as it will work in most scenarios. When you have a Network Isolation event but retain access to your datastores, HA will not respond and your virtual machines will keep running. If both your Network and Storage environment are isolated then HA will recognize this and power off the VMs when it recognizes that the lock on the VMDKs of the VMs has been acquired by other VMs, to avoid a split brain scenario as explained above. Please note that in order to recognize the lock has been acquired by another host the “isolated” host will need to be able to access the device again. (The power-off won’t happen before the storage has returned!)
When to use “Shutdown VM”
It is recommended to use this option if it is likely that a host will retain access to the VM datastores when it becomes isolated and you wish HA to restart a VM when the isolation occurs. In this scenario, using shutdown allows the guest OS to shut down in an orderly manner. Further, since datastore connectivity is likely retained during the isolation, it is unlikely that HA will shut down the VM unless there is a master available to restart it. Note that there is a timeout period of 5 minutes by default. If the VM has not been gracefully shut down after 5 minutes a “Power Off” will be initiated.
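The timeout behavior can be sketched as follows. This is an illustrative Python model, not how HA is implemented: the callbacks (`request_guest_shutdown`, `is_powered_off`, `power_off`, `sleep`) and the `poll_interval` are assumptions introduced for the example; only the 5 minute default comes from the text above.

```python
from typing import Callable

SHUTDOWN_TIMEOUT_SECONDS = 300.0  # the 5 minute default mentioned above


def shutdown_with_fallback(
    request_guest_shutdown: Callable[[], None],
    is_powered_off: Callable[[], bool],
    power_off: Callable[[], None],
    sleep: Callable[[float], None],
    timeout: float = SHUTDOWN_TIMEOUT_SECONDS,
    poll_interval: float = 5.0,  # assumed polling interval, illustrative only
) -> str:
    """Ask the guest to shut down; hard power off if it takes too long."""
    request_guest_shutdown()
    waited = 0.0
    while waited < timeout:
        if is_powered_off():
            return "clean_shutdown"
        sleep(poll_interval)
        waited += poll_interval
    # One last check before giving up on the guest.
    if is_powered_off():
        return "clean_shutdown"
    power_off()
    return "forced_power_off"
```

Passing `sleep` in as a parameter keeps the sketch testable without actually waiting five minutes; in a real script you would hand it `time.sleep`.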
When to use “Power Off VM”
It is recommended to use this option if it is likely that a host will lose access to the VM datastores when it becomes isolated and you want HA to immediately restart a VM when this condition occurs. This is a hard stop, in contrast to “Shutdown VM”, which is a guest initiated shutdown and could take up to 5 minutes.
As stated, Leave Powered On is the default and fits most organizations as it prevents unnecessary responses to a Network Isolation but still takes action when the connection to your storage environment is lost at the same time.
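If it helps, the three recommendations can be condensed into a small Python helper. The function name and its two yes/no inputs are my own simplification for illustration; your requirements and constraints will usually need more nuance than two booleans.

```python
def recommended_isolation_response(restart_on_isolation: bool,
                                   likely_keeps_datastore_access: bool) -> str:
    """Map the guidance in this post to one of the three settings."""
    if not restart_on_isolation:
        # The default: no response to a pure network isolation.
        return "Leave Powered On"
    if likely_keeps_datastore_access:
        # The guest can still reach its disks, so a clean shutdown is possible.
        return "Shutdown VM"
    # Storage is likely gone as well, so go for the immediate hard stop.
    return "Power Off VM"
```

For example, an environment with IP-based storage on the same network as management, where you want VMs restarted as fast as possible, would call `recommended_isolation_response(True, False)` and land on “Power Off VM”.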
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
Jason Langdon says
Is the ebook available for anything other than Kindle?
Duncan Epping says
No.
Steve says
@Jason, technically it is also available on iPads via the free Kindle app. Just FYI.
Thanks again Duncan and Frank for another invaluable resource.
Johan says
I do not get it.
On the other blogpost you wrote “Let me repeat this, Datastore Heartbeats do not prevent an isolation event from occurring.”
In this post you write “There’s also a scenario where the complete datastore could be unavailable, in the case of a full network isolation and NFS / iSCSI backed storage for instance. In this scenario the host will power off the VM when it has detected another VM has acquired the lock on the VMDK. It will do this to prevent a so-called split brain scenario, as you don’t want to end up with two instances”
Which means it is kind of doing the same thing as you would do by configuring das.isolationaddress and adding your storage IP there?
Second question: “. If both your Network and Storage environment is isolated then HA will recognize this and power-off the VMs when it recognizes the lock on the VMDKs of the VMs have been acquired by other VMs to avoid a split brain scenario as explained above.”
How is it possible for a host without storage to notice that another host locked the VMDK?
Craig Risinger says
It’s after the isolated host becomes unisolated again, or at least regains datastore connectivity. Say host#1 was totally isolated including from storage. The VM process continues to run on host #1, but the VM is off the network like the host. Also the host can’t update the .vmdk files. Surprisingly, a VM can often continue running awhile even without access to its virtual disk.
The other hosts detect host#1 being down. Say host #2 restarts the VM. Because host #1 can’t see the datastore, it doesn’t refresh its locks, and host #2 gets locks on the VM files.
Later, host #1 comes back up (let’s say it gets back all network connectivity as well as datastore). Host #1 tries to refresh locks on the VM files, finds it can’t because host #2 is actively holding locks. Then host#1 shuts down its phantom VM process.
If host #1 did NOT shut down the phantom VM, then you’d have two VMs with identical IP and MAC addresses on the network. So all access to the VM would be spotty. E.g. it would bounce around in VC.
Duncan Epping says
The host will know another host has locked the VMDK when it tries to re-acquire the lock and it is declined.
Datastore Heartbeats are not related to this. It might sound similar conceptually because it is based on the same locking mechanism but it is different.
Johan says
But still, if the host can’t access the storage it can’t notice that another ESX host has restarted the VM. I would understand it if you wrote something like “when the host can access its storage again it will power off the VM”.
So will datastore heartbeat prevent an isolation response or NOT?
Johan says
According to the book, ESXi will detect that the lock on the VMDK has been lost and power the VM off to avoid it running without a disk. According to the blog post it will power it off when another VM has acquired the lock.
Duncan Epping says
Okay, let’s try again:
1) HA datastore heartbeating has got nothing to do with it.
2) A host can only re-acquire a lock when it has access again to the storage of course. So as soon as it has got access, it will try to re-acquire the lock, it will notice it cannot re-acquire it and power it off.
3) HA datastore heartbeating has got nothing to do with it.
Hope that helps,
Johan says
Thank you!
One thing that comes to mind though:
An ESX host has two NICs connected to the storage network (iSCSI). They are connected to different switches. Let’s say one NIC fails and then the switch connected to the other NIC fails. (We lose our storage.)
Will the VMs running on that host power off even if I never get my storage back? (The lock is gone because we can’t update the SCSI lock?)
From what I understand, no? As it can’t validate that a master owns the datastore.
Duncan Epping says
No, it will not power off the VMs, which also makes sense as there’s a huge chance the master has no storage access either. On top of that there’s still the network heartbeat which prevents this from happening.
Forbsy says
When to use “Leave Powered On”
This is the default option and more than likely the one that fits your organization best as it will work in most scenarios. When you have a Network Isolation event but retain access to your datastores HA will not respond and your virtual machines will keep running. If both your Network and Storage environment are isolated then HA will recognize this and power-off the VMs when it recognizes the lock on the VMDKs of the VMs have been acquired by other VMs to avoid a split brain scenario as explained above. Please note that in order to recognize the lock has been acquired by another host the “isolated” host will need to be able to access the device again. (The power-off won’t happen before the storage has returned!)
This is a little confusing. Are you saying that the vm will remain in a powered on state on another host until the isolated host comes back up and has access to the datastore – at that point (which could be 3-5minutes) it will suddenly power off the vm on the other host and then restart it? What is the state of the vm during that 3-5 minutes? Is it accessible? It sounds like it looks like it’s still powered on but since the isolated host still owns the lock what is actually going on with that vm during this gap of time?
Duncan says
I am saying the following:
There are two VMs powered on. One of them has no access to disks and its host is isolated. The VM which was powered on has access to the disk and lives on the “good” host.
Now depending on the type of isolation you might or might not have two VMs actively on the network. So if you have a single vSwitch, more than likely the full network is isolated and it is not a problem. If you have multiple vSwitches it could be the case that only the management network is isolated, and that means you could have two VMs with the same name on the network.
Jeff says
Hi, I have 2 virtual Cisco apps on 2 separate servers that are HBA connected to a VNX datastore. When I lose the connections to the datastore my LUN becomes inactive on the corresponding host. Is there a way I can set up an alert so that if the LUN becomes inactive it will shut down the corresponding VMware host on that server? As soon as that host shuts down, my other VMware host on the other server will become active. At the moment, because the VMware host is still up (even though its datastore is not), it is preventing the other machine from becoming active. The redundancy is done in the Cisco app rather than at the VMware level. The HBAs are dual connected with redundant switches so it is very unlikely that the datastore will become disconnected, but we have to test if it could happen.
I have tried putting an alert on the datastore with a trigger of ‘unavailable to all hosts’ and an action of command ‘shutdown -h now’ but it does not work. TIA, J
Duncan Epping says
Alerts triggering an action should normally work in this case… Not sure why it doesn’t in your case to be honest.
kam says
I’m new to VMware. In your post, do you mean shared storage when you say datastore? iSCSI or NAS?
Does VMware HA require network storage to work? (By “work” I mean automatically restarting a VM on another host.)
Duncan Epping says
vSphere HA requires shared storage: iSCSI / NFS / FC / FCoE, whatever applies here…