I had a question from one of my colleagues last week about the vSphere HA Isolation Response and IP storage. His customer had an iSCSI storage infrastructure (this applies to NFS as well) and had recently implemented a new vSphere environment. When one of the hosts became isolated, virtual machines were restarted and users started reporting strange problems.
What happened was that the vSphere HA Isolation Response was configured to “Leave Powered On”, and because both the Management Network and the iSCSI Network were isolated there was neither “network heartbeating” nor “datastore heartbeating”. Because the datastores were unavailable, the locks on the VMDKs (virtual disk files) expired, which allowed HA to restart the VMs.
Now please note that HA/ESXi will power off (or actually kill) the “ghosted VM” (the original instance, still running on the host that lost its network connection) once it detects that the locks cannot be re-acquired. Even so, between the moment the restart happens and the moment the isolation event is resolved, the IP address and MAC address of the VM can potentially show up on the network twice. Of course this will only happen when your virtual machine network isn’t isolated as well, and as you can imagine this is not desired.
When you are running IP-based storage, it is highly (!!) recommended to configure the isolation response to “Power off”! For more details on configuring the isolation response, please read this article, which lists the best practices / recommendations.
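To make the reasoning above a bit more concrete, here is a minimal sketch in Python that models the scenario as a few boolean checks. This is not how HA is implemented, and it ignores timing entirely (for instance, how long lock expiry or a guest shutdown takes). The class, the function names and the response strings are all hypothetical and only serve to illustrate the logic described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationScenario:
    """Hypothetical, simplified description of an isolated host."""
    isolation_response: str      # "leave_powered_on", "power_off" or "shutdown"
    storage_reachable: bool      # can the isolated host still reach its datastores?
    vm_network_reachable: bool   # is the VM network of the isolated host still up?


def original_vm_keeps_running(s: IsolationScenario) -> bool:
    # Only "leave powered on" keeps the original VM instance alive.
    return s.isolation_response == "leave_powered_on"


def ha_can_restart(s: IsolationScenario) -> bool:
    # A restart needs the VMDK locks to be free: either the VM was
    # powered off / shut down, or the isolated host lost its storage
    # and the locks expired.
    return (not original_vm_keeps_running(s)) or (not s.storage_reachable)


def duplicate_on_network(s: IsolationScenario) -> bool:
    # Two instances with the same IP/MAC are visible only when the
    # original keeps running, a second copy is restarted, and the
    # original's VM network is not isolated.
    return (original_vm_keeps_running(s)
            and ha_can_restart(s)
            and s.vm_network_reachable)


if __name__ == "__main__":
    for response in ("leave_powered_on", "power_off"):
        scenario = IsolationScenario(response, storage_reachable=False,
                                     vm_network_reachable=True)
        print(response, "-> duplicate possible:", duplicate_on_network(scenario))
```

In this simplified model the only combination that produces a duplicate is “leave powered on” together with unreachable IP storage and a still-connected VM network, which is exactly the situation the customer ran into.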
Rickard Nobel says
How long is it expected to take, after the host is reconnected to the network, before it knows that someone else has “taken” the VMDKs?
Could it be reasonable to change the Isolation Response back to Shutdown when only having IP storage and no FC?
Duncan Epping says
The “power off” is a matter of seconds. And yes, you can select “shutdown”, but keep in mind that it could take up to 5 minutes before a VM is actually down in that case.
Rickard Nobel says
So the network confusion with multiple MAC and IP addresses is just a few seconds maximum?
And yes, the “shutdown” has a 5-minute default before hard power off, so if the network comes back again before that it won’t help.
Duncan says
It would be from the time the second VM is powered on until the first VM is powered off. If “leave powered on” is selected, this will be when it is detected that the lock cannot be reclaimed, and that could take a while….
Satinder Sharma says
Hi Duncan
How is HA handled in the case of NFS? Is the datastore heartbeat available to NFS datastores too, or is it VMFS only?
Thanks
Bilal Hashmi says
I guess in this case a power off/shut down isolation response would have been a bad thing.. The environment would have powered itself down. It’s not like the leave powered on worked out great either.. but in the end as long as we know what happened.. it’s great. Thanks for sharing this Duncan.
forbsy says
I’m still somewhat confused. Why would there be duplicate IP’s and MAC’s? What is your recommendation for IP based storage (NFS or iSCSI)?
forbsy says
What would happen if the host was isolated but the datastore was not – and the isolation response is ‘leave powered on’? It seems that the vm would continue to function as normal but what host would ‘own’ the vm at that point?
Duncan says
@forbsy: there would be duplicate IPs because both VMs would be on the network. If the Datastore is not isolated the VM cannot be restarted as the VMDK would be locked.
Duncan says
@SATINDER SHARMA: NFS is same concept.
udubplate says
You mentioned this is best practice only for IP storage so I assume this does not apply to FCoE, but can you confirm? It seems like the root of the issue is not that it’s IP storage, it’s that it’s on the same wire/switches more so. In other words, the isolation response recommendation is being made because it is likely in an IP storage scenario that both will be down at the same time if isolated, but wouldn’t the same be true in a converged network using block storage?
Duncan says
Correct, iSCSI / NFS / Converged… It could all lead to the same scenario.