vSphere HA Isolation response when using IP Storage

Duncan Epping · Dec 15, 2011

Last week one of my colleagues asked me about the vSphere HA Isolation Response and IP storage. His customer had an iSCSI storage infrastructure (this applies to NFS as well) and had recently implemented a new vSphere environment. When one of the hosts was isolated, virtual machines were restarted and users started reporting strange problems.

What happened was that the vSphere HA Isolation Response was configured to “Leave Powered On”, and as both the Management Network and the iSCSI Network were isolated there was no “datastore heartbeating” and no “network heartbeating”. Because the datastores were unavailable, the locks on the VMDKs (virtual disk files) expired and HA was able to restart the VMs.

Now please note that HA/ESXi will power off (or actually kill) the “ghosted VM” (the original instance, still running on the host that lost its network connection) when it detects that the locks cannot be re-acquired. It still means that between the moment the restart happens and the moment the isolation event is resolved, the IP address and the MAC address of the VM can potentially show up on the network twice. Of course this will only happen when your virtual machine network isn’t isolated, and as you can imagine this is not desired.

When you are running IP-based storage, it is highly (!!) recommended to configure the isolation response to “power off”! For more details on configuring the isolation response please read this article, which lists the best practices / recommendations.
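To make the reasoning above a bit more concrete, here is a minimal Python sketch of the scenario. It is a simplified model for illustration only, not how HA is implemented: the names `IsolationResponse`, `IsolationEvent` and `duplicate_on_network` are hypothetical, and the model ignores timing entirely (for example, it treats “shutdown” the same as “power off”, even though a guest shutdown takes longer than a hard power off).

```python
from dataclasses import dataclass
from enum import Enum


class IsolationResponse(Enum):
    """The three isolation responses discussed in this post (names are illustrative)."""
    LEAVE_POWERED_ON = "leave powered on"
    SHUTDOWN = "shutdown"
    POWER_OFF = "power off"


@dataclass
class IsolationEvent:
    """Hypothetical description of what a host has lost during an isolation event."""
    management_network_isolated: bool
    storage_network_isolated: bool
    vm_network_isolated: bool
    response: IsolationResponse


def duplicate_on_network(event: IsolationEvent) -> bool:
    """Return True if the original VM and its restarted copy could both be
    visible on the VM network (duplicate IP and MAC address).

    Simplifying assumption: any response other than "leave powered on"
    stops the original VM, so no duplicate can appear.
    """
    if not event.management_network_isolated:
        # The host is not isolated, so no isolation response and no restart.
        return False
    if event.response is not IsolationResponse.LEAVE_POWERED_ON:
        # The original VM is stopped by the isolation response.
        return False
    if not event.storage_network_isolated:
        # The host still holds the VMDK locks, so the VM cannot be restarted.
        return False
    # Locks expire and the VM is restarted elsewhere; the duplicate is only
    # visible if the original VM can still reach the VM network.
    return not event.vm_network_isolated


# The customer scenario: management and iSCSI networks isolated,
# VM network still up, response set to "leave powered on".
customer_case = IsolationEvent(True, True, False, IsolationResponse.LEAVE_POWERED_ON)
print(duplicate_on_network(customer_case))
```

In this model the customer's configuration is the only combination that yields a duplicate: switching the response to “power off”, or keeping the datastore reachable, makes the function return `False`.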

Comments

Rickard Nobel says

15 December, 2011 at 22:31

How long is the host expected to take, after being reconnected to the network, to learn that someone else has “taken” the VMDKs?

Could it be reasonable to change the Isolation Response back to Shutdown when only having IP storage and no FC?
Duncan Epping says

16 December, 2011 at 00:18

The “power off” is a matter of seconds. And yes, you can select “shutdown”, but keep in mind that it could take up to 5 minutes before a VM is actually down in that case.
Rickard Nobel says

16 December, 2011 at 00:23

So the network confusion with multiple MAC and IP addresses is just a few seconds maximum?

And yes, the “shutdown” has a 5-minute default before hard power off, so if the network comes back again before that it won’t help.
Duncan says

16 December, 2011 at 01:02

It would be from the time the second VM is powered on until the first VM is powered off. If “leave powered on” is selected, this will be when it is detected that the lock cannot be reclaimed, and that could take a while…
Satinder Sharma says

17 December, 2011 at 04:29

Hi Duncan

How is HA handled in the case of NFS? Is the datastore heartbeat available to NFS datastores too, or is it VMFS only?

Thanks
Bilal Hashmi says

17 December, 2011 at 10:13

I guess in this case a power off / shutdown isolation response would have been a bad thing. The environment would have powered itself down. It’s not like “leave powered on” worked out great either, but in the end, as long as we know what happened, it’s great. Thanks for sharing this, Duncan.
forbsy says

19 December, 2011 at 16:25

I’m still somewhat confused. Why would there be duplicate IPs and MACs? What is your recommendation for IP-based storage (NFS or iSCSI)?
forbsy says

19 December, 2011 at 16:29

What would happen if the host was isolated but the datastore was not, and the isolation response is “leave powered on”? It seems that the VM would continue to function as normal, but which host would “own” the VM at that point?
Duncan says

19 December, 2011 at 17:23

@forbsy: there would be duplicate IPs because both VMs would be on the network. If the datastore is not isolated, the VM cannot be restarted as the VMDK would be locked.
Duncan says

19 December, 2011 at 17:23

@SATINDER SHARMA: NFS is the same concept.
udubplate says

10 January, 2013 at 19:16

You mentioned this is best practice only for IP storage, so I assume this does not apply to FCoE, but can you confirm? It seems like the root of the issue is not so much that it’s IP storage as that it’s on the same wire/switches. In other words, the isolation response recommendation is being made because, in an IP storage scenario, it is likely that both will be down at the same time if the host is isolated. But wouldn’t the same be true in a converged network using block storage?
Duncan says

10 January, 2013 at 19:46

Correct, iSCSI / NFS / Converged… It could all lead to the same scenario.
