
Yellow Bricks

by Duncan Epping


vsphere ha

Partition / Isolation and VM flip flopping between hosts?

Duncan Epping · May 16, 2016 ·

Last week I was talking to one of our developers at our R&D offsite. He had a situation where he saw his VMs flip flopping between two hosts while testing a certain failure scenario, and he wondered why that was. In his case he had a 2-node cluster connected to vCenter Server and a bunch of VMs running on just one host. All of the VMs were running off iSCSI storage. When looking at vCenter he would literally see his VMs on host 1 and a split second later on host 2, and this would go on continuously. I have written about this behaviour before, but figured it never hurts to repeat it, as not everyone goes back 2-3 years to read up on certain scenarios.

In the above diagram you see a VM running on the first host. vCenter Server is connected to both hosts through Network A, the datastore being used is on Network C, and the host management network is connected through Network B. Now just imagine that Network B is gone for whatever reason. The hosts won’t be able to ping each other any longer. Although this is an isolation, the VMs will still have access to the central datastore, and depending on how the isolation response is configured the VMs may or may not be restarted. Either way, as the datastore is still there, even if the isolation response is set to “disabled” / “leave powered on” the VM will not be restarted on the second host: the VM’s files are locked on that datastore, and you cannot have 2 locks on those files.

Now if Network B and C are gone simultaneously, this could potentially pose a problem. Just imagine this to be the case. The hosts are still able to communicate with vCenter Server, however they cannot communicate with each other (the isolation response will be triggered if configured), and the VM will lose access to storage (Network C is down). If no isolation response was configured (disabled or leave powered on) then the VM on the first host will remain running. But as the second host has noticed that the first host is isolated, it doesn’t see the VM any longer, and the lock on those files is gone, it is capable of restarting that VM. Both hosts, however, are still connected to vCenter Server and will send it updates about the inventory they are running… And that is when you will see the VM flip flopping (also sometimes referred to as ping-ponging) between those hosts.

And this, this is exactly why:

  1. It is recommended to configure an Isolation Response based on the likelihood of a situation like this occurring.
  2. If you have vSphere 6.0 or higher, you should enable APD/PDL responses, so that the VM running on the first host will be killed when storage is gone.
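To make the reasoning above a bit more tangible, here is a toy model in Python of the two-host scenario. It is only a sketch of the logic described in this post, not how HA is implemented: the class, field, and function names are made up for illustration, it assumes host 2 keeps access to the datastore, and it reduces the isolation response and the APD/PDL response to simple on/off switches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical, simplified view of host 1 in the 2-node cluster."""
    host1_mgmt_up: bool            # Network B: host-to-host management traffic
    host1_storage_up: bool         # Network C: host 1's access to the datastore
    isolation_powers_off_vm: bool  # False models "disabled" / "leave powered on"
    apd_response_enabled: bool     # APD/PDL responses (vSphere 6.0 or higher)


def hosts_reporting_vm(s: Scenario) -> int:
    """Return how many hosts report the VM as running to vCenter Server."""
    if s.host1_mgmt_up:
        # No isolation, so host 2 has no reason to restart the VM.
        return 1

    # Host 1 is isolated. Does the original instance keep running?
    original_running = True
    if s.isolation_powers_off_vm:
        original_running = False
    if not s.host1_storage_up and s.apd_response_enabled:
        original_running = False

    # The lock on the VM's files only survives while the original instance
    # is running AND host 1 can still reach the datastore.
    lock_held = original_running and s.host1_storage_up

    # Assumption: host 2 still has datastore access and restarts the VM
    # as soon as the lock is gone.
    restarted_on_host2 = not lock_held

    return int(original_running) + int(restarted_on_host2)


def flip_flops(s: Scenario) -> bool:
    """Two hosts reporting the same VM is what shows up as flip flopping."""
    return hosts_reporting_vm(s) == 2


if __name__ == "__main__":
    worst_case = Scenario(
        host1_mgmt_up=False,
        host1_storage_up=False,
        isolation_powers_off_vm=False,
        apd_response_enabled=False,
    )
    print("flip flopping:", flip_flops(worst_case))
```

In this model the only combination that ends with two hosts reporting the VM is the one described above: Network B and C both gone, the isolation response left at disabled / leave powered on, and no APD/PDL response. Turning on either of the two recommendations brings the count back to one.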

I hope this helps…

What happens to VMs when a cluster is partitioned?

Duncan Epping · Jul 1, 2015 ·

I had this question this week around what happens to VMs when a cluster is partitioned. Funny thing is that with questions like these it seems like everyone is thinking the same thing at the same time. I had the question on the same day from a customer running traditional storage who had a network failure across racks, and from a customer running Virtual SAN who just wanted to know how this situation was handled. The question boils down to this: what happens to the VM in “Partition 1” when the VM is restarted in “Partition 2”?

The same can be asked for a traditional environment, the only difference being that you wouldn’t see those “disk groups” at the bottom but a single datastore. In that case a VM can be restarted when a disk lock is lost… What happens to the VM in Partition 1 that has lost access to its disk? Does the isolation response kick in? Well, if you have vSphere 6.0 then VMCP can potentially help, because if you have a single datastore and you’ve lost access to it (APD) then the APD response can be triggered. But if you don’t have vSphere 6.0, or don’t have VMCP configured, or if you have VSAN, what would happen? Well, first of all, it is a partition scenario and not an isolation scenario. On both sides of the partition HA will have a master and hosts will be able to ping each other, so there is absolutely no reason to invoke the “isolation response” as far as HA is concerned. The VM will be restarted in Partition 2 while it is still running in Partition 1. You will either need to kill it manually in Partition 1, or you will need to wait until the partition is lifted. When the partition is lifted the kernel will realize it no longer holds the lock (as it lost it to another host) and it will kill the impacted VMs instantly.
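The difference between a partition and an isolation can be sketched in a few lines of Python. This is a deliberately simplified model and not the actual HA logic: it only looks at which hosts can still reach each other over the management network, and it ignores isolation addresses and datastore heartbeats. The function names and host names are hypothetical.

```python
def find_partitions(hosts, links):
    """Group hosts into sets that can still reach each other.

    hosts: list of host names
    links: list of (host_a, host_b) pairs that can still communicate
    """
    neighbours = {h: set() for h in hosts}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    partitions = []
    for host in hosts:
        if host in seen:
            continue
        # Depth-first walk over everything reachable from this host.
        group, stack = set(), [host]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        partitions.append(group)
    return partitions


def classify(partition, total_hosts):
    """Label a group of hosts; only "isolated" would invoke the isolation response."""
    if len(partition) == total_hosts:
        return "connected"
    if len(partition) == 1:
        return "isolated"
    return "partitioned"


if __name__ == "__main__":
    hosts = ["esx1", "esx2", "esx3", "esx4"]
    # A failure across racks: esx1/esx2 and esx3/esx4 can only see each other.
    links = [("esx1", "esx2"), ("esx3", "esx4")]
    for group in find_partitions(hosts, links):
        print(sorted(group), classify(group, len(hosts)))
```

With the four example hosts split across two racks, both groups come out as "partitioned" and neither as "isolated", which is why the isolation response is not invoked on either side.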



About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.

Copyright Yellow-Bricks.com © 2025