I have a lot of discussions with customers about stretched clusters, but also about regular vSphere clusters. Something that often comes up is what happens in an isolation or partition scenario. Fairly often customers (but also VMware employees) use those words interchangeably. However, a partition is not the same as an isolation. They are two different scenarios, and as a result each has a different type of response associated with it. Before I explain the difference between the two responses, what is a partition and what is an isolation?
- An isolation event is the situation where a single host cannot communicate with the rest of the cluster. Note: single host!
- A partition is the situation where two (or more) hosts can communicate with each other, but can no longer communicate with the remaining two (or more) hosts in the cluster. Note: two or more!
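To make the distinction concrete, here is a minimal sketch in Python. It is a toy model, not how vSphere HA is implemented: the function name `classify` and the host names are made up for illustration, and it only looks at host-to-host reachability (a real isolation declaration also involves the isolation address, which is covered further down).

```python
def classify(host, reachable, cluster):
    """Classify a host's view of the cluster (hypothetical helper).

    host      -- name of the host doing the check
    reachable -- set of other hosts it can still communicate with
    cluster   -- set of all host names in the cluster
    """
    others = set(cluster) - {host}
    peers = set(reachable) & others
    if peers == others:
        return "healthy"      # can talk to every other host
    if not peers:
        return "isolated"     # single host, cannot talk to anyone
    return "partitioned"      # can talk to some hosts, but not all


cluster = {"esx1", "esx2", "esx3", "esx4"}
print(classify("esx1", set(), cluster))        # isolated
print(classify("esx1", {"esx2"}, cluster))     # partitioned
```

The only thing that separates the two outcomes is whether the host still has at least one peer it can reach.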
Why is that such a big deal? Well, the responses in these two scenarios are different. The response/result is also determined by the type of configuration you have. Let’s break down the scenarios one by one, including the type of infrastructure used (when it is relevant).
Isolation
When a host is isolated it will:
- start an election process
- declare itself master
- ping the isolation address
- declare itself isolated
- power off / shut down VMs (when this is configured)
- communicate through the connected datastores that it is isolated
Once that has happened, vSphere HA can restart the VMs on the remaining hosts in the cluster. Note that in the case of vSAN, an isolated host cannot write to the datastore, so it won’t perform that step. Yet the workloads will still have been powered off / shut down, so it is safe for vSphere HA to restart them.
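The isolation sequence can be sketched as a simple function that returns the steps a host walks through. This is an illustration only, under my own assumptions: the names `IsolationResponse` and `isolation_steps` are hypothetical and do not correspond to any vSphere API, and the real HA agent does far more than this.

```python
from enum import Enum


class IsolationResponse(Enum):
    """Hypothetical stand-in for the configured isolation response."""
    DISABLED = "disabled"
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"


def isolation_steps(can_ping_isolation_address, response, storage_is_vsan):
    """Return the ordered steps taken by a host that lost its peers."""
    steps = [
        "start election",
        "declare self master",
        "ping isolation address",
    ]
    if can_ping_isolation_address:
        # The isolation address answers, so the host does not
        # declare itself isolated and no response is triggered.
        return steps
    steps.append("declare self isolated")
    if response is not IsolationResponse.DISABLED:
        # Only when an isolation response is configured.
        steps.append(f"{response.value} VMs")
    if not storage_is_vsan:
        # With vSAN an isolated host cannot write to the datastore.
        steps.append("report isolation via datastores")
    return steps


for step in isolation_steps(False, IsolationResponse.POWER_OFF, False):
    print(step)
```

Calling it with `storage_is_vsan=True` yields the same list without the final datastore step, which matches the vSAN note above.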
Partition (traditional storage)
When two or more hosts are partitioned (they can still communicate with each other) and the vSphere HA master is not part of their partition, the hosts in that partition will:
- start an election process
- declare a master in the partition
- figure out what has happened to the hosts and VMs in the other partition
- restart any VMs that somehow were impacted, or that now appear to be powered off while their last known state was powered on
- if all VMs are still running, vSphere HA won’t try to restart any. This is the expected result!
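The restart decision in the last two steps boils down to comparing the last known power state with the currently observed one. A minimal sketch, assuming a hypothetical helper `vms_to_restart` and made-up state strings (this is not the actual HA logic or data model):

```python
def vms_to_restart(last_known, current):
    """Return VMs that were powered on but now appear powered off.

    last_known -- dict of VM name -> last known power state
    current    -- dict of VM name -> power state as seen right now
    """
    return sorted(
        vm
        for vm, state in last_known.items()
        if state == "poweredOn" and current.get(vm) == "poweredOff"
    )


last_known = {"vm-a": "poweredOn", "vm-b": "poweredOn", "vm-c": "poweredOff"}
current = {"vm-a": "poweredOn", "vm-b": "poweredOff", "vm-c": "poweredOff"}
print(vms_to_restart(last_known, current))  # ['vm-b']
```

If every VM is still running, the list is empty and nothing is restarted, which is exactly the expected result in a plain partition.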
Partition (vSAN stretched)
When the partition scenario happens in a stretched vSAN environment there’s an extra (potential) step. Along the way vSAN will identify all VMs which have no accessible components and kill those VMs so they can be restarted in the partition which has quorum. In this scenario you have three locations: two for data and one for the witness. If a data site loses access to the other locations, then that data site is partitioned (the hosts within the site can still communicate with each other), and as such the isolation response is not triggered. However, vSAN will still kill these VMs as they are rendered useless (they have lost access to disk).
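To illustrate the quorum idea, here is a deliberately simplified sketch. It assumes every VM has components spread across both data sites plus the witness, so a partition keeps access only when it contains a majority of the three locations. The site names and the helpers `has_quorum` and `vms_killed` are hypothetical, and real vSAN evaluates accessibility per object rather than per site.

```python
SITES = {"preferred", "secondary", "witness"}


def has_quorum(partition):
    """True when the partition holds a majority of the three locations."""
    return len(set(partition) & SITES) * 2 > len(SITES)


def vms_killed(partition, vms_by_site):
    """Return VMs running in a partition that has lost quorum.

    partition   -- set of sites that can still talk to each other
    vms_by_site -- dict of site name -> list of VMs running there
    """
    if has_quorum(partition):
        return []
    return sorted(
        vm for site in partition for vm in vms_by_site.get(site, [])
    )


vms_by_site = {"preferred": ["vm-a"], "secondary": ["vm-b"]}
# The secondary site loses access to both other locations.
print(vms_killed({"secondary"}, vms_by_site))             # ['vm-b']
print(vms_killed({"preferred", "witness"}, vms_by_site))  # []
```

In this toy model the VMs in the cut-off data site are killed, and the partition that still has quorum is where vSphere HA can restart them.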
I know it is just semantics, but nevertheless I do feel it is important to understand the difference between an isolation and a partition, especially as the response (and who responds) is different in these situations. Hope it helps.
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **