clustering

vSAN Stretched: Why is the witness not part of the cluster when the link between a data site and the witness fails?

Duncan Epping · Jun 25, 2024

Last week I received a question about vSAN Stretched which had me wondering for a while what on earth was going on. The person who asked was running through several failure scenarios, some of which I have also documented in the past here. The question was: what is supposed to happen in the scenario shown in the diagram when the link between the preferred site (Site A) and the witness fails?

The answer, or at least so I thought, was simple: all VMs will remain running, or said differently, there’s no impact on vSAN. The test indeed produced the outcome I had documented, which is also described in the Stretched Clustering Guide and the PoC Guide: the VMs remained running. However, one thing that was noticed is that when the connection between Site A and the Witness is lost, the witness is somehow no longer part of the cluster, which is not what I would expect. The reason I would not expect this is that if a second failure were to occur, for instance the ISL between Site A and Site B going down, it would directly impact all VMs. At least, that is what I assumed.

However, when I triggered that second failure and disconnected the ISL between Site A and Site B, I saw the witness reappear immediately, I saw the witness objects go from “absent” to “active”, and more importantly, all VMs remained running. The reason this happens is fairly straightforward: when running a configuration like this, vSAN has a “leader” and a “backup”, and they each run in a separate fault domain. Both the leader and the backup need to be able to communicate with the Witness for it to function correctly. If the connection between Site A and the Witness is gone, then either the leader or the backup can no longer communicate with the Witness, and the Witness is taken out of the cluster.

So why does the Witness return for duty when the second failure is triggered? Well, when the second failure is triggered, the leader is restarted in Site B (as Site A is deemed lost), and the backup is already running in Site B. As both the leader and the backup can communicate with the witness again, the witness returns for duty, and so do all of the components, automatically and instantly. This means that even though the ISL between Site A and Site B failed after the witness was taken out of the cluster, all VMs remain accessible, as the witness is reintroduced instantly to ensure availability of the workload. Pretty cool! (Thanks to vSAN engineering for providing these insights on why this happens!)
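To make the sequence above easier to follow, here is a minimal toy model in Python. It is only a sketch of the rule as described in this post, not how vSAN is implemented: the class name, the site labels and the "leader moves to the backup's site once its own site is isolated" rule are all assumptions made for illustration, and the model only covers the two-step scenario discussed here.

```python
from dataclasses import dataclass, field

SITES = ("A", "B", "W")  # Site A (preferred), Site B, Witness (hypothetical labels)


def _all_links():
    # Every pair of locations starts out connected.
    return {frozenset(pair) for pair in (("A", "B"), ("A", "W"), ("B", "W"))}


@dataclass
class StretchedClusterModel:
    links: set = field(default_factory=_all_links)
    leader_site: str = "A"   # assumption: leader starts in the preferred site
    backup_site: str = "B"   # backup runs in the other fault domain

    def connected(self, x, y):
        return x == y or frozenset((x, y)) in self.links

    def fail_link(self, x, y):
        self.links.discard(frozenset((x, y)))
        # Assumption: once the leader's site can reach nothing else, it is
        # deemed lost and the leader is restarted where the backup runs.
        others = [s for s in SITES if s != self.leader_site]
        if not any(self.connected(self.leader_site, s) for s in others):
            self.leader_site = self.backup_site

    def witness_in_cluster(self):
        # Rule from the post: both leader and backup must reach the witness.
        return (self.connected(self.leader_site, "W")
                and self.connected(self.backup_site, "W"))


model = StretchedClusterModel()
model.fail_link("A", "W")            # first failure: Site A <-> Witness
print(model.witness_in_cluster())    # False: the leader cannot reach the witness
model.fail_link("A", "B")            # second failure: the ISL
print(model.witness_in_cluster())    # True: leader and backup are both in Site B
```

The point of the sketch is that the witness membership check depends on where the leader and backup run, which is why a second failure can bring the witness back.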

Black Friday Gift: Free copy of the vSphere 6.7 Clustering Deep Dive, thanks Rubrik (ebook)

Duncan Epping · Nov 23, 2018

Many asked us if the ebook would be made available for free again. Today I have the pleasure of announcing that Frank, Niels and I have worked once again with Rubrik and the VMUG organization to make the vSphere 6.7 Clustering Deep Dive book available for free! Yes, that is 0 USD / EURO, or whatever your currency is. The book signing at VMworld was wildly popular, which resulted in the follow-up discussion about the ebook.

Ready to up your vSphere game? Join us at #VMworld booth #P305 for a complimentary copy of @ClusterDeepDive + the chance to meet authors @DuncanYB @FrankDenneman @NHagoort! More info: https://t.co/0DQ7nI1wzX pic.twitter.com/7nIGEvjdBF

— Rubrik (@rubrikInc) November 2, 2018

You want a copy? All that we expect you to do is register on Rubrik’s website using your own email address. Anyway, register and start your download engines, pick up a fresh copy of the vSphere Clustering Deep Dive here!

Books linked, buy paper Clustering Deep Dive get ebook for 2.95!

Duncan Epping · Oct 1, 2018

We just managed to link the paper and electronic versions of the Clustering Deep Dive. This means that if you buy the paper book today, you can get the e-book at a discount. This was something a lot of you have asked for, so we pushed it through. Unfortunately, it did mean we had to re-upload the book to a different back-end system and the “history” is lost, so those who already bought the paper version of the book can’t get the same deal. If you are interested in getting both versions, go here. Or click below on the book, or one of the other books I recommend reading 🙂

VMworld Video: vSphere 6.7 Clustering Deep Dive

Duncan Epping · Sep 3, 2018

As all videos are posted for VMworld (and nicely listed by William), I figured I would share the session Frank Denneman and I presented. It ended up in the Top 10 Sessions on Monday, which is always a great honor. We had a lot of positive feedback and comments, thanks for that! Most importantly, it was a lot of fun again to be up on stage at VMworld talking about this content after 6 years of absence or so. For those who missed it, watch it here:

I also very much enjoyed the book signing session at the Rubrik booth with Niels and Frank. I believe Rubrik gave away around 1000 copies of the book. Hoping we can repeat this huge success in EMEA. But more on that later. If you haven’t picked up the book yet and won’t be at VMworld Europe, consider picking it up through Amazon; the e-book is only 14.95 USD.

Change in Permanent Device Loss (PDL) behavior for vSphere 5.1 and up?

Duncan Epping · Aug 1, 2013

Yesterday someone asked me a question on Twitter about a whitepaper by EMC on stretched clusters and Permanent Device Loss (PDL) behavior. For those who don’t know what a PDL is, make sure to read this article first. This EMC whitepaper states the following on page 40:

Note:

In a full WAN partition that includes cross-connect, VPLEX can only send SCSI sense code (2/4/3+5) across 50% of the paths since the cross-connected paths are effectively dead. When using ESXi version 5.1 and above, ESXi servers at the non-preferred site will declare PDL and kill VM’s causing them to restart elsewhere (assuming advanced settings are in place); however ESXi 5.0 update 1 and below will only declare APD (even though VPLEX is sending sense code 2/4/3+5). This will result in a VM zombie state. Please see the section Path loss handling semantics (PDL and APD)

Now as far as I understood, and I tested this with 5.0 U1, the VMs would indeed not be killed when half of the paths were declared APD and the other half PDL. But I guess something has changed with vSphere 5.1. I knew about one thing that has changed which isn’t clearly documented, so I figured I would do some digging and write a short article on this topic. So here are the changes in behavior:

Virtual Machine using multiple Datastores:

vSphere 5.0 U1 and lower: When a Virtual Machine’s files are spread across multiple Datastores, it might not be restarted when a Permanent Device Loss scenario occurs.
vSphere 5.1 and higher: When a Virtual Machine’s files are spread across multiple Datastores and a Permanent Device Loss scenario occurs, vSphere HA will restart the virtual machine, taking the availability of those datastores on the various hosts in your cluster into account.
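
As a rough illustration of what "taking the availability of those datastores into account" could mean, the sketch below filters hosts down to those that can still access every datastore the virtual machine uses. This is not vSphere HA's actual placement logic; the function name, host names and datastore names are hypothetical.

```python
def restart_candidates(vm_datastores, host_datastores):
    """Return the hosts that can access every datastore the VM needs.

    vm_datastores:   iterable of datastore names used by the VM
    host_datastores: dict mapping host name -> datastores accessible on it
    """
    needed = set(vm_datastores)
    return sorted(
        host for host, accessible in host_datastores.items()
        if needed <= set(accessible)
    )


# Hypothetical cluster: only esx02 still sees both datastores after a PDL.
hosts = {
    "esx01": {"ds1"},
    "esx02": {"ds1", "ds2"},
    "esx03": {"ds2"},
}
print(restart_candidates(["ds1", "ds2"], hosts))  # ['esx02']
```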

Half of the paths in APD state:

vSphere 5.0 U1 and lower: When a datastore on which your virtual machine resides is not 100% declared to be in a PDL state (assume half of the paths are in APD), the virtual machine will not be killed and restarted.
vSphere 5.1 and higher: When a datastore on which your virtual machine resides is not 100% declared to be in a PDL state (assume half of the paths are in APD), the virtual machine will be killed and restarted. This is a huge change compared to 5.0 U1 and lower.
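
The difference between the two behaviors can be written down as a small decision function. This is a sketch of the behavior as I describe it above, not ESXi's actual logic. The function name and path state labels are made up for illustration, and the rule that 5.1 needs at least one PDL path and no working path is my assumption. It also assumes the advanced settings mentioned in the EMC note are in place.

```python
def vm_killed_on_path_loss(vsphere_version, path_states):
    """Decide whether the VM is killed (so it can be restarted elsewhere).

    vsphere_version: (major, minor) tuple, e.g. (5, 0) for 5.0 U1
    path_states:     list of "ACTIVE", "APD" or "PDL", one entry per path
    """
    if not path_states or "ACTIVE" in path_states:
        return False  # at least one working path, nothing to do
    pdl_paths = path_states.count("PDL")
    if pdl_paths == 0:
        return False  # pure APD: no PDL is declared in this model
    if vsphere_version >= (5, 1):
        # Assumption: a partial PDL (rest of the paths in APD) is enough.
        return True
    # 5.0 U1 and lower: only when every path is declared PDL.
    return pdl_paths == len(path_states)


half_and_half = ["PDL", "PDL", "APD", "APD"]
print(vm_killed_on_path_loss((5, 0), half_and_half))  # False
print(vm_killed_on_path_loss((5, 1), half_and_half))  # True
```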

These are the changes in behavior I know about for vSphere 5.1. I have asked engineering to confirm these changes for vSphere Metro Storage Cluster environments. When I have received an answer, I will update this blog.