Yellow Bricks

by Duncan Epping

vSAN Stretched Cluster failure matrix

Duncan Epping · May 30, 2023

Over the last couple of weeks I was involved in an internal discussion around the different vSAN stretched cluster failure scenarios. I wrote a lengthy email about how vSAN and HA would respond in certain scenarios. I have documented many of these on my blog over the years, but never published them as a whole.

In some of the scenarios below I discuss a “partition”. A partition is a scenario where, for one of the locations, both the L3 connection to the witness and the inter-site / inter-switch link (ISL) to the other site are down. So in the diagram above, for instance, if I say that Site B is partitioned, it means that Site A can still communicate with the witness, but Site B can communicate with neither the witness nor Site A.
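
To make the definition concrete, here is a minimal Python sketch. The `Links` class and the `partitioned_sites` function are hypothetical names introduced only for illustration; they are not part of vSAN or any VMware tooling. The sketch simply encodes the rule above: a site counts as partitioned when both the ISL and that site's own connection to the witness are down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Links:
    """Hypothetical link-state model; True means the link is up."""
    isl: bool        # inter-site link between Site A and Site B
    a_witness: bool  # L3 connection from Site A to the witness
    b_witness: bool  # L3 connection from Site B to the witness


def partitioned_sites(links: Links) -> set:
    """Return the sites that are partitioned as defined in this post:
    the ISL is down AND the site's own witness connection is down."""
    result = set()
    if not links.isl and not links.a_witness:
        result.add("Site A")
    if not links.isl and not links.b_witness:
        result.add("Site B")
    return result


# Site B has lost the ISL and the witness; Site A still reaches the witness.
site_b_partitioned = partitioned_sites(
    Links(isl=False, a_witness=True, b_witness=False)
)

# Only the ISL is down: neither site is partitioned in this sense.
isl_only = partitioned_sites(Links(isl=False, a_witness=True, b_witness=True))
```

Note that an ISL-only failure is deliberately not classified as a partition here, which matches how the table treats "ISL down" as its own failure type.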

The following applies to all of the scenarios below: Site A is the preferred location and Site B is the secondary location. In the table, the first two columns refer to the policy settings for the VM, as shown in the screenshot below. The third column refers to the location where the VM runs from a compute perspective. The fourth describes the type of failure, and the fifth and sixth columns describe the behavior witnessed.
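
If you want to query the matrix programmatically, the six columns map naturally onto a small record type. The sketch below is only an illustration: the `Scenario` type and the `lookup` helper are hypothetical names, and the two sample rows are copied from the Site Mirroring section of the table.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    site_disaster_tolerance: str  # column 1, policy setting
    failures_to_tolerate: str     # column 2, policy setting
    vm_location: str              # column 3, where the VM runs (compute)
    failure: str                  # column 4, type of failure
    vsan_behavior: str            # column 5
    ha_behavior: str              # column 6


# Two rows taken from the table; the full matrix has close to 30.
SCENARIOS = [
    Scenario("Site Mirroring", "No data redundancy / RAID-1/5/6", "Site A",
             "Full failure Site A",
             "Objects are inaccessible in Site A as full site failed",
             "VM restarted in Site B"),
    Scenario("Site Mirroring", "No data redundancy / RAID-1/5/6", "Site B",
             "Full failure Site A",
             "Objects are inaccessible in Site A as full site failed",
             "VM does not need to be restarted as it resides in Site B"),
]


def lookup(tolerance, vm_location, failure):
    """Return every scenario matching the policy, VM location and failure."""
    return [
        s for s in SCENARIOS
        if s.site_disaster_tolerance == tolerance
        and s.vm_location == vm_location
        and s.failure == failure
    ]


matches = lookup("Site Mirroring", "Site B", "Full failure Site A")
```

The same failure produces a different HA outcome depending on where the VM runs, which is why VM location is a lookup key and not just descriptive metadata.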

Time to list the various scenarios. No, the list doesn’t include every failure that could occur, but it should cover most scenarios that are important for a stretched cluster configuration. Do note that the behavior described below will only be witnessed when the best practices, as documented here and here, are followed. Also note that the table is long: there are close to 30 scenarios described! If you have any questions, or feel a failure scenario is missing, please leave a comment.

| Site Disaster Tolerance | Failures to Tolerate | VM Location | Failure | vSAN behavior | HA behavior |
|---|---|---|---|---|---|
| None Preferred | No data redundancy | Site A or B | Host failure Site A | Objects are inaccessible if failed host contained one or more components of objects | VM cannot be restarted as object is inaccessible |
| None Preferred | RAID-1/5/6 | Site A or B | Host failure Site A | Objects are accessible as there's site local resiliency | VM does not need to be restarted, unless VM was running on failed host |
| None Preferred | No data redundancy / RAID-1/5/6 | Site A | Full failure Site A | Objects are inaccessible as full site failed | VM cannot be restarted in Site B, as all objects reside in Site A |
| None Preferred | No data redundancy / RAID-1/5/6 | Site B | Full failure Site B | Objects are accessible, as only Site A contains objects | VM can be restarted in Site A, as that is where all objects reside |
| None Preferred | No data redundancy / RAID-1/5/6 | Site A | Partition Site A | Objects are accessible as all objects reside in Site A | VM does not need to be restarted |
| None Preferred | No data redundancy / RAID-1/5/6 | Site B | Partition Site B | Objects are accessible in Site A, objects are not accessible in Site B as network is down | VM is restarted in Site A, and killed by vSAN in Site B |
| None Secondary | No data redundancy / RAID-1/5/6 | Site B | Partition Site B | Objects are accessible in Site B | VM resides in Site B, does not need to be restarted |
| None Preferred | No data redundancy / RAID-1/5/6 | Site A | Witness Host Failure | No impact, witness host is not used as data is not replicated | No impact |
| None Secondary | No data redundancy / RAID-1/5/6 | Site B | Witness Host Failure | No impact, witness host is not used as data is not replicated | No impact |
| Site Mirroring | No data redundancy | Site A or B | Host failure Site A or B | Components on failed hosts inaccessible, read and write IO across ISL as no redundancy locally, rebuild across ISL | VM does not need to be restarted, unless VM was running on failed host |
| Site Mirroring | RAID-1/5/6 | Site A or B | Host failure Site A or B | Components on failed hosts inaccessible, read IO locally due to RAID, rebuild locally | VM does not need to be restarted, unless VM was running on failed host |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Full failure Site A | Objects are inaccessible in Site A as full site failed | VM restarted in Site B |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Partition Site A | Objects are inaccessible in Site A as full site is partitioned and quorum is lost | VM restarted in Site B |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Witness Host Failure | Witness object inaccessible, VM remains accessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site B | Full failure Site A | Objects are inaccessible in Site A as full site failed | VM does not need to be restarted as it resides in Site B |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site B | Partition Site A | Objects are inaccessible in Site A as full site is partitioned and quorum is lost | VM does not need to be restarted as it resides in Site B |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site B | Witness Host Failure | Witness object inaccessible, VM remains accessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Network failure between Site A and B (ISL down) | Site A binds with witness, objects in Site B become inaccessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site B | Network failure between Site A and B (ISL down) | Site A binds with witness, objects in Site B become inaccessible | VM restarted in Site A |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A or Site B | Network failure between Witness and Site A/B | Witness object inaccessible, VM remains accessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Full failure Site A, and simultaneous Witness Host Failure | Objects are inaccessible in Site A and Site B due to quorum being lost | VM cannot be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Full failure Site A, followed by Witness Host Failure a few minutes later | Pre vSAN 7.0 U3: Objects are inaccessible in Site A and Site B due to quorum being lost | VM cannot be restarted |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site A | Full failure Site A, followed by Witness Host Failure a few minutes later | Post vSAN 7.0 U3: Objects are inaccessible in Site A, but accessible in Site B as votes have been recounted | VM restarted in Site B |
| Site Mirroring | No data redundancy / RAID-1/5/6 | Site B | Full failure Site B, followed by Witness Host Failure a few minutes later | Post vSAN 7.0 U3: Objects are inaccessible in Site B, but accessible in Site A as votes have been recounted | VM restarted in Site A |
| Site Mirroring | No data redundancy | Site A | Full failure Site A, and simultaneous host failure in Site B | Objects are inaccessible in Site A, if components reside on failed host then object is inaccessible in Site B | VM cannot be restarted |
| Site Mirroring | No data redundancy | Site A | Full failure Site A, and simultaneous host failure in Site B | Objects are inaccessible in Site A, if components do not reside on failed host then object is accessible in Site B | VM restarted in Site B |
| Site Mirroring | RAID-1/5/6 | Site A | Full failure Site A, and simultaneous host failure in Site B | Objects are inaccessible in Site A, accessible in Site B as there's site local resiliency | VM restarted in Site B |
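
Several of the Site Mirroring outcomes above come down to quorum. The Python sketch below is a deliberately simplified model, not how vSAN is implemented: it assumes one vote each for Site A, Site B and the witness, whereas real vSAN assigns votes per component. The function names and the recount numbers are hypothetical and chosen only so that the surviving site ends up with a majority on its own, mirroring the "votes have been recounted" rows.

```python
def accessible_in(site, reachable, votes):
    """True when `site` is up and the fault domains it can still reach
    (itself included) hold a strict majority of all votes."""
    if site not in reachable:
        return False
    total = sum(votes.values())
    held = sum(votes[fd] for fd in reachable)
    return 2 * held > total


def recount_after_site_failure(votes, survivor):
    """Hypothetical recount: give the surviving site one more vote than
    all other fault domains combined, so it keeps quorum by itself."""
    others = sum(v for fd, v in votes.items() if fd != survivor)
    recounted = dict(votes)
    recounted[survivor] = others + 1
    return recounted


# Assumption: one vote per fault domain.
VOTES = {"Site A": 1, "Site B": 1, "Witness": 1}

# Full failure Site A: Site B plus the witness hold 2 of 3 votes.
full_failure_a = accessible_in("Site B", {"Site B", "Witness"}, VOTES)

# ISL down: Site A binds with the witness, Site B is left on its own.
isl_down_site_a = accessible_in("Site A", {"Site A", "Witness"}, VOTES)
isl_down_site_b = accessible_in("Site B", {"Site B"}, VOTES)

# Site A and the witness fail simultaneously: Site B holds 1 of 3 votes.
simultaneous = accessible_in("Site B", {"Site B"}, VOTES)

# Site A fails, votes are recounted, then the witness fails a few minutes later.
recounted = recount_after_site_failure(VOTES, "Site B")
witness_fails_later = accessible_in("Site B", {"Site B"}, recounted)
```

Under these assumptions the model reproduces the table: Site B stays accessible after a full failure of Site A, only Site A stays accessible when the ISL is down, a simultaneous site and witness failure loses quorum, and a witness failure that follows a recount does not.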
