• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar

Yellow Bricks

by Duncan Epping

  • Home
  • ESXTOP
  • Stickers/Shirts
  • Privacy Policy
  • About
  • Show Search
Hide Search

Duncan Epping

vSAN Stretched Cluster failure matrix

Duncan Epping · May 30, 2023 · 1 Comment

The last couple of weeks I was involved internally in a discussion around the different vSAN stretched cluster failure scenarios. I wrote a lengthy email about how vSAN and HA would respond in certain scenarios. I have documented many of these over the years on my blog already, but never really published them as a whole.

In some of the scenarios below, I discuss a “partition”, a partition is a scenario where both the L3 connection to the witness is down and the inter site / inter switch link to the other site for one of the locations. So in the diagram above for instance, if I say that Site B is partitioned then it means that Site A can still communicate with the witness, but Site B cannot communicate with the Witness and cannot communicate with Site A either.

For all of the below scenarios the following applies, Site A is the preferred location and Site B is the secondary location. When it comes to the table, the first two columns refer to the policy setting for the VM as shown in the screenshot below. The third column refers to the location where the VM runs from a compute perspective. The fourth discusses the type of failure, and the fifth and sixth columns discuss the behavior witnessed.

Time to list the various scenarios, and no, it doesn’t include all failures that could occur but should discuss most scenarios which are important for a stretched cluster configuration. Do note, the below-discussed behavior will only be witnessed when the best practices, as documented here and here, are followed. Also note that the table has multiple pages, there are close to 30 scenarios described! If there are any questions feel free to leave a comment, if you feel a failure scenario is missing, also please leave a comment.

Site Disaster ToleranceFailures to TolerateVM LocationFailurevSAN behaviorHA behavior
None PreferredNo data redundancySite A or BHost failure Site AObjects are inaccessible if failed host contained one or more components of objectsVM cannot be restarted as object is inaccessible
None PreferredRAID-1/5/6Site A or BHost failure Site AObjects are accessible as there's site local resiliencyVM does not need to be restarted, unless VM was running on failed host
None PreferredNo data redundancy / RAID-1/5/6Site AFull failure Site AObjects are inaccessible as full site failedVM cannot be restarted in Site B, as all objects reside in Site A
None PreferredNo data redundancy / RAID-1/5/6Site BFull failure Site BObjects are accessible, as only Site A contains objectsVM can be restarted in Site A, as that is where all objects reside
None PreferredNo data redundancy / RAID-1/5/6Site APartition Site AObjects are accessible as all objects reside in Site AVM does not need to be restarted
None PreferredNo data redundancy / RAID-1/5/6Site BPartition Site BObjects are accessible in Site A, objects are not accessible in Site B as network is downVM is restarted in Site A, and killed by vSAN in Site B
None SecondaryNo data redundancy / RAID-1/5/6Site BPartition Site BObjects are accessible in Site BVM resides in Site B, does not need to be restarted
None PreferredNo data redundancy / RAID-1/5/6Site AWitness Host FailureNo impact, witness host is not used as data is not replicatedNo impact
None SecondaryNo data redundancy / RAID-1/5/6Site BWitness Host FailureNo impact, witness host is not used as data is not replicatedNo impact
Site MirroringNo data redundancySite A or BHost failure Site A or BComponents on failed hosts inaccessible, read and write IO across ISL as no redundancy locally, rebuild across ISLVM does not need to be restarted, unless VM was running on failed host
Site MirroringRAID-1/5/6Site A or BHost failure Site A or BComponents on failed hosts inaccessible, read IO locally due to RAID, rebuild locallyVM does not need to be restarted, unless VM was running on failed host
Site MirroringNo data redundancy / RAID-1/5/6Site AFull failure Site AObjects are inaccessible in Site A as full site failedVM restarted in Site B
Site MirroringNo data redundancy / RAID-1/5/6Site APartition Site AObjects are inaccessible in Site A as full site is partitioned and quorum is lostVM restarted in Site B
Site MirroringNo data redundancy / RAID-1/5/6Site AWitness Host FailureWitness object inaccessible, VM remains accessibleVM does not need to be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site BFull failure Site AObjects are inaccessible in Site A as full site failedVM does not need to be restarted as it resides in Site B
Site MirroringNo data redundancy / RAID-1/5/6Site BPartition Site AObjects are inaccessible in Site A as full site is partitioned and quorum is lostVM does not need to be restarted as it resides in Site B
Site MirroringNo data redundancy / RAID-1/5/6Site BWitness Host FailureWitness object inaccessible, VM remains accessibleVM does not need to be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site ANetwork failure between Site A and B (ISL down)Site A binds with witness, objects in Site B becomes inaccessibleVM does not need to be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site BNetwork failure between Site A and B (ISL down)Site A binds with witness, objects in Site B becomes inaccessibleVM restarted in Site A
Site MirroringNo data redundancy / RAID-1/5/6Site A or Site BNetwork failure between Witness and Site A/BWitness object inaccessible, VM remains accessibleVM does not need to be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site AFull failure Site A, and simultaneous Witness Host FailureObjects are inaccessible in Site A and Site B due to quorum being lostVM cannot be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site AFull failure Site A, followed by Witness Host Failure a few minutes laterPre vSAN 7.0 U3: Objects are inaccessible in Site A and Site B due to quorum being lostVM cannot be restarted
Site MirroringNo data redundancy / RAID-1/5/6Site AFull failure Site A, followed by Witness Host Failure a few minutes laterPost vSAN 7.0 U3: Objects are inaccessible in Site A, but accessible in Site B as votes have been recountedVM restarted in Site B
Site MirroringNo data redundancy / RAID-1/5/6Site BFull failure Site B, followed by Witness Host Failure a few minutes laterPost vSAN 7.0 U3: Objects are inaccessible in Site B, but accessible in Site A as votes have been recountedVM restarted in Site A
Site MirroringNo data redundancySite AFull failure Site A, and simultaneous host failure in Site BObjects are inaccessible in Site A, if components reside on failed host then object is inaccessible in Site BVM cannot be restarted
Site MirroringNo data redundancySite AFull failure Site A, and simultaneous host failure in Site BObjects are inaccessible in Site A, if components do not reside on failed host then object is accessible in Site BVM restarted in Site B
Site MirroringRAID-1/5/6Site AFull failure Site A, and simultaneous host failure in Site BObjects are inaccessible in Site A, accessible in Site B as there's site local resiliencyVM restarted in Site B

New book: VMware vSAN 8.0 U1 Express Storage Architecture Deep Dive!

Duncan Epping · Apr 27, 2023 · 9 Comments

We already gave some hints on twitter, and during an episode of the Unexplored Territory podcast, but here it finally is… The new book, the VMware vSAN 8.0 U1 Express Storage Architecture Deep Dive! It has been a year since we released the vSAN 7.0 U3 Deep Dive book, and with this brand new vSAN architecture being introduced in vSAN 8.0 we figured it was time to do a full overhaul of the book as well. Mind you, this new book purely deals with the Express Storage Architecture, aka vSAN ESA. This also means that some of the features which are not supported by ESA are not discussed in this book, for that you will need to buy the vSAN 7.0 U3 Deep Dive book, which covers OSA. Another big change is that we brought in a third author, we asked our good friend Pete Koehler to contribute to the book. Pete had done reviews of previous books, and considering the amount of material he produced for VMware Tech Marketing for vSAN (and ESA specifically) it made a lot of sense to bring him in!

VMware’s vSAN has rapidly proven itself in environments ranging from hospitals to oil rigs to e-commerce platforms and is the market leader in the hyperconverged space. Along the way, the world of IT has rapidly changed, not just from a software point of view, but also from a hardware perspective. With vSAN 8.0 VMware brought a new architecture to market called vSAN Express Storage Architecture (ESA). This architecture is highly optimized for today’s world of datacenter resources, be it CPU, memory, networking, or NVMe based flash storage.

The authors of the vSAN Deep Dive have thoroughly updated their definitive guide to this transformative technology. Writing for vSphere administrators, architects, and consultants, Cormac Hogan, Duncan Epping , and Pete Koehler explain what vSAN ESA is, why the architecture has changed, what it now offers, and how to gain maximum value from it. The book offers expert insight into preparation, installation, configuration, policies, provisioning, clusters, architecture, and more. You’ll also find practical guidance for using all data services, stretched clusters, two-node configurations, and cloud-native storage services.

Although we pressed publish on Tuesday, sometimes it takes a while before the book is available in all Amazon stores, but it should just trickle down in the upcoming 24-48 hours. The book is priced at 9.99 USD for the ebook and 29.99 USD for a paper copy, and is sold through Amazon only. Get it while it is hot, and we would appreciate it if you would use our referral links and leave a review when you finish it. Thanks for the support, and we hope you will enjoy it!

  • paper – 29.99 USD
  • ebook – 9.99 USD

Of course, we also have the links to other major Amazon stores:

  • United Kingdom – ebook – paper
  • Germany – ebook – paper
  • Netherlands – ebook – paper
  • Canada – ebook – paper
  • France – ebook – paper
  • Spain – ebook – paper
  • India – ebook
  • Japan – ebook – paper
  • Italy – ebook – paper
  • Mexico – ebook
  • Australia – ebook – paper
  • Brazil – ebook
  • Or just do a search in your local amazon store!

VMUG Advantage Homelab Group Buy Discount offer 2023! (Also for renewals!)

Duncan Epping · Apr 19, 2023 · Leave a Comment

It is that time of the year again for many, time to renew your VMUG subscription. The minimum discount you will get is 12% and this can go up to 15% when the number of participant goes above 300, which drops the price down to 170 USD. What do you get when you sign up and buy a 12-month subscription?

  • 365-day Evaluation Licenses
    • Including vSphere 8, vSAN 8, Workstation 17 Pro, Fusion 13 Pro, NSX, vRealize, Horizon, and more!
  • 20-35% discount on training and certification
  • Access to “test drive“
  • Advantage members receive a $ 100 USD VMware Explore discount (not stackable)

The VMUG Advantage Program comes at a cost of 200 USD. Last year the discount was 15%, which means the price ended up being 170 USD for a full year. If you have just one training course planned per year, the VMUG Advantage Program will have already paid for itself (20% discount on a 3000+ USD training course). Yes, I have been talking about USD so far, but of course, this offer is available to all our community members globally (Europe, APJ, Africa, Middle East, etc). Now, again, the discount percentage you get will depend on the number of people signing up for this year’s promotion, but even if only 1 person signs up (you) you will immediately get a 12% discount. The ranges look as follows:

Quantity Discount Cost
1-199 12% $176
200-299 14% $172
300+ 15% $170

If more than 1000 people sign up, VMUG HQ will also do a raffle and give away some cool VMUG Advantage “swag”. Can’t wait? Sign up for the discount code here, and join the program! Note, the survey is open for 2 weeks, so from the 19th of April 2023 until the 3rd of May, after the survey closes the discount code will be distributed to all those who signed up.

UPDATE: The goal has been reached, and you can get a 15% discount when using the code: ADV15OFF. Note: This promotion is only available until May 3rd, 2023!

RE: Re-Imagining Ransomware Protection with VMware Ransomware Recovery

Duncan Epping · Apr 13, 2023 · 4 Comments

Last week a blog post was published on VMware’s Virtual Blocks blog on the topic of Ransomware Recovery. Some of the numbers shared were astonishing and hard to contextualize even. Global damages caused by ransomware for instance are estimated to exceed 42 billion dollars in 2024, and this is expected to be doubling every year. Also, 66% of all enterprises were hit by ransomware, of which 96% did not regain full access to their data.

Now, it explicitly mentions “enterprises”, but this does not mean that only enterprise organizations are prone to ransomware attacks. Ransomware attacks do not discriminate, every company, non-profit, and even individuals are at risk if you ask me. As a smart person once said, data is the new oil, and it seems that everyone is drilling for it, including trespassers who don’t own the land! Of course, depending on the type of organization, solutions and services are available to mitigate the risks of losing access to your company’s most valuable asset, data.

VMware, and many other vendors, have various solutions (and services) to protect your data center, your workloads, and essentially your data. But what do you do if you are breached? How do you recover? How fast can you recover, and how fast do you need to recover? How far back do you need to go, and are you allowed to go? Some of you may wonder why I ask these questions, well that has everything to do with the numbers shared at the start of this blog. Unfortunately, today, when organizations are breached malicious code is often only detected after a significant amount of time. Giving the attacker time to collect information about the environment, spread itself throughout the environment, activate the attack, and ultimately request the ransom.

This is when you, the administrator, the consultant, and the cloud admin, will get those questions. How fast can you recover? How far back do we need to go? Where do we recover to? And what about your data? All fair questions, but these shouldn’t be asked after an attack has occurred and ransom is demanded. These are questions we all need to ask constantly, and we should be aligning our Ransomware Recovery strategy with the answers to those questions.

Now, it is fair to say that I am probably somewhat biased, but it is also fair to say that I am as Dutch as it gets and I wouldn’t be writing this blog if I did not believe in this service. VMware’s Ransomware Recovery as a Service, which is part of VMware Cloud Disaster Recovery, provides a unique solution in my humble opinion. First, the service provided can just simply start as a cloud storage service to which you replicate your workloads, without needing to run a full (small but still) software-defined datacenter. This is especially useful for those organizations that can afford to take ~3hrs to spin up an SDDC when there’s a need to recover (or test the process). However, it is also possible to have an SDDC ready for recovery at all times, which will reduce the recovery time objective significantly.

Of course, VMware provides the ability to protect multiple environments, many different workloads, and many point-in-time copies (snapshots). But it also enables you to verify your recovery point (snapshot) in a fully isolated environment. What you will appreciate is that the solution will actually not only isolate the workloads, but on top of that also provide you insights at various levels about the probability of the snapshot being infected. First of all, while going through the recovery process, entropy and change rate are shown which provides insights of when potentially the environment was infected. (Or ransomware was activated for that matter.)

But maybe even more important, through the use of NSX and VMware’s Next Generation Anti-Virus software, a recovery point can be safely tried. A quarantined environment is instantiated and the recovery point can be scanned for vulnerabilities and threats, and an analysis of the workloads to be recovered can be provided, as shown below. This simplifies the recovery and validation process immensely, as it removes the need for many of the manual steps usually involved in this process. Of course, as part of the recovery process, the advanced runbook capabilities of VMware Cloud Disaster Recovery are utilized, enabling the recovery of a full data center, or simply a select group of VMs, by running a recovery plan. This recovery plan includes the order in which workloads need to be powered on and restored, but can also include IP customization, DNS registration, and more.

Depending on the outcome of the analysis, you can then determine what to do with the snapshot. Is the data not compromised? Are the workloads not infected? Are there any known vulnerabilities that we would need to mitigate first? If data is compromised, or the environment is infected in any shape or form, you can simply disregard the snapshot and clean the environment. Rinse and repeat until you find that recovery point that is not compromised! If there are known vulnerabilities, and the environment is clean, you can mitigate those and complete the recovery. Ultimately resulting in full access to your company’s most valuable asset, data.

vSAN 8.0 U1 ESA – Auto Policy Management

Duncan Epping · Mar 28, 2023 · 1 Comment

One of the features that is introduced in vSAN 8.0 U1 for ESA is Auto-Policy Management. I personally love this feature, as it will help a lot of customers make the right decision in terms of what the default policy should be on their vSAN Datastore. Now, Pete Koehler wrote a very extensive blog post, and I don’t want to copy his work and simply rewrite it, so I suggest you read his blog for the full details on this brand new feature.

I do realize that some of you are just as lazy as I am, so here’s a short summary of what Auto-Policy Management is. Auto-Policy Management, when enabled, creates a new vSAN VM storage policy based on the capabilities enabled on your cluster and the size of your cluster. After creating the policy, the policy is also assigned to the datastore as the “default policy” so that any VMs which are provisioned without the selection of a policy get this optimized policy assigned. What influences the policy characteristics? Well: size of the cluster, stretched vs normal, host rebuild reserve enabled/disabled. All those factors will determine what kind of policy is created and associated with the datastore. If over time your cluster configuration changes, well then Skyline Health will inform you that changes are required to have an optimal policy again. Wonder what that looks like? Watch the demo below!

  • « Go to Previous Page
  • Go to page 1
  • Go to page 2
  • Go to page 3
  • Go to page 4
  • Go to page 5
  • Interim pages omitted …
  • Go to page 480
  • Go to Next Page »

Primary Sidebar

About the Author

Duncan Epping is a Chief Technologist in the Office of the CTO in the Cloud Infrastructure Business Group (CIBG) at VMware. Besides writing on Yellow-Bricks, Duncan co-authors the vSAN Deep Dive book series and the vSphere Clustering Deep Dive book series. Duncan also co-hosts the Unexplored Territory Podcast.

Follow Me

  • Twitter
  • LinkedIn
  • Spotify
  • YouTube

Recommended Book(s)

Advertisements




Copyright Yellow-Bricks.com © 2023 · Log in