
Yellow Bricks

by Duncan Epping


7.0 U3

New book: VMware vSAN 7.0 U3 Deep Dive

Duncan Epping · May 9, 2022 · Leave a Comment

Yes, we’ve mentioned a few times on Twitter that we were working on it, but today Cormac and I are proud to announce that the VMware vSAN 7.0 U3 Deep Dive is available via Amazon as both an ebook and a paperback! We had the pleasure of working with Pete Koehler again as technical editor, the foreword was written by John Gilmartin (SVP and GM for Cloud Storage and Data), the cover was created by my son (Aaron Epping), and the book is once again fully self-published! We also changed the physical dimensions of the book so we could increase the size of the screenshots; as most of us are middle-aged by now, we feel this really made a huge difference in readability.

VMware’s vSAN has rapidly proven itself in environments ranging from hospitals to oil rigs to e-commerce platforms and is the top player in the hyperconverged space. Along the way, it has matured to offer unsurpassed features for data integrity, availability, space efficiency, stretched clustering, and cloud-native storage services. vSAN 7.0 U3 has radically simplified IT operations and supports the transition to hyperconverged infrastructures (HCI). The authors of the vSAN Deep Dive have thoroughly updated their definitive guide to this transformative technology. Writing for vSphere administrators, architects, and consultants, Cormac Hogan and Duncan Epping explain what vSAN is, how it has evolved, what it now offers, and how to gain maximum value from it. The book offers expert insight into preparation, installation, configuration, policies, provisioning, clusters, architecture, and more. You’ll also find practical guidance for using all data services, stretched clusters, two-node configurations, and cloud-native storage services.

Although we have pressed publish, it sometimes takes a while before the book is available in all Amazon stores; it should trickle in over the upcoming 24-48 hours. The book is priced at 9.99 USD (ebook) and 29.99 USD (paperback) and is sold through Amazon only. Get it while it is hot! We would appreciate it if you would use our referral links and leave a review when you finish it. Thanks, and we hope you will enjoy it!

  • Paper book – 29.99 USD
  • Ebook – 9.99 USD

Of course, we also have the links to other major Amazon stores:

  • United Kingdom – Kindle – Paper
  • Germany – Kindle – Paper
  • Netherlands – Kindle – Paper
  • Canada – Kindle – Paper
  • France – Kindle – Paper
  • Spain – Kindle – Paper
  • India – Kindle
  • Japan – Kindle – Paper
  • Italy – Kindle – Paper
  • Mexico – Kindle
  • Australia – Kindle – Paper
  • Or just do a search!

Stretched cluster witness failure resilience in vSAN 7.0

Duncan Epping · Mar 17, 2022 · Leave a Comment

Cormac and I have been busy the past couple of weeks updating the vSAN Deep Dive to 7.0 U3. Yes, there is a lot to update and add, but we are actually going through it at a surprisingly rapid pace. I guess it helps that we had already written dozens of blog posts on the various topics we need to update or add. One of those topics is “witness failure resilience” which was introduced in vSAN 7.0 U3. I have discussed it before on this blog (here and here) but I wanted to share some of the findings with you folks as well before the book is published. (No, I do not know when the book will be available on Amazon just yet!)

In the scenario below, we failed the secondary site of our stretched cluster completely. We can examine the impact of this failure through RVC on vCenter Server. This will provide us with a better understanding of the situation and how the witness failure resilience mechanism actually works. Note that the below output has been truncated for readability reasons. Let’s take a look at the output of RVC for our VM directly after the failure.

VM R1-R1:
Disk backing:
[vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
RAID_1
RAID_1
Component: 0b013262-c2da-84c5-1eee-005056968de9 , host: 10.202.25.221
votes: 1, usage: 0.1 GB, proxy component: false)
Component: 0b013262-3acf-88c5-a7ff-005056968de9 , host: 10.202.25.201
votes: 1, usage: 0.1 GB, proxy component: false)
RAID_1
Component: 0b013262-a687-8bc5-7d63-005056968de9 , host: 10.202.25.238
votes: 1, usage: 0.1 GB, proxy component: true)
Component: 0b013262-3cef-8dc5-9cc1-005056968de9 , host: 10.202.25.236
votes: 1, usage: 0.1 GB, proxy component: true)
Witness: 0b013262-4aa2-90c5-9504-005056968de9 , host: 10.202.25.231
votes: 3, usage: 0.0 GB, proxy component: false)
Witness: 47123362-c8ae-5aa4-dd53-005056962c93 , host: 10.202.25.214
votes: 1, usage: 0.0 GB, proxy component: false)
Witness: 0b013262-5616-95c5-8b52-005056968de9 , host: 10.202.25.228
votes: 1, usage: 0.0 GB, proxy component: false)

As can be seen, the witness component (on the witness host) holds 3 votes, the components on the failed site (secondary) hold 2 votes in total, and the components on the surviving data site (preferred) hold 2 votes in total. After the full site failure has been detected, the votes are recalculated to ensure that a subsequent witness host failure does not impact the availability of the VMs. The RVC output below shows the result.

VM R1-R1:
Disk backing:
[vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
RAID_1
RAID_1
Component: 0b013262-c2da-84c5-1eee-005056968de9 , host: 10.202.25.221
votes: 3, usage: 0.1 GB, proxy component: false)
Component: 0b013262-3acf-88c5-a7ff-005056968de9 , host: 10.202.25.201
votes: 3, usage: 0.1 GB, proxy component: false)
RAID_1
Component: 0b013262-a687-8bc5-7d63-005056968de9 , host: 10.202.25.238
votes: 1, usage: 0.1 GB, proxy component: false)
Component: 0b013262-3cef-8dc5-9cc1-005056968de9 , host: 10.202.25.236
votes: 1, usage: 0.1 GB, proxy component: false)
Witness: 0b013262-4aa2-90c5-9504-005056968de9 , host: 10.202.25.231
votes: 1, usage: 0.0 GB, proxy component: false)
Witness: 47123362-c8ae-5aa4-dd53-005056962c93 , host: 10.202.25.214
votes: 3, usage: 0.0 GB, proxy component: false)

As can be seen, the votes for the various components have changed: the surviving data site now has 3 votes per component instead of 1, the witness on the witness host went from 3 votes to 1, and on top of that, the witness component stored in the surviving fault domain now also has 3 votes instead of 1. This results in a situation where quorum would not be lost even if the witness host were impacted by a subsequent failure. A very useful enhancement to vSAN 7.0 Update 3 for stretched cluster configurations if you ask me.
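
To make the quorum arithmetic a bit more tangible, below is a minimal Python sketch. This is a toy model, not vSAN code; the vote counts come from the two RVC outputs above, and the assignment of components to sites is my reading of those outputs (the components that were promoted to 3 votes sit in the surviving, preferred fault domain).

# Toy model of the vote layouts shown above (not vSAN's actual implementation).
def has_quorum(components, failed_locations):
    """An object stays accessible while the reachable components hold a strict
    majority of all votes (a simplification of vSAN's real rules)."""
    total = sum(votes for votes, _ in components)
    reachable = sum(votes for votes, loc in components if loc not in failed_locations)
    return reachable > total / 2

# (votes, location) per component/witness, before the vote recalculation
before = [(1, "preferred"), (1, "preferred"),   # data components, surviving site
          (1, "secondary"), (1, "secondary"),   # data components, failed site
          (3, "witness-host"),                  # witness on the witness host
          (1, "preferred"), (1, "secondary")]   # site-local witness components

# after the secondary site failure was detected and the votes were recalculated
after = [(3, "preferred"), (3, "preferred"),
         (1, "secondary"), (1, "secondary"),
         (1, "witness-host"),
         (3, "preferred")]                      # witness component in the surviving site

for name, layout in (("before", before), ("after", after)):
    ok = has_quorum(layout, failed_locations={"secondary", "witness-host"})
    print(f"{name} recalculation, secondary site + witness host down: "
          f"{'still accessible' if ok else 'quorum lost'}")

Running this prints "quorum lost" for the original layout and "still accessible" for the recalculated one, which is exactly the behavior described above.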

vSAN 7.0 U3 enhanced stretched cluster resiliency, what is it?

Duncan Epping · Oct 4, 2021 · 3 Comments

I briefly discussed the enhanced stretched cluster resiliency capability in my vSAN 7.0 U3 overview blog. Of course, immediately questions started popping up. I didn’t want to go too deep in that post as I figured I would do a separate post on the topic sooner or later. What does this functionality add, and in which particular scenario?

In short, this enhancement to stretched clusters prevents downtime for workloads in a particular failure scenario. So the question is: what failure scenario? Let's first take a look at this diagram of a typical stretched vSAN cluster deployment.

If you look at the diagram you see three locations: Datacenter A, Datacenter B, and the Witness. One situation customers have found themselves in is that Datacenter A goes down (unplanned). This of course leads to the VMs from Datacenter A being restarted in Datacenter B. Unfortunately, when things go wrong they sometimes go wrong badly, and in some cases the Witness would fail or disappear next. Why? Bad luck, networking issues, etc. Bad things just happen. If and when this happens, there would only be one location left: Datacenter B.

Now you may think that because Datacenter B typically holds a full RAID set of the VMs, they will remain running, but that is not true. vSAN looks at the quorum of the top layer, so if 2 out of the 3 locations disappear, all impacted objects become inaccessible simply because quorum is lost! Makes sense, right? And we are not just talking about failures: it could also be that Datacenter A has to go offline for maintenance (planned downtime) and at some point the Witness fails for whatever reason. This results in exactly the same situation, inaccessible objects.

Starting with 7.0 U3 this behavior has changed. If Datacenter A fails and a few (let's say 5) minutes later the witness disappears, all replicated objects would still be available! So why is this? Well, in this scenario, when Datacenter A fails, vSAN creates a new vote layout for each of the impacted objects. It basically assumes that the witness can fail next: it gives the components on the witness host 0 votes and, on top of that, gives the components in the active site additional votes so that this second failure can be survived. If the witness then fails, the objects are not rendered inaccessible, as quorum is not lost.

Now, do note that when a failure occurs and Datacenter A is gone, vSAN has to create a new vote layout for each object, and if you have a lot of objects this can take some time. It typically takes a few seconds per object, and it is done object by object, so if you have a lot of VMs (and a VM consists of multiple objects) it will take a while; it could be five minutes. If the witness goes down in that window, any objects that have not yet been processed lose quorum, resulting in downtime for those VMs.

What happens when Datacenter A (and the Witness) return to duty? Well, at that point the votes are restored for the objects across both locations and the witness.

Pretty cool right?!

vSphere 7.0 U3 contains two great vCLS enhancements

Duncan Epping · Sep 28, 2021 · 13 Comments

I have written about vCLS a few times, so I am not going to explain what it is or what it does (detailed blog here). I do want to talk about what is part of vSphere 7.0 U3 specifically, though, as I feel these features are probably what most folks have been waiting for. Starting with vSphere 7.0 U3 it is now possible to configure the following for vCLS VMs:

  1. Preferred Datastores for vCLS VMs
  2. Anti-Affinity for vCLS VMs with specific other VMs

I created a quick demo for those who prefer to learn these things by watching videos; if you don't, skip to the text below. Oh, and before I forget, a bonus enhancement is that the vCLS VMs now have unique names, which was a very common request from customers.

Why would you need the above functionality? Let's begin with the "preferred datastore" feature: it allows you to specify where the vCLS VMs should be provisioned from a storage point of view. This is useful in a scenario where you have a number of datastores that you would prefer to avoid, for example datastores that are replicated, a datastore that is only intended for ISOs and templates, or situations where you prefer to provision on hybrid storage versus flash storage.

So how do you configure this? It is simple: click on your cluster object, then click "Configure" and select "Datastores" under "vSphere Cluster Services". You will now see "VCLS Allowed"; click "ADD" and you will be able to select the datastores to which these vCLS VMs should be provisioned.
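
For those who want to automate this instead of clicking through the UI, the same setting is exposed through the cluster reconfigure call in the vSphere API. The pyVmomi sketch below is only a rough illustration: the cluster, datastore, and credential values are placeholders, and the systemVMsConfig / allowedDatastores property names are my assumption of how the 7.0 U3 API exposes this, so verify them against the API reference for your build before relying on it.

# Rough sketch only: configure the vCLS "allowed datastores" via the vSphere API.
# The systemVMsConfig / allowedDatastores property names are assumptions about the
# 7.0 U3 API surface; cluster, datastore, and credential values are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use valid certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

def find_obj(vimtype, name):
    """Return the first inventory object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

cluster = find_obj(vim.ClusterComputeResource, "Cluster-01")   # placeholder names
datastore = find_obj(vim.Datastore, "vsanDatastore")

spec = vim.cluster.ConfigSpecEx()
spec.systemVMsConfig = vim.cluster.SystemVMsConfigSpec(        # assumed property name
    allowedDatastores=[vim.cluster.DatastoreUpdateSpec(operation="add",
                                                       datastore=datastore)])
task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
print("Reconfigure task:", task.info.key)
Disconnect(si)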

Next up: Anti-Affinity for vCLS. You would use this feature in situations where, for instance, a single workload needs to run on a host all by itself, something like SAP for instance. To achieve this you can use anti-affinity rules, but we are not talking about regular anti-affinity rules. This is the very first time a brand-new mechanism is used on-premises: compute policies. Compute policies have been available to VMware Cloud on AWS customers for a while, but now also appear to be coming to on-prem customers. What does it do? It enables you to create "anti-affinity" rules between vCLS VMs and specific other VMs in your environment by creating Compute Policies and using Tags!

How does this work? Well, you go to "Policies and Profiles" and then click "Compute Policies". Now you can click "ADD" and create a policy. You select "Anti Affinity with vSphere Cluster Services (vCLS) VMs", then select the Tag you created for the VMs that should not run on the same hosts as the vCLS VMs, and click create. The vCLS VM Scheduler will then ensure that the vCLS VMs do not run on the same hosts as the tagged VMs. If there is a conflict, the vCLS Scheduler will move the vCLS VMs away to other hosts within the cluster. Let's reiterate that: the vCLS VMs will be vMotioned to another host in your cluster; the tagged VMs will not be moved!

Hope that helps!

vSAN 7.0 U3 feature overview

Duncan Epping · Sep 28, 2021 · 16 Comments

In this blog post, I want to go over the features which have been released for vSAN as part of 7.0 U3. It is not going to be a deep dive, just a simple overview, as most features speak for themselves! Let's list the features first, and then discuss some of them individually.

  • Cluster Shutdown feature
  • vLCM support for Witness Appliance
  • Skyline Health Correlation
  • IO Trip Analyzer
  • Nested Fault Domains for 2-Node Clusters
  • Enhanced Stretched Cluster durability
  • Access Based Enumeration for SMB shares via vSAN File Services

I guess the Cluster Shutdown feature speaks for itself. It basically enables you to power off all the hosts in a cluster when doing maintenance, even if those hosts contain vCenter Server! If you want to trigger a shutdown, just right-click the cluster object, go to vSAN, select "shutdown cluster", and follow the 2-step wizard. Pretty straightforward. Do note that, besides the agent VMs, you will need to power off the other VMs first. (Yes, I have requested this to be handled by the process as well in the future!)

The Skyline Health Correlation feature is very useful for customers who see multiple alarms being triggered and are not sure what to do. In this scenario, starting with 7.0 U3, vSAN will now understand the correlation between the events, inform you what the most likely issue is, and show you which other health checks it impacts. This should enable you to fix the problem faster than before.

IO Trip Analyzer is also brand new in 7.0 U3. I wrote a separate blog post on the subject and included a demo; I would recommend watching that one. But if you just want the short summary: the IO Trip Analyzer provides an overview of the latency introduced at every layer in the form of a diagram!

Nested Fault Domains for 2-node clusters has been on the "wish list" of some of our customers for a while. It is a very useful feature for customers who want to be able to tolerate multiple failures even in a 2-node configuration. The feature requires at least 3 disk groups in each of the 2 hosts and then enables you to have "RAID-1" across those two hosts as well as RAID-1 within each host (or RAID-5 if you have sufficient disk groups). If a host fails, and then a disk fails in the surviving host, the VMs would still be available. It is basically a feature for customers who don’t need a lot of compute power (3 hosts or more), but do need added resiliency!
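
To make that failure sequence concrete, here is a small toy model in Python. It is a simplification of the real vote and witness rules, purely to show why a host failure followed by a disk-group failure in the surviving host can be tolerated when the nested (host-local) mirror is in place.

# Toy model (not vSAN code) of a nested fault domain layout in a 2-node cluster:
# RAID-1 across the two hosts, plus a RAID-1 (and local witness) across disk
# groups inside each host.
def replica_available(disk_groups_ok):
    """A host-local replica needs a majority of its 3 components
    (2 mirrors + 1 witness), each placed on its own disk group."""
    return sum(disk_groups_ok) >= 2

def object_available(host_a_ok, host_b_ok, witness_ok, host_a_dgs, host_b_dgs):
    """Top level: RAID-1 across the hosts plus the external witness appliance.
    Simplified rule: at least 2 of the 3 top-level parts must be available."""
    a = host_a_ok and replica_available(host_a_dgs)
    b = host_b_ok and replica_available(host_b_dgs)
    return sum([a, b, witness_ok]) >= 2

# Failure sequence from the paragraph above: host A fails, then one disk group
# fails in the surviving host B. The object (and thus the VM) stays available.
print(object_available(host_a_ok=False, host_b_ok=True, witness_ok=True,
                       host_a_dgs=[True, True, True],
                       host_b_dgs=[True, True, False]))   # -> True

Without the nested mirror inside host B, that same disk-group failure would have taken out host B's only copy of the data, and with host A already down the object would have become inaccessible.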

Enhanced Stretched Cluster durability (which also applies to 2-node) is a feature that Cormac and I requested a while back. We requested it because we had heard from a few customers that, unfortunately, they had found themselves in a situation where a datacenter went offline, followed by the witness going down. This would then result in the VMs (only those which were stretched, of course) in the remaining location also becoming unusable, as 2 out of the 3 parts of the RAID tree would be gone. This was even the case when a fully available RAID-1 / RAID-5 / RAID-6 tree remained in the surviving datacenter. This new feature prevents that scenario!

Last, but not least, we now have support for Access Based Enumeration for SMB shares via vSAN File Services. What does this mean? Pre-vSAN 7.0 U3, if a user had access to a file share, the user could see all folders/directories in that share. Starting with 7.0 U3, when looking at the share, only the folders for which you have the appropriate permissions are displayed! (More about ABE here.)

