
Deleting the vCLS VMs using Retreat Mode starting with vSphere 8.0 U2

Duncan Epping · Sep 22, 2023

A while back I posted about “retreat mode” and how to delete the vCLS VMs when needed, including a quick demo. Back then you needed to configure an advanced setting for a cluster if you wanted to delete the VMs for whatever reason. (Usually people would delete and recreate them for troubleshooting purposes.) Starting with vSphere 8.0 U2 you can use the UI to enable retreat mode on a per-cluster level. How do you do this? Well, it is fairly straightforward:

Click on the cluster for which you want to delete the VMs
Click on Configure
Click on “General” under “vSphere Cluster Services”
Click on “EDIT VCLS MODE”
Click on “Retreat Mode” and click “OK”

Now the VMs will be deleted. If you want to recreate the VMs, follow the same procedure, but change “Retreat Mode” to “System Managed”. I tested the process yesterday and recorded a quick demo of it.
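For clusters that are not yet on 8.0 U2, the advanced-setting route mentioned at the start of this post still applies. As a small sketch, the Python helper below builds the vCenter Server advanced setting for a given cluster. It assumes the key format config.vcls.clusters.<cluster domain ID>.enabled, which is not spelled out in this post, and the function name and the example ID domain-c1006 are made up for illustration. Verify the key against the official documentation before relying on it.

```python
import re


def vcls_retreat_setting(cluster_domain_id: str, retreat: bool) -> tuple[str, str]:
    """Return the (key, value) pair for a cluster's vCLS advanced setting.

    Assumption: the key format is config.vcls.clusters.<domain ID>.enabled,
    where "false" puts the cluster in retreat mode and "true" lets the
    vCLS VMs be recreated.
    """
    # The domain ID is the cluster's managed object ID, e.g. "domain-c1006".
    if not re.fullmatch(r"domain-c\d+", cluster_domain_id):
        raise ValueError(f"unexpected cluster domain ID: {cluster_domain_id!r}")
    key = f"config.vcls.clusters.{cluster_domain_id}.enabled"
    value = "false" if retreat else "true"
    return key, value
```

Calling vcls_retreat_setting("domain-c1006", retreat=True) gives the key and value you would enter as an advanced setting on vCenter Server. Passing retreat=False gives the value that brings the VMs back.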

Scalable Snapshots demo with the vSAN 8.0 Express Storage Architecture

Duncan Epping · Sep 5, 2023

Starting with vSAN 8 a brand new architecture was introduced called “Express Storage Architecture”. Over the last year or so a lot of information has been shared about ESA and the benefits of ESA. One of the things which ESA introduces is much-improved snapshot scalability.

With vSAN OSA, and with VMFS, you typically see a performance degradation immediately after you create a snapshot. This is because both VMFS and vSAN OSA still use the redo-log based snapshot mechanism. With vSAN OSA this means that when you create a snapshot a new object is created and writes are redirected to it. It also means that, if you have one or more snapshots, reads will be coming from various files. This mechanism is, unfortunately, not very efficient. Let me borrow a diagram that is part of a post John Nicholson wrote to demonstrate that old logic.

With vSAN 8 ESA the mechanism has changed, and vSAN (or vSphere for that matter) no longer creates an additional object. vSAN ESA handles this at the metadata level. In other words, instead of redirecting writes and traversing files for reads, vSAN now leverages a highly efficient B-Tree structure and pointers to keep track of which block is associated with which snapshot.
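To make the difference a bit more tangible, here is a toy Python model of the two approaches. It is only a sketch and not how vSAN is implemented: the class names are made up, a plain dict stands in for the B-Tree, each snapshot copies the whole pointer map instead of sharing structure, and space reclamation is left out.

```python
class RedoLogDisk:
    """Toy redo-log chain: every snapshot adds a new delta layer."""

    def __init__(self):
        self.layers = [{}]  # base layer first, newest delta last

    def snapshot(self):
        self.layers.append({})  # new writes are redirected to this layer

    def write(self, block, data):
        self.layers[-1][block] = data

    def read(self, block):
        """Return (data, number of layers inspected)."""
        for depth, layer in enumerate(reversed(self.layers), start=1):
            if block in layer:
                return layer[block], depth
        return None, len(self.layers)

    def delete_snapshot(self):
        """Merge the newest delta into its parent; return blocks copied."""
        delta = self.layers.pop()
        self.layers[-1].update(delta)
        return len(delta)


class PointerMapDisk:
    """Toy metadata-based snapshots: a snapshot is a frozen pointer map."""

    def __init__(self):
        self.store = {}      # location -> data
        self.live = {}       # block -> location (the "current" view)
        self.snapshots = []  # frozen copies of the pointer map
        self._next_location = 0

    def snapshot(self):
        self.snapshots.append(dict(self.live))  # copies pointers, not data

    def write(self, block, data):
        location = self._next_location
        self._next_location += 1
        self.store[location] = data
        self.live[block] = location

    def read(self, block):
        """Return (data, number of lookups), always a single lookup."""
        location = self.live.get(block)
        data = self.store[location] if location is not None else None
        return data, 1

    def delete_snapshot(self):
        """Drop the newest snapshot's pointer map; return blocks copied."""
        self.snapshots.pop()
        return 0
```

If you write a block, take three snapshots, and read that block again, the redo-log model has to walk four layers while the pointer model does a single lookup. Deleting a snapshot copies data in the first model and only drops metadata in the second.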

Not only is this more efficient from a capacity perspective, but more importantly it is very efficient from a performance standpoint. I ran half a dozen tests in my lab, and what I saw was a performance impact of less than 2% between a VM without a snapshot and a VM with one or multiple snapshots. I could NOT see a significant difference between the first and the fifth snapshot. I do want to point out that my lab is not officially certified to run vSAN ESA. Nevertheless, I was very impressed with the results.

During the last run, I actually recorded the whole exercise. In this demo, I show the creation of one snapshot while the VM is running a benchmark (HCIBench). During the testing I created not one but various snapshots, and of course I deleted all of them as well. You have probably all experienced extensive stun times during the deletion of a snapshot at some point, and this is where vSAN ESA shines. The stun times have been reduced by a factor of 100, and that is something I am sure each of you will appreciate. Why have they been reduced so drastically? Well, simply because we no longer have to copy data from one vSAN object to another. This makes a huge difference, not just for stun times, but also for performance in general (latency, IOPS, throughput). If you are interested, have a look at the demo!

Unexplored Territory #049 and #050, all about multi-cloud and cloud native workloads!

Duncan Epping · Jul 12, 2023

I was working on my VMware Explore presentations, so I forgot to post #049. I figured I would post both at the same time for those who hadn’t seen these yet. In episode 049 we had two guests for the very first time, Gerrit Lehr and Andrea Siviero. Andrea and Gerrit talked us through the Multi-Cloud Adoption Framework and explained why customers are interested in this service and how it helps them meet their business goals. Listen to the full episode via Spotify (bit.ly/3Ny1EXE), Apple (bit.ly/449s2xA), or via the embedded player below.

Episode 050 focusses on Self-Managed Tanzu Mission Control, and we had Corey Dinkens as our guest. Corey discussed what Tanzu Mission Control is about, what the use case is, how customers are consuming it today, and why a self-managed solution makes sense for some customers compared to the SaaS offering. Interesting stuff if you ask me. Listen via Spotify (bit.ly/3XHU3dE), Apple (bit.ly/3XLm7g5), or use the embedded player below.

Seeing unexpected error messages during ISL failure with Stretched Cluster for secondary site

Duncan Epping · Jun 22, 2023

I had a question this week from one of our field specialists. He ran into a situation where he saw lots of error messages stating that vSphere HA could not restart a certain workload during an ISL (inter-site link) failure. Let me first explain the scenario, and also explain what vSAN does and doesn’t do. Let’s take the below situation.

Let’s assume Datacenter A is the “preferred site”, and Datacenter B is the “secondary site”. In case the ISL between Datacenter A and Datacenter B fails, the Witness (in a 3rd location) will bind itself automatically with Datacenter A. This means that VMs in Datacenter B will lose access to the vSAN Datastore.

From an HA perspective, Datacenter A will have a primary (previously called master), and so will Datacenter B. Each primary will detect that there are VMs that are not running, and it will try to restart these VMs. This happens on both sides, and of course the restart will fail in the site where access to the vSAN datastore is lost.
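The logic of the last two paragraphs can be sketched in a few lines of Python. This is a deliberately simplified model and not how vSAN or HA is implemented: it assumes one vote per site plus one for the Witness, and the function names are made up for illustration.

```python
SITES = ("Datacenter A", "Datacenter B")


def datastore_access(isl_up: bool, preferred: str = "Datacenter A") -> dict[str, bool]:
    """Return which sites can still access the vSAN datastore.

    Simplification: each site and the Witness hold one vote, and a
    partition needs more than half of the three votes to keep access.
    """
    if isl_up:
        return {site: True for site in SITES}
    total_votes = 3
    access = {}
    for site in SITES:
        # With the ISL down, the Witness binds itself to the preferred site.
        votes = 1 + (1 if site == preferred else 0)
        access[site] = votes > total_votes / 2
    return access


def ha_restart_attempts(access: dict[str, bool]) -> dict[str, str]:
    """Each partition's HA primary attempts a restart; it needs datastore access."""
    return {
        site: "restart succeeds" if has_access else "restart fails"
        for site, has_access in access.items()
    }
```

With the ISL down and Datacenter A as the preferred site, the restart attempt in Datacenter A succeeds and the one in Datacenter B fails, even though both HA primaries try.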

Now here is the important aspect. Depending on where and how vCenter Server is connected to these locations, it may or may not receive information about successful and unsuccessful restarts. I’ve seen situations where vCenter Server could only communicate with the primary in Datacenter B, and this would just lead to unsuccessful failover messages, while in reality all VMs were restarted in Datacenter A. By the way, the UI can give you a hint when you are in that situation. It will show you which host is the primary, and it will also tell you whether there’s a “network isolation” or a “network partition”. In this case, of course, that would be a “network partition”.

Performance Management Object reduced availability on stretched cluster

Duncan Epping · Jun 15, 2023

I created a new lab environment not too long ago and I ran into a situation where the Performance Management Object showed up as Reduced Availability with no Rebuild in vSAN Skyline Health. This happened in my case because I created a Stretched Cluster configuration after I had already formed a cluster, which means that the performance management object was randomly placed across hosts without taking those “failure domains” into account. I completely forgot about it until someone on VMTN reminded me about this. I had two options: fix the existing perf database, or simply disable/enable the perf service so it is recreated.

As I had no data stored in the database, I figured disable/enable was the easiest route. I looked for the option in vSphere 8.0 U1 but could not find it. It seems that the UI option no longer exists for whatever reason. How do I now disable/enable the service? Ruby vSphere Console (RVC) to the rescue!

When you log in to RVC you can simply run the following commands on the cluster object you want to disable/enable the performance service for. It is fairly straightforward, and it fixed the issue within a minute or so:

vsan.perf.stats_object_delete <cluster>
vsan.perf.stats_object_create <cluster>

By the way, I also documented this in the vSAN 8.0 ESA Deep Dive book, which you can buy as a paper copy or ebook on Amazon.