
Yellow Bricks

by Duncan Epping



vSAN IO Trip Analyzer in 8.0 Update 3 enhanced!

Duncan Epping · Sep 25, 2024 · Leave a Comment

Starting with vSAN 8.0 Update 3, the vSAN IO Trip Analyzer has been enhanced! What is this enhancement specifically? Well, the IO Trip Analyzer can now be enabled on multiple VMs at the same time, which is especially great when you are troubleshooting the performance of a service that consists of multiple VMs. This enhancement allows you to monitor multiple VMs simultaneously and inspect at which layer latency occurs for each of the selected VMs. Note, this enhancement works for both vSAN ESA and vSAN OSA!
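For those who want a feel for what a scripted version of this could look like, here is a minimal sketch against the vSphere Automation REST API. The session call (POST /api/session) is the real, documented endpoint; the IO Trip Analyzer endpoint, payload, and the vCenter name are HYPOTHETICAL placeholders, since the feature is driven through the vSphere Client UI in this post.

import requests

VCENTER = "vcenter.example.com"   # assumption: your vCenter FQDN

# Real endpoint: create an API session, basic auth returns a token string.
resp = requests.post(
    f"https://{VCENTER}/api/session",
    auth=("administrator@vsphere.local", "password"),
    verify=False,  # lab only; use proper certs in production
)
token = resp.json()

# HYPOTHETICAL endpoint: illustrates the idea 8.0 U3 adds in the UI, i.e.
# one diagnostics run spanning several VMs at once.
requests.post(
    f"https://{VCENTER}/api/vsan/io-trip-analyzer",   # not a real endpoint
    headers={"vmware-api-session-id": token},
    json={"vms": ["vm-101", "vm-102", "vm-103"], "duration_minutes": 30},
    verify=False,
)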

I created a short demo that shows this capability:

New vCLS architecture with vSphere 8.0 Update 3

Duncan Epping · Sep 24, 2024 · 1 Comment

Some of you may have seen this, others may not have, but as I had a question today around vCLS retreat mode with 8.0 U3, I figured I would quickly write something on the topic. Starting with vSphere 8.0 Update 3 we introduced a new architecture for vCLS, aka vSphere Cluster Services. Pre-vSphere 8.0 Update 3, the vCLS architecture was based on virtual machines running Photon OS. These VMs were there primarily to assist in enabling and disabling DRS. If something was wrong with these VMs, then DRS would also be unable to function normally. In the past many of you have probably experienced situations where you had to kill and delete the vCLS VMs to restore DRS functionality; for that, VMware introduced a feature called “retreat mode”, which basically killed and deleted the VMs for you. There were some other challenges with the vCLS VMs as well, and as a result the team decided to create a new design for vCLS.

Starting with vSphere 8.0 Update 3, vCLS is implemented as what I would call a container runtime, sometimes referred to as a Pod VM or PodCRX. In other words, when you upgrade to vSphere 8.0 Update 3 you will see your current vCLS VMs being deleted, and these new shiny vCLS VMs will pop up. How do you know these VMs are created using a different mechanism? Well, you can simply see that in the UI, as demonstrated below. See the “CRX” mention in the UI?

New vCLS architecture with vSphere 8.0 Update 3

So you may ask yourself, why should I even care? Well, the thing is, you shouldn’t really. The new vCLS architecture uses fewer resources per VM, fewer vCLS VMs are deployed to begin with (two instead of three), and they are more resilient. Also, when a host that has a vCLS VM running is, for instance, placed into maintenance mode, that vCLS instance is deleted and recreated elsewhere. Considering the VMs are stateless and tiny, that is much more efficient than trying to vMotion them. Note, vMotion and Storage vMotion of these new (Embedded, as they are called) vCLS VMs aren’t even supported to begin with.

Normally, vCLS retreat mode shouldn’t be needed anymore, but if you do end up in a situation where you need to clean up these instances, Retreat Mode is still fully supported with 8.0 U3. You can find the Retreat Mode option in the same place as before, on your cluster object under “Configure –> vSphere Cluster Services –> General –> Edit vCLS Mode”. Simply select “Retreat Mode” and the cleanup should happen automatically. When you want the VMs to be recreated, simply go back to the same UI and select “System managed”. This should then lead to the vCLS VMs being recreated.
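For reference, retreat mode has historically also been toggled through a vCenter advanced setting (per VMware KB 80472); the UI option above does the same thing for you. A minimal pyVmomi sketch, assuming you know your cluster’s managed object ID; treat it as illustrative, not as a replacement for the supported UI workflow:

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()  # lab only; use valid certs in prod
si = SmartConnect(host="vcenter.example.com",      # assumption: your VCSA
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)

cluster_moid = "domain-c8"  # assumption: visible in the cluster's URL in the UI

# "false" enables retreat mode (vCLS VMs are cleaned up);
# "true" returns the cluster to the system-managed state.
option = vim.option.OptionValue(
    key=f"config.vcls.clusters.{cluster_moid}.enabled", value="false")
si.content.setting.UpdateOptions(changedValue=[option])
Disconnect(si)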

I hope this helps,

 

vSAN Data Protection, what can you do with it today?

Duncan Epping · Jul 2, 2024 · 13 Comments

Last week I posted a blog about how to get vSAN Data Protection up and running, but I never explained why you may want to do this in the first place. Before we dive into it, it is probably smart to also mention that vSAN Data Protection leverages the new snapshotting capability which is part of the vSAN Express Storage Architecture (vSAN ESA). vSAN ESA has a snapshotting capability that scales infinitely; it literally is 100x better than the snapshotting mechanism that is part of vSAN OSA. The reason for this is simple: with vSAN OSA (and VMFS, for that matter), when you create a snapshot a new object is created and IO is redirected. With vSAN ESA we basically copy the metadata structure and keep using the existing objects for IO, which means we don’t need to traverse multiple files for reads, for instance, and when we delete a snapshot we don’t need to copy data, as it is all about metadata changes / tracking.
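To make that metadata argument concrete, here is a toy sketch. This is my own simplification for illustration, not vSAN’s actual on-disk format: a snapshot copies only the block map, both maps keep pointing at the same shared objects, and deleting a snapshot just drops a map.

# Toy illustration of metadata-based snapshots (NOT vSAN's real structures):
# snapshot creation and deletion touch only metadata, never the data objects.

class Volume:
    def __init__(self):
        self.objects = {}     # object_id -> data (shared, never copied)
        self.block_map = {}   # logical block -> object_id (the live map)
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, block, data):
        obj_id = len(self.objects)      # new object per write (simplified)
        self.objects[obj_id] = data
        self.block_map[block] = obj_id  # live map now points at the new object

    def snapshot(self, name):
        # Copy the metadata only; all objects stay shared with the live volume.
        self.snapshots[name] = dict(self.block_map)

    def delete_snapshot(self, name):
        # No data copy or consolidation needed; just drop the map.
        del self.snapshots[name]

vol = Volume()
vol.write(0, "a")
vol.snapshot("snap-1")          # O(metadata), no IO redirection
vol.write(0, "b")               # live volume diverges; snap-1 still sees "a"
vol.delete_snapshot("snap-1")   # O(metadata) again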

Now, in vSphere / vSAN 8.0 Update 3 we introduce a new capability called vSAN Data Protection. This provides you with the ability to schedule snapshots, create immutable snapshots, clone snapshots, restore snapshots, and so on, all through the vSphere Client interface, of which you can see an example below.

Now, in the above screenshot you see that half of the VMs are protected and the other half are not. Why is that? Well, simply because I decided to only protect half of my VMs when I created my snapshot schedule. This is an “opt-in” mechanism: you create a schedule, and you decide which VMs are part of the schedule and which are not! Let’s take a look at the scheduling mechanism. You can find it at the cluster level under “Configuration -> vSAN -> Data Protection”. When you go there you see the above screen, and you can click on “Protection groups” and then “Create protection group”.

This will then present you with a screen where you can define the name of the protection group, enable “immutability mode”, and select the “Membership” of the protection group. Personally, I like the Dynamic VM Patterns option; it just makes things easy. In this example I used “*de*” as a pattern, which means that all VMs with “de” in the name will be snapshotted whenever the schedule triggers a snapshot.
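To illustrate how such a pattern selects VMs, a quick sketch; the VM names are invented, and I am assuming the pattern behaves like a standard glob, which matches what I saw in testing:

# Glob-style VM pattern matching, as in the Dynamic VM Patterns membership
# option. VM names below are made up for illustration.
from fnmatch import fnmatch

vms = ["demo-01", "desktop-7", "prod-db", "web-01", "vdesk-test"]
pattern = "*de*"

matched = [vm for vm in vms if fnmatch(vm, pattern)]
print(matched)   # ['demo-01', 'desktop-7', 'vdesk-test']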

As mentioned, vSAN Data Protection includes ALL the virtual machines which match the pattern at the moment the snapshot is taken. So if you create a schedule like, for instance, the following, it could take close to an hour before a VM is snapshotted for the very first time. If you feel this is too long, then there’s of course also an option to manually snapshot the VM.

The schedules you create can be really complex (up to 10 schedules); I just created an example of what is possible, but there are many different options and combinations you can create. Note, a VM can be part of three different protection groups at most, and I also want to point out that a protection group is not a consistency group. Also note, you can have at most 200 snapshots per object/VM, so that is also something to take into consideration.
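Those limits make it worth doing the retention math up front. A back-of-the-envelope sketch that checks a set of schedules against the 200-snapshots-per-VM cap; the schedule values are invented for illustration:

# Check snapshot retention against the documented 200-snapshots-per-VM limit.
# Schedule names and retention counts are invented examples.
MAX_SNAPSHOTS_PER_VM = 200

# (description, retained snapshot count) per schedule in one protection group
schedules = [
    ("every 30 minutes, keep 1 day", 48),
    ("hourly, keep 2 days", 48),
    ("daily, keep 30 days", 30),
    ("weekly, keep 12 weeks", 12),
]

total = sum(retained for _, retained in schedules)
print(f"retained snapshots per VM: {total}")          # 138
assert total <= MAX_SNAPSHOTS_PER_VM, "schedule exceeds the per-VM limit"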

Now, when the VMs are snapshotted you get a few extra options: you have “snapshot management” at the individual VM level, and of course you can also manage things at the protection group level. You will see options like “restore” and “clone”, and this is where it gets interesting, as this allows you to go back to a particular point in time if needed from a particular snapshot, and also to clone the VM from a particular snapshot. Why would you clone it? Well, if you want to test something against a specific dataset, for instance, or if you want to do some type of digital forensics and analyze a VM for whatever reason.

One thing that is also good to mention is that with vSAN Data Protection you can also recover VMs which have been deleted from vCenter Server. This is one of those must-have features in my opinion, as this is one of those things that does occasionally happen, unfortunately. Power off VM, delete … oops, wrong one! When it comes to the recovery of a VM, the process is straightforward. You select the snapshot and click “restore VM” or “clone VM”, depending on what you expect as the outcome. Restore means the current instance is powered off, and the snapshot is the restore point when you power it on again. When you do a clone, you simply get a second instance of that VM. Note that when you create a clone, it is a linked clone, so it is extremely fast to instantiate.

The last thing I want to discuss is immutability mode. I want to caution people first: this is exactly what you think it is… it is IMMUTABLE! This means that you cannot delete the snapshots or change the schedule, etc. Let me quote the documentation:

Enable immutability mode on a protection group for additional security. You cannot edit or delete this protection group, change the VM membership, edit or delete snapshots. An immutable snapshot is a read-only copy of data that cannot be modified or deleted, even by an attacker with administrative privileges.

That is why I wanted to caution people: if you create an immutable protection group with hundreds of VMs by accident… Yes, the snapshots will be immutable; you cannot delete the protection group or the snapshots, or change the snapshot schedule. No, not even as an administrator. Do note, you can delete the VM itself if you want…

Anyway, I hope this gave a brief overview of what you can do with vSAN Data Protection at the moment. I’ll play around with it some more, and report back when there’s more to share 🙂

 

Where’s my vSAN Data Protection screen in 8.0 U3?

Duncan Epping · Jun 28, 2024 · Leave a Comment

The first time I deployed vSphere/vSAN 8.0 U3, I immediately looked for the vSAN Data Protection UI. I always get excited about new features and simply want to test them. I mean, who doesn’t like scalable snapshots and a great way of managing snapshot schedules, finally available within the vSphere Client? Of course, I could not find it, but I figured that was because I was on some weird alpha build of the product. Now that the product has shipped, it must be there out of the box, right?

No, it isn’t. You will need to deploy an appliance for this functionality to appear in the UI. The appliance can be found under “Drivers and Tools” under the vSphere Hypervisor download (which is under VMware vSphere); it is called “VMware vSAN Snapshot Service Appliance”. The current version is named “snapservice_appliance-8.0.3.0-24057802_OVF10.ova”. You need to deploy this OVA, and I would highly recommend requesting a DNS name for it and having it properly registered. I fiddled around with the hosts file on the VCSA, forgot to add the name to the local hosts file on my laptop, and had some weird issues as a result, which I am trying to reproduce at the moment; I will report back if/when I can.
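Since proper name resolution matters so much here, a tiny pre-flight sketch to verify that the appliance’s FQDN actually resolves before you deploy; the hostname is a placeholder for whatever name you registered:

# Pre-flight check: confirm the snapshot service appliance's FQDN resolves
# before deploying the OVA. Hostname below is a placeholder.
import socket

fqdn = "snapservice.example.com"   # assumption: the DNS name you registered

try:
    ip = socket.gethostbyname(fqdn)
    print(f"{fqdn} resolves to {ip}")
except socket.gaierror as err:
    print(f"{fqdn} does not resolve: {err} — fix DNS (or hosts files) first")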

The other thing to point out is the following: the documentation tells you to download the certs and copy the text in for the appliance, which isn’t something most of us do daily either. You can simply open a web browser and use the URL “https://<name of your vCenter server>/certs/download.zip” to download the certs, and then unzip the downloaded file. (More details can be found here.) It will contain the certs, and if you open the cert with a proper text editor you can copy/paste its contents into the deployment screen for the OVA. (Yes, I know there are other ways as well, but I found this one the easiest.)
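The same steps can be scripted. A minimal sketch using the download URL from above; the vCenter name is a placeholder, and the certs/lin/ layout inside the zip is an assumption based on my own download, so adjust the path to whatever your zip contains:

# Fetch vCenter's trusted-root cert bundle (the URL from the post), unzip it,
# and print the PEM files so they can be pasted into the OVA deployment screen.
import io
import zipfile
import requests

VCENTER = "vcenter.example.com"   # assumption: your vCenter FQDN

resp = requests.get(f"https://{VCENTER}/certs/download.zip", verify=False)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as bundle:
    for name in bundle.namelist():
        # assumption: Linux-format certs live under certs/lin/ as *.0 files
        if name.startswith("certs/lin/") and name.endswith(".0"):
            print(f"--- {name} ---")
            print(bundle.read(name).decode())   # PEM text to copy/paste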

Now, when you have deployed the OVA and everything is configured correctly, you should see a successful task, or actually two: “download plug-in” and “deploy plug-in”, as shown in the next screenshot.

If you do get the “error downloading plug-in” error message, it likely is one of two things:

  1. DNS / Hosts files are not correctly configured, resulting in the URL not being reachable. Make sure you can resolve the URL!
  2. Cert thumbprint was incorrectly copied/pasted; there’s a whole section on troubleshooting this here. (See the snippet below for a quick way to check the thumbprint you should be pasting.)
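For that second case, a minimal sketch that pulls the certificate straight off the endpoint and prints its thumbprint, so you can compare it against what you pasted; the hostname is a placeholder, and SHA-1 as the thumbprint format is an assumption on my part:

# Fetch a host's certificate and print its SHA-1 thumbprint in the familiar
# colon-separated form, to compare against what was pasted during deployment.
import hashlib
import ssl

host = "vcenter.example.com"   # or the snapshot service appliance FQDN

pem = ssl.get_server_certificate((host, 443))
der = ssl.PEM_cert_to_DER_cert(pem)
digest = hashlib.sha1(der).hexdigest().upper()
print(":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))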

Okay, now that I got the appliance up and running, I will probably do a follow-up post on what you can do with it 🙂

vSAN Stretched: Why is the witness not part of the cluster when the link between a data site and the witness fails?

Duncan Epping · Jun 25, 2024 · Leave a Comment

Last week I received a question about vSAN Stretched which had me wondering for a while what on earth was going on. The person who asked the question was running through several failure scenarios, some of which I have also documented in the past here. The question I got is: what is supposed to happen when I have the scenario shown in the following diagram, and the link between the preferred site (Site A) and the witness fails:

The answer, or at least what I thought, was simple: all VMs will remain running, or said differently, there’s no impact on vSAN. When doing the test, the outcome was indeed the one I had documented, which is also documented in the Stretched Clustering Guide and the PoC Guide: the VMs remain running. However, one thing that was noticed is that when this situation occurs, and the connection between Site A and the Witness is indeed lost, the witness is somehow no longer part of the cluster, which is not what I would expect. The reason I would not expect this to happen is that if a second failure were to occur, for instance the ISL between Site A and Site B going down, it would directly impact all VMs. At least, that is what I assumed.

However, when I triggered that second failure and disconnected the ISL between Site A and Site B, I saw the witness re-appear immediately, I saw the witness objects go from “absent” to “active”, and more importantly, all VMs remained running. The reason this happens is fairly straightforward: when running a configuration like this, vSAN has a “leader” and a “backup”, and they each run in a separate fault domain. Both the leader and the backup need to be able to communicate with the Witness for it to function correctly. If the connection between Site A and the Witness is gone, then either the leader or the backup can no longer communicate with the Witness, and the Witness is taken out of the cluster.

So why does the Witness return for duty when the second failure is triggered? Well, when the second failure is triggered, the leader is restarted in Site B (as Site A is deemed lost), while the backup is already running in Site B. As both the leader and the backup can communicate with the witness again, the witness returns for duty, and so do all of the witness components, automatically and instantly. This means that even though the ISL between Site A and B failed after the witness was taken out of the cluster, all VMs remain accessible, as the witness is reintroduced instantly to ensure availability of the workload. Pretty cool! (Thanks to vSAN engineering for providing these insights on why this happens!)
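To make the leader/backup/witness logic easier to follow, here is a toy model of the two failure steps. This is my own simplification of the behaviour described above, not vSAN’s actual clustering code:

# Toy model of the leader/backup/witness behaviour described above;
# a simplification for illustration, not vSAN's actual clustering logic.

def witness_in_cluster(leader_site, backup_site, witness_links):
    """The witness stays in the cluster only if BOTH the leader's and the
    backup's site can reach it."""
    return witness_links[leader_site] and witness_links[backup_site]

# Healthy: leader in Site A, backup in Site B, both can reach the witness.
links = {"A": True, "B": True}
print(witness_in_cluster("A", "B", links))   # True

# Failure 1: Site A loses its link to the witness -> witness drops out.
links["A"] = False
print(witness_in_cluster("A", "B", links))   # False

# Failure 2: the ISL between Site A and B fails; Site A is deemed lost, the
# leader restarts in Site B next to the backup. Both roles now sit in Site B,
# which still reaches the witness -> the witness rejoins instantly.
print(witness_in_cluster("B", "B", links))   # True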
