vSAN

Changing the vSAN Skyline Health Interval

Duncan Epping · Feb 8, 2022 ·

On the VMTN forum, Lars asked a great question: how do you change the vSAN Skyline Health interval? This used to be an option in the UI before vSphere 7.0, but it now seems to have disappeared. I never really touched it, so at first I had completely forgotten it was even an option. As vSAN also has an extensive CLI through “RVC”, and I had used RVC before to disable a particular health check, I figured this might also be a configurable setting, and indeed it is. It is rather straightforward:

SSH to your vCenter Server instance and open RVC. I use the following command to open an RVC session:

rvc administrator@vsphere.local@localhost

I then “cd” into my vSAN cluster object. If you are not sure of the path, simply do an “ls” after each “cd” to see what the current directory contains. My complete tree looks like this:

/localhost/Datacenter/computers/Cluster

When you are at the cluster level, simply check the currently configured interval:

vsan.health.health_check_interval_status .

Next, you can configure the new interval. The default setting is 60 minutes, but you can change it to anywhere between 15 minutes and 1 day. I am configuring it to 15 minutes:

vsan.health.health_check_interval_configure -i 15 .
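
If you want to script this, a minimal Python sketch could validate the interval against the range mentioned above before building the RVC command line. Note that build_interval_command is a hypothetical helper name, and I am assuming the -i value is expressed in minutes, as in the example above. Only the command name and the -i flag come from this post.

```python
# Bounds taken from the post: 15 minutes up to 1 day (1440 minutes).
MIN_INTERVAL_MINUTES = 15
MAX_INTERVAL_MINUTES = 24 * 60


def build_interval_command(minutes: int, cluster_path: str = ".") -> str:
    """Return the RVC command line that sets the health check interval.

    Hypothetical helper: it only builds the string, it does not run RVC.
    """
    if not MIN_INTERVAL_MINUTES <= minutes <= MAX_INTERVAL_MINUTES:
        raise ValueError(
            f"interval must be between {MIN_INTERVAL_MINUTES} and "
            f"{MAX_INTERVAL_MINUTES} minutes, got {minutes}"
        )
    return f"vsan.health.health_check_interval_configure -i {minutes} {cluster_path}"


if __name__ == "__main__":
    # Same setting as used in the post: 15 minutes, current cluster object.
    print(build_interval_command(15))
```

The resulting string can then be pasted into an RVC session at the cluster level.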

vSAN 7.0 U3 enhanced stretched cluster resiliency, what is it?

Duncan Epping · Oct 4, 2021 ·

I briefly discussed the enhanced stretched cluster resiliency capability in my vSAN 7.0 U3 overview blog. Of course, questions immediately started popping up. I didn’t want to go too deep in that post, as I figured I would do a separate post on the topic sooner or later. What does this functionality add, and in which particular scenario?

In short, this enhancement to stretched clusters prevents downtime for workloads in a particular failure scenario. So the question then is, what failure scenario? Let’s take a look at this diagram first of a typical stretched vSAN cluster deployment.

If you look at the diagram you see the following: Datacenter A, Datacenter B, Witness. One of the situations customers have found themselves in is that Datacenter A goes down (unplanned). This of course leads to the VMs in Datacenter A being restarted in Datacenter B. Unfortunately, sometimes when things go wrong, they go wrong badly, and in some cases the Witness fails or disappears next. Why? Bad luck, networking issues, etc. Bad things just happen. If and when this happens, there is only one location left, which is Datacenter B.

Now you may think that because Datacenter B typically has a full RAID set of the VMs running, they will remain running, but that is not true. vSAN looks at the quorum of the top layer, so if 2 out of 3 datacenters disappear, all impacted objects become inaccessible simply because quorum is lost! Makes sense, right? And we are not just talking about failures. It could also be that Datacenter A has to go offline for maintenance (planned downtime) and at some point the Witness fails for whatever reason. This would result in the exact same situation: objects inaccessible.

Starting with 7.0 U3 this behavior has changed. If Datacenter A fails, and a few (let’s say 5) minutes later the witness disappears, all replicated objects would still be available! So why is this? Well in this scenario, if Datacenter A fails, vSAN will create a new votes layout for each of the objects impacted. It basically will assume that the witness can fail and give all components on the witness 0 votes, on top of that it will give the components in the active site additional votes so that we can survive that second failure. If the witness would fail, it would not render the objects inaccessible as quorum would not be lost.
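
To make the quorum math a bit more tangible, here is a small Python sketch. The vote counts are hypothetical numbers I picked for illustration, not the values vSAN actually assigns, and has_quorum is a made-up helper. The only rule it relies on is that an object stays accessible while more than half of its votes are reachable.

```python
def has_quorum(votes: dict[str, int], available: set[str]) -> bool:
    """Return True when more than half of all votes are still reachable."""
    total = sum(votes.values())
    reachable = sum(v for location, v in votes.items() if location in available)
    return 2 * reachable > total


# Hypothetical votes layout before 7.0 U3 (layout never changes).
original = {"datacenter_a": 2, "datacenter_b": 2, "witness": 1}

# Hypothetical layout after Datacenter A failed and vSAN re-voted:
# witness components get 0 votes, the surviving site gets extra votes.
adjusted = {"datacenter_a": 2, "datacenter_b": 3, "witness": 0}

if __name__ == "__main__":
    # Datacenter A down, witness still up: quorum holds in both layouts.
    print(has_quorum(original, {"datacenter_b", "witness"}))
    # Witness fails as well: only the adjusted layout keeps quorum.
    print(has_quorum(original, {"datacenter_b"}))
    print(has_quorum(adjusted, {"datacenter_b"}))
```

With the original layout, Datacenter B alone holds 2 out of 5 votes, which is not a majority. With the adjusted layout it holds 3 out of 5, so the object remains accessible.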

Now, do note, when a failure occurs and Datacenter A is gone, vSAN will have to create a new votes layout for each object. Typically this takes a few seconds per object, and it is done object by object, so if you have a lot of VMs (and a VM consists of various objects) it will take some time. How long? Well, it could be five minutes. If the witness goes down before that process completes, not all objects may have been processed, which would result in downtime for the VMs whose objects still have the old votes layout, as quorum would be lost for those objects.
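
As a rough back-of-the-envelope sketch of that window, assuming objects are processed strictly one after another, you could estimate it as follows. The function name, the object count, and the 3 seconds per object are hypothetical example values, not measurements.

```python
def revote_window_seconds(object_count: int, seconds_per_object: float) -> float:
    """Estimate how long re-voting takes when objects are handled one by one."""
    return object_count * seconds_per_object


if __name__ == "__main__":
    # Hypothetical example: 100 objects at roughly 3 seconds each.
    seconds = revote_window_seconds(100, 3.0)
    print(f"{seconds / 60:.0f} minutes")
```

In this example, 100 objects at 3 seconds each add up to 300 seconds, which is the five minutes mentioned above.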

What happens if Datacenter A (and the Witness) return for duty? Well at that point the votes would be restored for the objects across locations and the witness.

Pretty cool right?!

vSAN 7.0 U3: IO Trip Analyzer

Duncan Epping · Sep 28, 2021 ·

In vSAN 7.0 U1 a new feature was introduced called IO Insight. IO Insight basically enabled customers to profile workloads and it provided them with a lot of information around the type of IO the workload was producing. In vSAN 7.0 U3 this is taken one step further with the IO Trip Analyzer. The IO Trip Analyzer provides details around the, yes you guessed it, trip of the IO. It basically informs you about the latency introduced at the various stages of the path the IO has to travel to end up on the capacity layer of vSAN.

Why would you need this? This tool is going to be very useful and will complement IO Insight when it comes to doing performance troubleshooting, or when it comes to getting a better understanding of the IO path. IO Trip Analyzer can be easily enabled by going to the “Monitor” section of the VM you want to enable it for.

You simply click “Run new test” and then specify for how long you want to analyze the VM (between 5 and 60 minutes). When the specified amount of time has passed, you simply click on “View Result”, which will then show you a diagram of the VM and its components.

When you then click on one of the dots, you will be able to see what kind of latency is introduced on the layer. It will provide you a potential cause for the latency and it will provide you some insights in terms of how you can potentially resolve the latency. Also, if there’s a significant amount of latency introduced then of course the diagram will show this through colors for the respective layer where the latency is introduced.

Before I share the demo, I should probably mention that there are some limitations in this first release of the IO Trip Analyzer (not supported: Stretched Clusters, CNS persistent volumes, iSCSI, etc.), but I suspect those limitations will be lifted with every follow-up release of vSAN. I truly feel that this is a big improvement when it comes to performance troubleshooting, and I can’t wait to see what the vSAN team is planning for the future of this feature.

Booting ESXi from SD/USB devices? Time to reconsider when buying new hardware!

Duncan Epping · Sep 17, 2021 ·

We’ve all seen those posts from people about worn-out SD/USB devices, or maybe even experienced it ourselves at some point in time. Most of you reading this probably also knew there was an issue with 7.0 U2, which resulted in USB/SD devices wearing out a lot quicker. Those issues have been resolved with the latest patch for 7.0 U2. It has, however, resulted in a longer debate around whether SD/USB devices should still be used for booting ESXi, and it seems that the jury has reached a verdict.

On the 16th of September, a KB article was published by VMware, which contains statements around the future of SD/USB devices. I can be short about it: if you are buying new hardware, make sure to have a proper persistent storage device. USB/SD is not the right choice going forward! Why? The volume of reads/writes to and from the OS-DATA partition continues to increase with every release, which means that the lower grade devices will simply wear out faster. Now, I am not going to repeat word for word what is mentioned in the KB. I would just like to urge everyone to read the KB article, and make sure to plan accordingly! Personally, I am a fan of M.2 flash devices for booting. They are not too expensive (greenfield deployments), plus they can provide enterprise-grade persistent storage to store all your ESXi related data. Make sure to follow the requirements around endurance though!

vSAN Storage Rules policy capability allows to set dedupe per VM?

Duncan Epping · Aug 24, 2021 ·

There was a question posted on the VMware Community Forums, and as this is something I have been asked regularly, I figured I would do a quick blog post about it. Although I have covered this before, it doesn’t hurt to repeat, as it appears to be somewhat confusing for people. When you create a VM Storage Policy, starting with vSAN 7.0 U2 you have the ability to specify if a VM needs to be Encrypted, have Dedupe and Compression enabled, have Compression-Only enabled, and/or needs to be stored on all-flash vSAN or Hybrid. Never noticed it? Look at the screenshot below.

In the screenshot, you see that you have the ability to specify which data service needs to be enabled. I guess this is where the confusion comes into play, as this functionality is not about enabling the data service for the VM to which you assign the policy. This is about which data service needs to be enabled on the datastore to which the VM can be provisioned. Huh, what? Okay, let’s explain.

If you are using vSAN as your storage platform, and you are sharing vSAN Datastores between clusters leveraging the HCI Mesh feature, then you could find yourself in a situation where some clusters are hybrid and some are all-flash. Some may have data services enabled like Encryption or Deduplication, some may not. In that scenario you want to be able to specify which features need to be enabled for the datastore the VM is provisioned to. So what this “storage rules” feature does is ensure that the datastore which is shown as “compatible” actually has the specified capabilities enabled! In other words, if you tick “data-at-rest encryption” in a policy and assign the policy to a VM, then only the datastores which have “data-at-rest encryption” enabled will be shown as compatible with your VM!

So again, “storage rules” apply to the data services that should be enabled on the vSAN Datastore, and do not enable data services on a per VM/VMDK basis.
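
Conceptually, you can think of “storage rules” as a filter over datastores. The Python sketch below only illustrates that idea. It is not how vSAN or SPBM implements it, and the Datastore class, the compatible_datastores function, and the datastore names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    """Hypothetical model of a vSAN datastore and the services enabled on it."""
    name: str
    services: frozenset[str]


def compatible_datastores(required: set[str], datastores: list[Datastore]) -> list[str]:
    """Return the datastores that have every required data service enabled."""
    return [ds.name for ds in datastores if required <= ds.services]


if __name__ == "__main__":
    # Hypothetical environment: two clusters sharing datastores via HCI Mesh.
    datastores = [
        Datastore("vsan-cluster-a", frozenset({"data-at-rest encryption", "all-flash"})),
        Datastore("vsan-cluster-b", frozenset({"all-flash"})),
    ]
    # A policy with "data-at-rest encryption" ticked only matches cluster A.
    print(compatible_datastores({"data-at-rest encryption"}, datastores))
```

Note that nothing in this filter turns a data service on or off. It only decides which datastores show up as compatible.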

<Update for vSAN 8.0 ESA>

With vSAN 8.0 ESA the above has changed. With vSAN 8.0 ESA you can set compression per VM, and you actually do that using the policy storage rules. I discussed this in this blog post.