Software Defined

vSAN 7.0 U2 now integrates with vSphere DRS

Duncan Epping · Mar 24, 2021 ·

One of the features our team requested a while back was integration between DRS and vSAN. The key use case we had was for stretched clusters. Especially in scenarios where a failure has occurred, it would be useful if DRS would understand what vSAN is doing. What do I mean by that?

Today when customers create a stretched cluster they have two locations. Using vSAN terminology these locations are referred to as the Preferred Fault Domain and the Secondary Fault Domain. Typically when VMs are then deployed, customers will create VM-to-Host Affinity Rules which state that VMs should reside in a particular location. When these rules are created DRS will do its best to ensure that the defined rule is adhered to. What is the problem?

Well, if you are running a stretched cluster and, let’s say, one of the sites goes down, then the following happens when the failed location returns for duty:

vSAN detects that the missing components are available again
vSAN will start the resynchronization of the components
DRS runs every minute, rebalances, and will move VMs based on the DRS rules

This means that the VMs for which rules are defined will move back to their respective location, even though vSAN is potentially still resynchronizing the data. First of all, the migration will interfere with the replication traffic. Secondly, for as long as the resync has not completed, I/O will traverse the network between the two locations. This will not only interfere with resync traffic, it will also increase latency for those workloads. So, how does vSAN 7.0 U2 solve this?

Starting with vSAN 7.0 U2 and vSphere 7.0 U2, DRS and vSAN now communicate. DRS will verify the state of the environment with vSAN, and it will not migrate the VMs back until the VMs are healthy again. When the VMs are healthy and the resync has completed, you will see the rules being applied and the VMs automatically migrate back (when DRS is configured to Fully Automated, that is).
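One way to picture this behavior is as a simple gate in front of the rule-driven migrations. The Python sketch below is purely illustrative: `VmState`, `resync_pending`, and `vms_to_migrate_back` are made-up names and not a VMware API. It only models the decision described in this post.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    """Hypothetical view of one VM, as a placement engine might see it."""
    name: str
    current_site: str      # site the VM is running in right now
    preferred_site: str    # site required by the VM-to-Host rule
    resync_pending: bool   # True while vSAN is still resynchronizing its objects


def vms_to_migrate_back(vms, fully_automated=True):
    """Return the names of VMs that may be moved back to their preferred site.

    A VM is eligible only when it violates its rule AND its resync has
    completed. Without Fully Automated mode nothing moves automatically.
    """
    if not fully_automated:
        return []
    return [
        vm.name
        for vm in vms
        if vm.current_site != vm.preferred_site and not vm.resync_pending
    ]


vms = [
    VmState("vm-a", "secondary", "preferred", resync_pending=True),
    VmState("vm-b", "secondary", "preferred", resync_pending=False),
    VmState("vm-c", "preferred", "preferred", resync_pending=False),
]
print(vms_to_migrate_back(vms))  # ['vm-b']
```

In this toy model vm-a stays where it is until its resync finishes, which is exactly the case that used to cause the extra cross-site traffic.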

I can’t really show it with a screenshot or anything, as this is a change in the vSAN/DRS architecture, but to make sure it worked I recorded a quick demo which I published through YouTube. Make sure to watch the video!

vSAN 7.0 U2 Durability Components?

Duncan Epping · Mar 22, 2021 ·

Last week I published a new demo on my YouTube channel (at the bottom of this post) and it discussed an enhanced feature called Durability Components. Some may know these as “delta components” as well. These durability components were introduced in vSAN 7.0 Update 1 and provided a mechanism to maintain the required availability for VMs while doing maintenance. This means that when you place a host into maintenance mode, new “durability components” are created for the components which were stored on that host. This then allows all new VM I/O to be committed to the existing component as well as the durability component.

Now, starting with vSAN 7.0 Update 2, vSAN also uses these durability components in situations where a host failure has occurred. So if a host has failed, durability components will be created to ensure we still maintain the availability level specified within the policy, as shown in the diagram above. The great thing is that if a second host fails in an FTT=1 scenario and you are able to recover the first failed host, we can still merge the data on the first failed host with the durability component! So not only are these durability components great for improving the resync times, but they also provide a higher level of availability to vSAN! To summarize:

Host fails
Durability components are created for all impacted objects
New writes are committed to existing components and the new durability components
Host recovers
Durability components are merged with the previously failed components
Durability components are deleted when resync has completed
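The steps above can be walked through with a tiny simulation. This is a conceptual sketch only: the `Component` class and the block-level dictionaries are invented for illustration and say nothing about how vSAN stores data internally.

```python
class Component:
    """Hypothetical stand-in for a vSAN component: a map of block -> data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data


# FTT=1 mirror: two replicas of the same object on two different hosts.
replica_a = Component("replica-on-host-a")
replica_b = Component("replica-on-host-b")
for replica in (replica_a, replica_b):
    replica.write(0, "old")

# 1. Host B fails, so replica_b stops receiving writes.
# 2. A durability component is created for the impacted object.
durability = Component("durability-for-replica-b")

# 3. New writes go to the existing replica and the durability component.
for target in (replica_a, durability):
    target.write(0, "new")
    target.write(1, "extra")

# 4. Host B recovers.
# 5. The durability component is merged with the previously failed component.
replica_b.blocks.update(durability.blocks)

# 6. Once the resync has completed, the durability component is deleted.
durability = None

print(replica_a.blocks == replica_b.blocks)  # True
```

Note that step 5 only needs the stale replica and the durability component. That is why, in this model, the merge would still work if host A were lost in the meantime, which is the second-failure case mentioned earlier.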

I hope that helps provide a better understanding of how these durability components help improve availability/resiliency in your environment with vSAN 7.0 Update 2.

I can understand that some of you may not want to test durability components in your own environment, which is why I recorded a quick demo and published it on my YouTube channel. Check out the video below, as it also shows you how durability components are represented in the UI.

vSAN 7.0 U2 Skyline Health History

Duncan Epping · Mar 15, 2021 ·

I have already discussed this briefly during my 7.0 U2 video presentation, which can be found here, but I also wanted to share the demo I recorded with you, and provide some additional details. Over the past years, one of the most requested features for Skyline Health, or Health Check as it used to be called, was the ability to go back in time to see what has happened between certain dates. This functionality was demonstrated at VMworld and at a few VMUGs over the past year, and has now finally made it into the release with vSAN 7.0 U2.

The “health history” feature simply provides a toggle that enables you to go back in time. When you flip the toggle, you can specify a date range. Note that the range can be anywhere between the current date and 30 days back. The health check data is stored for 30 days in the vCenter Server database. This is important to know because if you feel that there’s no need to store the data, you can disable the feature under vSAN Services in the Configuration tab. When you disable the feature, the data is deleted from the vCenter Server database. Now if you flip the toggle, set a date range, and look at the different checks, you should see green checks. If a check is not green, but rather red or orange, you can click the check.
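To make the 30-day window concrete, here is a minimal Python sketch of the range check. The function name and the constant are my own and not part of any vSAN or vCenter API. The only thing taken from the product is the 30-day retention period.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention period mentioned in this post


def is_valid_history_range(start, end, today):
    """Return True when [start, end] lies within the last 30 days up to today."""
    earliest = today - timedelta(days=RETENTION_DAYS)
    return earliest <= start <= end <= today


today = date(2021, 3, 15)
print(is_valid_history_range(date(2021, 2, 20), date(2021, 3, 10), today))  # True
print(is_valid_history_range(date(2021, 1, 1), date(2021, 3, 10), today))   # False
```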

When clicking on the red square, you will see which check failed, and when exactly it failed. The number in the square or circle refers to the number of checks that were run and resulted in the same state. In other words, 37 red, 45 green, 54 red checks etc. When you click on it, you will see the below.

Hopefully, this will then allow you, during troubleshooting, to correlate particular failures or configuration changes to the issue that bubbled up in the health check. I feel that having the date/time is already very useful, as it will allow you to focus on a more specific date/time range while reading logs or going through the events section.
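As for the numbers shown in the squares and circles: my reading is that consecutive checks with the same result are grouped together. Assuming that is the case, the grouping looks roughly like the Python sketch below. The function name and the sample data are made up for illustration.

```python
from itertools import groupby


def summarize_runs(results):
    """Collapse consecutive identical check results into (state, count) pairs."""
    return [(state, len(list(group))) for state, group in groupby(results)]


# Made-up history: 3 failed runs, then 5 healthy runs, then 2 failed runs.
history = ["red"] * 3 + ["green"] * 5 + ["red"] * 2
print(summarize_runs(history))  # [('red', 3), ('green', 5), ('red', 2)]
```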

Anyway, if you would like to see the feature in action, check out the demo below.

I joined the Futr Tech Podcast last week, check out the episode here!

Duncan Epping · Feb 16, 2021 ·

Last week I had the pleasure of joining Chris and Sandesh on the Futr Tech podcast. The episode was just published online, and I wanted to share it with all of you via this blog post. Make sure to watch/listen to the episode and subscribe to the YouTube channel or podcast. I’ve been following these guys for a while, and there are some very interesting conversations to check out. (I enjoyed the episode with Bipul Sinha very much.)

You can find them on YouTube here, or add them to your podcast app of choice (Buzzsprout, Spotify, iTunes). I had fun, and I am looking forward to some more podcasting in 2021!

What if the disk controller driver included in my vendor’s ESXi image is not on the vSAN HCL?

Duncan Epping · Jan 15, 2021 ·

Unfortunately, there are sometimes situations where a vendor’s ESXi image includes a disk controller driver that is not on the vSAN HCL/VCG (VMware Compatibility Guide). Typically it is a new version of the driver which is supported for vSphere, but not yet for vSAN. In that situation, what should you do? So far there are two approaches I have seen customers take:

1. Keep running with the included driver and wait for the driver to be supported and listed on the vSAN HCL/VCG
2. Downgrade the driver to the version which is listed on the vSAN HCL/VCG

Personally, I feel that option 2 is the correct way to go. The reason is simple: vSphere and vSAN have different certification requirements for disk controllers, and the vSAN certification criteria are simply more stringent than vSphere’s. Hence, you sometimes see vSAN skipping certain driver versions, which usually means a version did not pass the tests. Now, of course, you could keep running the driver and wait for it to appear on the vSAN HCL/VCG. If, however, you hit a problem, VMware Support will always ask you first to bring the environment to a fully supported state. Personally, I would not want the extra stress while troubleshooting. But that is my experience and preference. Just to be clear, from a VMware stance, there’s only one option, and that is option two: downgrade to the supported version!
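If you want to script the comparison, the logic could look something like the Python sketch below. Everything in it is an assumption for illustration: the version strings are made up, real driver versions are often not plain dotted numbers, and the list of certified versions has to come from the vSAN HCL/VCG entry for your controller.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def recommended_driver(installed, certified):
    """Return the driver version to run, or None when no listed version fits.

    If the installed version is listed, keep it. Otherwise pick the newest
    listed version that is older than the installed one (the downgrade target).
    """
    if installed in certified:
        return installed
    older = [v for v in certified if parse_version(v) < parse_version(installed)]
    if not older:
        return None
    return max(older, key=parse_version)


certified = ["1.2.0", "1.4.0"]  # hypothetical versions listed on the HCL/VCG
print(recommended_driver("1.5.0", certified))  # 1.4.0
print(recommended_driver("1.4.0", certified))  # 1.4.0
```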