I am starting to get more questions about vSAN Adaptive Resync lately. This was introduced a while back, and is also available in the latest versions of vSAN through vSphere 6.5 Patch 02. As a result, various folks have started to look at it and are wondering what it is. Hopefully by now everyone understands what resync traffic is and when you see it. The easiest example, of course, is a host failure. If a host has failed, and there is sufficient disk space and there are additional hosts available to make the impacted VMs compliant with their policy again, then vSAN will resync the data.
Resync aims to finish the creation of these new components as soon as possible, and the simple reason for this is availability: the longer the resync takes, the longer you are at risk. I think that makes sense, right? In some cases, however, it may occur that VMs are very busy while a resync is happening, and VM observed latency goes through the roof. We already had a manual throttling mechanism for when this situation occurs, but of course preferably vSAN should throttle resync traffic properly for you. This is what vSAN Adaptive Resync does.
So how does that work? Well, when the high watermark for VM latency is reached, vSAN cuts the resync bandwidth in half. Next, vSAN checks whether VM latency is below the low watermark; if not, it cuts resync traffic in half again. It repeats this until latency is below the low watermark. When latency is below the low watermark, vSAN gradually increases resync bandwidth again until the low watermark is reached, and then stays at that level. (Some official info can be found in this KB, and this Virtual Blocks blog.)
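To make that control loop a bit more concrete, here is a minimal sketch in Python of the watermark logic described above. To be clear: this is not vSAN's actual code, and the watermark values, step size, and class/method names are made-up assumptions purely for illustration.

```python
class AdaptiveResyncThrottle:
    """Minimal sketch of watermark-based resync throttling (illustrative only)."""

    def __init__(self, max_bw_pct=100.0,
                 high_watermark_ms=50.0,   # assumed high latency threshold
                 low_watermark_ms=20.0,    # assumed low latency threshold
                 ramp_step_pct=5.0):       # assumed ramp-up increment
        self.bw_pct = max_bw_pct           # current resync bandwidth (% of max)
        self.max_bw_pct = max_bw_pct
        self.high = high_watermark_ms
        self.low = low_watermark_ms
        self.step = ramp_step_pct
        self.throttling = False            # True while we are backing off

    def update(self, vm_latency_ms):
        """Run one control cycle and return the new resync bandwidth."""
        if vm_latency_ms >= self.high:
            # High watermark hit: start backing off.
            self.throttling = True
        if self.throttling:
            if vm_latency_ms >= self.low:
                # Still not below the low watermark: cut bandwidth in half (again).
                self.bw_pct = max(1.0, self.bw_pct / 2)
            else:
                # Latency dropped below the low watermark: stop backing off.
                self.throttling = False
        elif vm_latency_ms < self.low:
            # Below the low watermark: gradually ramp resync bandwidth back up.
            self.bw_pct = min(self.max_bw_pct, self.bw_pct + self.step)
        # Otherwise (between the watermarks, not throttling): hold the current level.
        return self.bw_pct
```

If you fed this a latency sample each cycle, it would keep halving resync bandwidth while latency stays above the low watermark and then slowly ramp it back up once latency recovers, which is the behavior described above.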
Hope that helps,
Florian says
Hey Duncan, great article as usual 😉 Resync has always been challenging with vSAN, but the system has gotten better since we got PFTT/SFTT and other features improving data resiliency. I have a question on the topic, by the way: what happens when a local site no longer respects SFTT due to multiple disk/server losses?
A/ The entire site is marked as failed and all VMs are restarted on the remaining site
B/ Only the affected components are marked as failed and the cluster uses remote resources to restart the VMs (or rebuild the data)
The documentation is not clear on the topic.