Last week I published a new demo on my YouTube channel (at the bottom of this post) discussing an enhanced feature called Durability Components, which some may also know as “delta components”. Durability components were introduced in vSAN 7.0 Update 1 and provide a mechanism to maintain the required availability for VMs during maintenance. This means that when you place a host into maintenance mode, new “durability components” are created for the components stored on that host. All new VM I/O is then committed to both the existing component and the durability component.
Now, starting with vSAN 7.0 Update 2, vSAN also uses these durability components when a host failure has occurred. So if a host has failed, durability components are created to ensure the availability level specified within the policy is still maintained, as shown in the diagram above. The great thing is that if a second host fails in an FTT=1 scenario and you are able to recover the first failed host, vSAN can still merge the durability component data back with the components on that first failed host! So not only are durability components great for improving resync times, they also provide a higher level of availability to vSAN! To summarize:
- Host fails
- Durability components are created for all impacted objects
- New writes are committed to existing components and the new durability components
- Host recovers
- Durability components are merged with the previously failed components
- Durability components are deleted when resync has completed
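The lifecycle above can be sketched in a few lines of Python. This is a hypothetical toy model, not vSAN code: the `Component` and `VsanObject` classes and their method names are my own invention, purely to illustrate why the merge step only needs to resync the delta captured while the host was away.

```python
# Toy model (hypothetical, not vSAN code) of the durability component
# lifecycle during a host failure, for an FTT=1 object with two mirrors.

class Component:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block address -> data
        self.available = True

class VsanObject:
    """An FTT=1 object mirrored across two hosts."""
    def __init__(self):
        self.mirrors = [Component("1a"), Component("1b")]
        self.durability = None    # created only when needed

    def host_fails(self, idx):
        self.mirrors[idx].available = False
        # A durability component is created for the impacted component,
        # so new writes still land in two places.
        self.durability = Component("delta")

    def write(self, addr, data):
        # New writes are committed to every available mirror
        # and to the durability component.
        for c in self.mirrors:
            if c.available:
                c.blocks[addr] = data
        if self.durability is not None:
            self.durability.blocks[addr] = data

    def host_recovers(self, idx):
        stale = self.mirrors[idx]
        # Merge: only the delta written while the host was down
        # needs to be resynced, not the full component.
        stale.blocks.update(self.durability.blocks)
        stale.available = True
        self.durability = None    # deleted once resync completes


obj = VsanObject()
obj.write(0, "A")                 # both mirrors in sync
obj.host_fails(0)                 # host holding mirror 1a goes down
obj.write(1, "B")                 # committed to 1b and the delta component
obj.host_recovers(0)              # delta merged back into 1a
print(obj.mirrors[0].blocks)      # {0: 'A', 1: 'B'}
```

The key point the model shows: because the delta only holds writes made during the outage, the resync on recovery is small and fast compared to rebuilding the full component.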
I hope that helps provide a better understanding of how these durability components improve availability/resiliency in your environment with vSAN 7.0 Update 2.
I can understand that some of you may not want to test durability components in your own environment, which is why I recorded a quick demo and published it on my YouTube channel. Check out the video below, as it also shows how durability components are represented in the UI.
Victor Forde says
Great write up! Just two (similar) queries to clear it up in my own head. I understand that in an FTT=1 scenario, when a host is down or in maintenance mode, durability components are created and the action is similar to a Redirect-On-Write snapshot. 1.) If, as you said, the remaining mirror/host fails, the VM is offline; I take it the full VM objects are not available? When the host comes back online it will be synchronising, but will it service IO before the sync is complete? If it has the original and the delta, could it service reads? Or writes? Just trying to see if there is a pause in IO to the VM until the re-sync is complete. 2.) The other mirror/host did not fail in this scenario, but when the host comes out of maintenance mode, for example, will all IO be serviced from the other mirror, or can the mirror that is re-syncing service IO?
Duncan Epping says
Hi Victor. First and foremost, a mirror that is not in-sync will not serve IO. As you can imagine, serving IO from a mirror that is out of sync could be very problematic for the workload if unexpected/incorrect data is returned. When it comes to the scenario where a sync needs to happen from the delta component to the component which is not in-sync, at this stage the following would be the situation:
– host 1 goes down with component 1a
– delta component is created for component 1a on Host 3
– VM writes, delta component is updated and so is component 1b on Host 2
– Host 2 goes down
– VM cannot access disks at this point, as both component 1a and 1b are inaccessible, which means quorum is lost
– Host 1 returns, until component 1a is synced the component would be inaccessible, so the VM still can’t write to disk
– Sync completed from Delta component, disk accessible, VM can now write to disk again
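The accessibility rule in the steps above can be sketched as a small Python check. This is a simplified, hypothetical model (it ignores witness components and vSAN's actual vote/quorum mechanics, and `object_accessible` is my own name): a stale mirror never serves IO, and a delta component alone cannot make the object accessible.

```python
# Hypothetical sketch of the accessibility rule from the scenario above.
# Simplification: ignores witness/vote mechanics; the object is treated
# as accessible only if at least one full, in-sync mirror is reachable.
# A durability (delta) component on its own cannot serve IO.

def object_accessible(mirror_states):
    """mirror_states: one of 'in-sync', 'stale', or 'down' per mirror."""
    return any(s == "in-sync" for s in mirror_states)

# Host 1 down: 1a gone, 1b in sync -> accessible (delta tracks new writes)
print(object_accessible(["down", "in-sync"]))      # True
# Host 2 also down -> no in-sync mirror, VM cannot access its disks
print(object_accessible(["down", "down"]))         # False
# Host 1 returns, but 1a is stale until the delta resync finishes
print(object_accessible(["stale", "down"]))        # False
# Resync from the delta component completes -> accessible again
print(object_accessible(["in-sync", "down"]))      # True
```

This matches the sequence in the comment: the VM's IO pauses from the moment the second host fails until the returned component has fully resynced from the delta component.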
Now mind you, I have not explicitly tested this situation, this is based on my theoretical understanding of the durability component, so I will validate it with the developer to ensure I understand this correctly 🙂
Duncan Epping says
Just went over some internal documentation, and my understanding is indeed correct. The component will be accessible when the resync between the durability component and the returned component has been completed.
Victor Forde says
Thanks Duncan. Appreciate the thorough response. Yes, you are right: updating (writing to) blocks that are already being updated by the re-sync would not be advisable, and it is better to get to a known good state before continuing to service IO to the VM.
Hi Duncan, why do durability components appear after putting a host into maintenance mode (Ensure accessibility)?
Duncan Epping says
To ensure your new data is still compliant with FTT=1?
OK, reading the article I had understood that it was related to host failure, but it seems that you are right.
We have a stretched cluster scenario (11+11 hosts) with some VMs using the vSAN default storage policy (RAID-5, no mirror, no site preference) that have non-persistent disks, and we see 2 components in one fault domain and 2 components in the other fault domain, but only for the non-persistent disk components. When we put the hosts in the non-preferred site into maintenance mode, the last host is not able to enter, while the previous one enters maintenance mode with “Ensure accessibility” and a durability component is created. Are we missing something? We have to power off this site, and we performed the same procedure on the other site and all worked fine. BTW, we have opened a case with support and we are arranging a meeting with our TAM. Thank you in advance.