I have had this question multiple times by now. I wanted to answer it in the Virtual SAN FAQ, but I figured I would need some diagrams and probably more than two or three sentences to explain it. How are host or disk failures in a Virtual SAN cluster handled? Let's start at the beginning, and I am going to try to keep it simple.
I explained some of the basics in my VSAN intro post a couple of weeks back, but it never hurts to repeat them. I think it is good to explain the IO path first before talking about failures. Let's look at a 4-host cluster with a single VM deployed. This VM is deployed with the default policy, meaning a “stripe width” of 1 and “failures to tolerate” of 1. When deployed in this fashion, the following is the result:
In this case you can see: 2 mirror copies of the VMDK and a witness. These mirrors, by the way, are identical: they are exact copies of each other. What else did we learn from this (hopefully) simple diagram?
- A VM does not necessarily have to run on the same host as where its storage objects are sitting
- The witness lives on a different host than the components it is associated with in order to create an odd number of hosts involved for tiebreaking under a network partition
- The VSAN network is used for communication / IO etc
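To make the witness's tiebreaking role a bit more concrete, here is a minimal sketch in Python. This is hypothetical illustration, not actual VSAN code: the idea is simply that each mirror and the witness counts as a vote, and during a network partition an object stays accessible only on the side that can reach a strict majority of those votes.

```python
# Hypothetical sketch (not VSAN code): why the witness creates an odd vote count.
# Two mirrors + one witness = 3 votes, so a partition can never end in a tie.

def object_accessible(reachable_votes, total_votes):
    """An object needs a strict majority (> 50%) of its votes to stay accessible."""
    return reachable_votes > total_votes / 2

TOTAL_VOTES = 3  # 2 mirror components + 1 witness

# If a partition isolates one mirror, the side holding the other mirror
# plus the witness still has 2 of 3 votes, so the object stays accessible.
print(object_accessible(2, TOTAL_VOTES))  # True

# The side with only the isolated mirror has 1 of 3 votes: not accessible.
print(object_accessible(1, TOTAL_VOTES))  # False
```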
Okay, so now that we know these facts, it is also worth knowing that VSAN will never place both mirrors on the same host, for availability reasons. When a VM writes, the IO is mirrored by VSAN and will not be acknowledged back to the VM until all mirrors have completed. Meaning that in the example above, the acknowledgements from both “esxi-02” and “esxi-03” will need to have been received before the write is acknowledged to the VM. The great thing here though is that all writes go to flash/SSD; this is where the write buffer comes into play. At some point in time VSAN will then destage the data to your magnetic disks, but this happens without the guest VM knowing about it…
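That write path can be sketched roughly as follows. This is a toy model, not VSAN internals: the `Replica` class and its method names are made up for illustration, but it shows the two points that matter, namely that the ack goes back only after every mirror has confirmed, and that destaging to magnetic disk happens separately and invisibly to the guest.

```python
# Hypothetical sketch (not actual VSAN internals) of the mirrored write path.

def mirrored_write(data, replicas):
    """Write to all replicas' flash write buffers; ack only when all confirm."""
    acks = [replica.write_to_ssd_buffer(data) for replica in replicas]
    return all(acks)  # the VM sees the write complete only if every mirror acked

class Replica:
    def __init__(self, host):
        self.host = host
        self.buffer = []  # stands in for the SSD write buffer

    def write_to_ssd_buffer(self, data):
        self.buffer.append(data)  # lands in the flash write buffer first
        return True               # ack back to the object owner

    def destage(self, magnetic_disk):
        # Happens later, asynchronously; the guest VM never notices this step.
        magnetic_disk.extend(self.buffer)
        self.buffer.clear()

replicas = [Replica("esxi-02"), Replica("esxi-03")]
print(mirrored_write("block-42", replicas))  # True: both mirrors acked
```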
But let's talk about failure scenarios, as that is why I started writing this in the first place. Let's take a closer look at what happens when a disk fails. The following diagram depicts the scenario where the magnetic disk of “esxi-03” fails.
In this scenario you can see that the disk of “esxi-03” has failed. VSAN responds to this type of failure by marking all impacted components (the VMDK in this example) as “degraded” and immediately creating a new mirror copy. Of course, before it creates this mirror, VSAN will validate whether there are sufficient resources to store the new copy. The great thing here is that the virtual machine will not notice this. Well, that is not entirely true of course… The VM could be impacted performance-wise if reads need to come from disk, as in this case there is only 1 copy left instead of the 2 before the failure.
One thing I found interesting is that if there are not enough resources to create that mirror copy, VSAN will simply wait until resources are added. Once you have added a new disk, or even a host, the recovery will begin. In the meanwhile the VM can still do IO as mentioned before, so the VMs continue to operate as normal.
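Putting those two paragraphs together, the placement decision on a disk failure can be sketched like this. Again a hypothetical model: the function name, the capacity numbers, and the simple first-fit selection are my own assumptions for illustration, not how VSAN's placement logic actually works. The key behaviors are that hosts already holding a copy of the object (and the failed host) are excluded, and that `None` corresponds to the “wait until resources are added” case.

```python
# Hypothetical sketch of "degraded" handling: rebuild immediately if capacity
# exists on an eligible host, otherwise wait for new resources.

def pick_rebuild_target(size_gb, free_gb_per_host, exclude):
    """Pick a host with enough free space, skipping excluded hosts.

    Returns the chosen host, or None when VSAN would have to wait
    until a new disk or host is added.
    """
    for host, free_gb in free_gb_per_host.items():
        if host not in exclude and free_gb >= size_gb:
            return host  # start building the new mirror copy here
    return None  # insufficient resources: VM keeps running on the surviving mirror

cluster_free_gb = {"esxi-01": 500, "esxi-02": 20, "esxi-04": 300}
# Exclude the failed host (esxi-03) and the host holding the surviving mirror.
print(pick_rebuild_target(100, cluster_free_gb, exclude={"esxi-02", "esxi-03"}))
# → esxi-01
```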
So now that we know how VSAN handles a disk failure, what happens if a host fails? Let's paint a picture again:
This scenario is slightly different from the “disk failure” one. In the case of the disk failure VSAN knew what happened; it knew the disk wasn't coming back… But in the case of a host failure it doesn't. This failure state is called “absent”. As soon as VSAN realizes a component (the VMDK in the example above) is absent, a timer of 60 minutes will start, as explained in the VSAN FAQ. If the component comes back within those 60 minutes, VSAN will synchronize the mirror copies. If the component doesn't come back, VSAN will create a new mirror copy (component). Note that you can decrease this time-out value by changing the advanced setting called “VSAN.ClomRepairDelay”. (Please consult the manual or support if you want to change this value!) If for whatever reason the component returns after VSAN has started resyncing, then VSAN will assess whether it makes more sense to bring the existing but outdated component back in sync or to continue the creation of the new component. On top of that, VSAN also has a “rebuild throttling / QoS” mechanism, which will throttle back replication traffic during a rebuild when it could impact virtual machine performance.
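The “absent” handling with its 60-minute timer can be summarized in a few lines of hypothetical Python. The real decision logic of course lives inside VSAN, and the cost comparison between resyncing a stale copy and finishing a new one is more subtle than this sketch suggests:

```python
# Hypothetical sketch of "absent" handling: a repair-delay timer starts when
# a component goes absent (e.g. after a host failure).

CLOM_REPAIR_DELAY_MIN = 60  # default timer; VSAN.ClomRepairDelay advanced setting

def absent_component_action(minutes_absent, came_back):
    """Decide between resyncing a returned copy and building a fresh one."""
    if came_back and minutes_absent <= CLOM_REPAIR_DELAY_MIN:
        return "resync existing copy"  # only the missed writes are copied over
    return "rebuild new mirror copy"   # full component recreation elsewhere

# Host comes back after 15 minutes: cheap resync of the stale mirror.
print(absent_component_action(15, came_back=True))   # resync existing copy

# Host never returns: after 60 minutes a new mirror is created.
print(absent_component_action(75, came_back=False))  # rebuild new mirror copy
```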
Easy right? I know, these new concepts can be difficult to grasp at first… Hence the reason I may sound somewhat repetitive at times, but this is valuable to know in my opinion. In the next article I will explain how the Isolation / Partition scenario works and include some HA logic in it. Before I forget, I want to thank Christian Dickmann (VSAN Dev Team) for reviewing this article.