A couple of years ago I did a VMworld session with Cormac in which we discussed the top things everyone should know about vSAN. One of the items we covered was which policy changes trigger a rebuild. We tested the various situations and documented them. Two weeks ago a question about this came up on a VMware internal Slack channel, so I shared our findings. Since those tests are already a few years old, I wanted to make sure our documented findings were still valid, so I redid the tests.
Before I provide a table with the findings, I want to explain what I tested. I created a VM with a default policy, dumped a bunch of random data on the two VMDKs attached to the VM, and then changed the policy of the VM while it was running. After changing the policy I verified, through both the command line and the UI, whether a rebuild of the objects was occurring. In some cases a policy change does not require a rebuild, while in other cases it does. This, of course, depends on what is being changed within the policy, and what that means for the objects associated with the policy. Hopefully, you will find the table below useful.
From | To | Resync |
---|---|---|
RAID-1 | RAID-1 with higher FTT | Yes |
RAID-1 | RAID-1 with lower FTT | No |
RAID-1 | RAID-5/6 | Yes |
RAID-5/6 | RAID-1 | Yes |
RAID-5 | RAID-6 | Yes |
RAID-6 | RAID-5 | Yes |
Stripe width 1 | Stripe width increase by 1 (or more) | Yes |
Stripe width x | Stripe width decrease by 1 (or more) | Yes |
Space Reservation 0 | Increase to larger than 0 | No |
Space Reservation >= 1 | Increase by 1 (or more) | No |
Space Reservation > 0 | Decrease to 0 | No |
Read Cache 0 | Increase to larger than 0 | No |
Read Cache >= 1 | Increase by 1 (or more) | No |
Read Cache >= 1 | Decrease by 1 (or more) | No |
Checksum enabled | Checksum disabled | No |
Checksum disabled | Checksum enabled | Yes |
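
For anyone who wants to use these findings in a script, for example to flag a planned policy change before applying it, here is a minimal sketch that encodes the table as a lookup. The names `FINDINGS` and `triggers_resync` are made up for illustration and are not part of any vSAN tooling. The data is simply the rows of the table above, so it only reflects what I observed in these tests.

```python
# Rows copied from the table above: (from, to, resync observed).
# FINDINGS and triggers_resync are hypothetical names for this sketch.
FINDINGS = [
    ("RAID-1", "RAID-1 with higher FTT", True),
    ("RAID-1", "RAID-1 with lower FTT", False),
    ("RAID-1", "RAID-5/6", True),
    ("RAID-5/6", "RAID-1", True),
    ("RAID-5", "RAID-6", True),
    ("RAID-6", "RAID-5", True),
    ("Stripe width 1", "Stripe width increase by 1 (or more)", True),
    ("Stripe width x", "Stripe width decrease by 1 (or more)", True),
    ("Space Reservation 0", "Increase to larger than 0", False),
    ("Space Reservation >= 1", "Increase by 1 (or more)", False),
    ("Space Reservation > 0", "Decrease to 0", False),
    ("Read Cache 0", "Increase to larger than 0", False),
    ("Read Cache >= 1", "Increase by 1 (or more)", False),
    ("Read Cache >= 1", "Decrease by 1 (or more)", False),
    ("Checksum enabled", "Checksum disabled", False),
    ("Checksum disabled", "Checksum enabled", True),
]


def triggers_resync(before: str, after: str) -> bool:
    """Return True if the tested change caused a resync.

    Matching is case-insensitive. Raises KeyError for combinations
    that were not part of the test, rather than guessing.
    """
    key = (before.casefold(), after.casefold())
    for row_before, row_after, resync in FINDINGS:
        if (row_before.casefold(), row_after.casefold()) == key:
            return resync
    raise KeyError(f"untested policy change: {before!r} -> {after!r}")


if __name__ == "__main__":
    print(triggers_resync("RAID-1", "RAID-5/6"))
    print(triggers_resync("Checksum enabled", "Checksum disabled"))
```

Raising an error for untested combinations is deliberate: the table only covers what was tested, so anything else should be verified rather than assumed.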
James Doyle says
Hi Duncan,
Good post and very pertinent information.
By rebuild, though, do you mean resync? From my understanding, increasing a RAID-1 FTT value will trigger a resync operation, but not a rebuild. When you increase the FTT value, additional mirrors are added to the existing RAID tree, with data copied from the existing mirrors (i.e. a resync). However, the original mirrors are not removed at the end of the process. The only additional capacity required for this operation is the capacity for the extra mirror.
However, a change from RAID-1 to a RAID-5/6 layout requires an entirely new RAID structure to be built (a rebuild). This means that the original RAID tree remains in place and is marked as transient data, while data is copied to the new RAID tree structure. At the end of the process, the original RAID tree is removed. This means that you need additional capacity for a complete new RAID structure before the operation is started. This has a significant impact if you plan to perform this on multiple objects at the same time.
I think it might be a good idea to update the table to show which operations require transient capacity during the rebuild, as well as operations which create resync traffic on the backend.
Duncan Epping says
Ah yes, I should have fixed the table with “Resync” instead of rebuild indeed. And I was planning on having a row with an explanation around what happens, but didn’t get to it.
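
To put rough numbers on the capacity point raised in the comment above, here is a small sketch. It rests on several assumptions that go beyond what was tested here: the raw-capacity multipliers are the commonly cited ones (RAID-1 stores FTT+1 full copies, RAID-5 is roughly 1.33x, RAID-6 is 1.5x), witness components are ignored, and `size_gb` stands for the data actually written to the object. The function name and structure are made up for illustration. Treat the result as a back-of-the-envelope estimate, not as what vSAN will report.

```python
# Assumed raw-capacity multipliers per (layout, FTT). These are the
# commonly cited figures, not values measured in the tests above.
OVERHEAD = {
    ("RAID-1", 1): 2.0,
    ("RAID-1", 2): 3.0,
    ("RAID-1", 3): 4.0,
    ("RAID-5", 1): 4 / 3,
    ("RAID-6", 2): 1.5,
}


def extra_capacity_during_change_gb(size_gb: float, old: tuple, new: tuple) -> float:
    """Estimate the extra raw capacity needed while a policy change runs.

    Simplified model following the comment above:
    - RAID-1 to RAID-1 with a higher FTT only adds mirrors.
    - Any layout change builds a complete new RAID tree next to the
      original one, which is removed only when the change completes.
    """
    if old == new:
        return 0.0
    old_raid, old_ftt = old
    new_raid, new_ftt = new
    if old_raid == new_raid == "RAID-1":
        # Only the additional mirrors need space; lowering FTT needs none.
        return size_gb * max(new_ftt - old_ftt, 0)
    # Layout change: room for the full new structure is required.
    return size_gb * OVERHEAD[new]


if __name__ == "__main__":
    size = 100.0  # GB of written data, hypothetical example value
    print(extra_capacity_during_change_gb(size, ("RAID-1", 1), ("RAID-1", 2)))
    print(round(extra_capacity_during_change_gb(size, ("RAID-1", 1), ("RAID-5", 1)), 2))
    print(extra_capacity_during_change_gb(size, ("RAID-5", 1), ("RAID-6", 2)))
```

Note the difference between the two cases: the extra mirror stays in place after an FTT increase, while the space consumed by the old RAID tree is released once a layout change has finished.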