
Yellow Bricks

by Duncan Epping



Mixing versions of ESXi in the same vSphere / vSAN cluster?

Duncan Epping · Apr 15, 2019 ·

I have seen this question asked a couple of times in the past months, and to be honest I was a bit surprised people asked about it. Various customers were wondering whether it is supported to mix versions of ESXi in the same vSphere or vSAN cluster. I can be short about whether this is supported or not: yes, it is, but only for short periods of time (72 hours max). Would I recommend it? No, I would not!

Why not? Well, mainly for operational reasons: it just makes life more complex. Just think about a troubleshooting scenario: you now need to remember which version you are running on which host and understand the “known issues” for each version. For vSAN things are even more complex, as you could have “components” running on different versions of ESXi. On top of that, it could even be the case that a certain command or esxcli namespace is not available on a particular version of ESXi.

Another concern is upgrades and updates: you need to take the current version of each host into account when updating, or more importantly when upgrading! Also remember that the supported firmware/driver combination may differ per version of vSphere/vSAN; this too makes life more complex and definitely increases the chance of mistakes!
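As a minimal sketch of how you could spot this kind of version drift: assume you collected the output of `vmware -v` from every host in the cluster (the three sample lines below are made up; in real life you would gather these over SSH or via PowerCLI) and simply count the distinct versions.

```shell
#!/bin/sh
# Spot ESXi version drift in a cluster. The three sample lines stand in for
# the `vmware -v` output of each host; they are placeholders, not real hosts.
versions='VMware ESXi 6.7.0 build-13006603
VMware ESXi 6.7.0 build-13006603
VMware ESXi 6.5.0 build-10884925'

# Count distinct version strings; more than one means a mixed cluster.
unique=$(printf '%s\n' "$versions" | sort -u | wc -l | tr -d ' ')
if [ "$unique" -gt 1 ]; then
  echo "WARNING: $unique different ESXi versions in this cluster"
else
  echo "OK: all hosts report the same ESXi version"
fi
```

With the sample data above this prints the warning, which is exactly the situation you want to keep to a 72-hour maximum.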

Is this documented anywhere? Yes, check out the following KB:

  • https://kb.vmware.com/s/article/2146381

vSAN Stretched Cluster failure scenarios and component votes

Duncan Epping · Apr 3, 2019 ·

I was at a customer last week and got an interesting question about the vSAN voting mechanism. This customer had a stretched cluster and used RAID-5 within each location to protect the data, on top of replicating across locations. During certain failure scenarios the data unexpectedly remained available. Of course, it is great to have higher availability than expected, but why did this happen? What this customer tested was powering off the Witness (which is deemed a site failure) and next powering off 2 hosts in 1 location, which exceeds the “failures to tolerate” within a single location. You would expect, based on all documentation so far, that the data would be unavailable. Well, for some VMs this was the case, but for others it was not. Why? It is all about the vote count. Look at the below diagram and note the number of votes for each component first.

In the above scenario, if the Witness (W) fails we lose 4 votes. Out of a total of 13 that is not a problem. If two additional hosts fail, this is most likely still not a problem, even though you are exceeding the specified “failures to tolerate”. However, if Host1 happens to be one of those failed hosts, you would lose quorum: Host1 holds a component with 2 votes. So if Host1 has failed, the Witness has failed, and, for instance, Host2 as well, you have now lost 7 out of 13 votes, which means quorum is lost. Please note that the placement of the single component with 2 votes is random; for a different VM/object it could be the component on Host6 or Host7 that has 2 votes.

Another thing to point out: if Host5 through Host8 all failed, the data would still be available. However, if Host3 and Host4 then also failed, the object would become unavailable. Even though you would still have quorum across locations, you have now also exceeded the specified “failures to tolerate” within the location. This is also taken into account.
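The vote arithmetic above can be sketched in a few lines. The counts (4 votes for the witness, 2 for the Host1 component, 1 for each of the remaining components, 13 in total) are taken from the diagram; as noted, the 2-vote component may sit on a different host for another object.

```shell
#!/bin/sh
# Vote bookkeeping for the failure scenario described above.
total=13
quorum=$(( total / 2 + 1 ))   # strict majority: 7 of 13 votes

# Witness fails (4 votes), plus Host1 (2 votes), plus Host2 (1 vote)
lost=$(( 4 + 2 + 1 ))
remaining=$(( total - lost ))

if [ "$remaining" -ge "$quorum" ]; then
  echo "quorum held: $remaining of $total votes remain"
else
  echo "quorum lost: $remaining of $total votes remain"
fi
```

Run as-is this prints “quorum lost: 6 of 13 votes remain”; swap Host1’s 2 votes for a 1-vote host and 7 of 13 votes remain, so the object stays accessible even though “failures to tolerate” is exceeded.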

I hope that helps.

How to test failure scenarios!

Duncan Epping · Mar 14, 2019 ·

Almost on a weekly basis I get a question about unexpected results during the testing of certain failure scenarios. I usually ask first if there’s a diagram that shows the current configuration. The answer is usually no. Then I ask if they have a failure testing matrix that describes the failures they are introducing, the expected result, and the actual result. As you can guess, the answer is usually “euuh, a what?”. This is where the problem usually begins, and it usually gets worse when customers try to mimic a certain failure scenario.

What would I do if I had to run through failure scenarios? When I was a consultant we always started with the following:

  • Document the environment, including all settings and the “why”
  • Create architectural diagrams
  • Discuss which types of scenarios would need to be tested
  • Create a failure testing matrix that includes the following:
    • Type of failure
    • How to create the scenario
      • Preferably include diagrams per scenario displaying where the failure is introduced
    • Expected outcome
    • Observed outcome
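The matrix above does not need fancy tooling; a sketch of keeping it as a plain CSV file could look like this (the column names mirror the bullets, and the example row and file name are placeholders):

```shell
#!/bin/sh
# Skeleton of a failure testing matrix as CSV. The columns follow the bullet
# list above; the single example row is a placeholder to be filled per test.
cat <<'EOF' > failure-matrix.csv
failure_type,how_to_create,diagram,expected_outcome,observed_outcome
Site partition,Disable links Site-A/Site-C and Site-A/Site-B,per-scenario diagram,VMs restarted in secondary site,
EOF

# One header line plus one scenario row:
wc -l < failure-matrix.csv
```

The observed_outcome column is deliberately left empty in the template; it gets filled in during the actual test run, which is what makes the matrix useful afterwards.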

What I would normally also do is describe, in the expected outcome section, the theory around what should happen. Maybe I should just give an example of a failure and, more or less, how I would describe it.

Type Failure: Site Partition

How to: Disable links between Site-A / Site-C and Site-A / Site-B

Expected outcome: The secondary location will bind itself with the witness and will gain ownership of all components. In the preferred location quorum is lost, so all VMs will appear as inaccessible, and vSAN will terminate all VMs in the preferred location. From an HA perspective this is a partition and not an isolation, as all hosts in Site-A can still communicate with each other. In the secondary location vSphere HA will notice hosts are missing and will validate which VMs are still running and which are not. All VMs which are not running, and which have accessible components, will be restarted in the secondary location.

Observed outcome: The observed outcome was similar to the expected outcome. It took 1 minute and 30 seconds before all 20 test VMs were restarted.

In the above example I took a very basic approach and didn’t even go into the level of depth you probably should. I would, for instance, include the network infrastructure as well and specify exactly where the failure occurs, as this will definitely help during troubleshooting when you need to explain a particular unexpected behavior. In many cases a site partition is simulated by disabling NICs on a host, by closing certain firewall ports, or by disabling a VLAN. But can you really compare that to a situation where the fiber between two locations is damaged by excavations? No, you cannot. Unfortunately this happens very frequently: people (incorrectly) mimic certain failures and end up in a situation where the outcome is different than expected, usually because the failure being introduced is different from the failure that was described. If that is the case, should you still expect the same outcome? You probably should not.

Yes I know, no one likes to write documentation and it is much more fun to test things and see what happens. But without recording the above, a successful implementation is almost impossible to guarantee. What I can guarantee though is that when something fails in production, you most likely will not see the expected behavior when you have not tested the various failure scenarios. So please take the time to document and test, it is probably the most important step of the whole process.

Changed advanced setting VSAN.ClomRepairDelay and upgrading to 6.7 u1? Read this…

Duncan Epping · Feb 6, 2019 ·

If you changed the advanced setting VSAN.ClomRepairDelay to anything other than the default 60 minutes, there’s a caveat during the upgrade to 6.7 U1 you need to be aware of: the setting is reset, meaning the value is configured once again to 60 minutes. It was reported on Twitter by Justin Bias this week, and I tested it in the lab and indeed experienced the same behavior. I set my value to 90 and after an upgrade from 6.7 to 6.7 U1 the below was the result.

Why did this happen? Well, in vSAN 6.7 U1 we introduced a new global cluster-wide setting. On a cluster level under “Configure >> vSAN >> Services” you now have the option to set the “Object Repair Time” for the full cluster, instead of doing this on a host-by-host basis. Hopefully this will make your life a bit easier.

Note that when you make the change globally it appears that the Advanced Settings UI is not updated automatically. The change is however committed to the host, this is just a UI bug at the moment and will be fixed in a future release.
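If you want to double-check what a host is actually using after the upgrade, or restore a non-default value per host, this can be done with esxcli on the ESXi host itself (the value 90 below is just my lab example, not a recommendation):

```shell
# Show the value the host is actually using for the repair delay:
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# Restore a non-default value on this host, e.g. 90 minutes:
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 90
```

On 6.7 U1 and later the cluster-level “Object Repair Time” setting is of course the preferred way to manage this; the per-host commands are mainly useful for verifying what actually landed on each host.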

Free E-Book: Operationalizing VMware vSAN

Duncan Epping · Jan 17, 2019 ·

A while ago my colleague and friend Kevin Lees reached out and asked me if I could go over some material he wrote together with Paul Wiggett, and whether I would be willing to write a foreword. When Kevin sent the document over I literally finished it within a day. What I enjoyed most about this vSAN book was that it isn’t a deep dive drilling down on technology; instead it discusses the people/process aspect of things. This is an area which is often overlooked, and definitely an area that deserves more attention when people are looking to adopt software-defined storage, or the software-defined data center for that matter. Thanks Paul/Kevin for the opportunity to write the foreword; I just downloaded my free copy and I have to say it looks great.

If you are interested, the book can be downloaded for free through the VMware Virtual Blocks blog, simply go here and download your copy.

 

