Yellow Bricks

by Duncan Epping


vmsc

How to test failure scenarios!

Duncan Epping · Mar 14, 2019 ·

Almost on a weekly basis, I get a question about unexpected results during the testing of certain failure scenarios. I usually ask first whether there is a diagram that shows the current configuration. The answer is usually no. I then ask whether they have a failure testing matrix that describes the failures they are introducing, the expected result, and the actual result. As you can guess, the answer tends to be “euuh, a what?”. This is where the problem begins, and it gets worse when customers try to mimic a certain failure scenario.

What would I do if I had to run through failure scenarios? When I was a consultant we always started with the following:

  • Document the environment, including all settings and the “why”
  • Create architectural diagrams
  • Discuss which types of scenarios would need to be tested
  • Create a failure testing matrix that includes the following:
    • Type of failure
    • How to create the scenario
      • Preferably include diagrams per scenario displaying where the failure is introduced
    • Expected outcome
    • Observed outcome

What I would normally also do is describe, in the expected outcome section, the theory around what should happen. Let me give an example of a failure and roughly how I would describe it.

Type of failure: Site Partition

How to: Disable links between Site-A / Site-C and Site-A / Site-B

Expected outcome: The secondary location will bind itself with the witness and will gain ownership over all components. In the preferred location the quorum is lost, and as such all VMs will appear as inaccessible; vSAN will terminate all VMs in the preferred location. From an HA perspective, however, this is a partition and not an isolation, as all hosts in Site-A can still communicate with each other. In the secondary location vSphere HA will notice that hosts are missing. It will validate which of the VMs that were running are still running and which are not. All VMs that are not running and have accessible components will be restarted in the secondary location.

Observed outcome: The observed outcome was similar to the expected outcome. It took 1 minute and 30 seconds before all 20 test VMs were restarted.
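
Purely as an illustration, and not part of any VMware tooling, such a matrix entry could also be captured in a structured way, for instance in a few lines of Python:

    # Minimal sketch of a failure testing matrix entry; the field names simply
    # mirror the bullets listed earlier in this post.
    from dataclasses import dataclass

    @dataclass
    class FailureScenario:
        failure_type: str           # type of failure
        how_to_create: str          # how the failure is introduced (ideally with a diagram)
        expected_outcome: str       # what should happen, including the theory behind it
        observed_outcome: str = ""  # filled in during testing

    scenario = FailureScenario(
        failure_type="Site Partition",
        how_to_create="Disable links between Site-A / Site-C and Site-A / Site-B",
        expected_outcome="Secondary site binds with the witness and gains ownership; "
                         "VMs in the preferred site become inaccessible, are terminated, "
                         "and are restarted by vSphere HA in the secondary site.",
    )
    scenario.observed_outcome = "As expected; all 20 test VMs restarted within 1m30s."
    print(scenario)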

In the above example I took a very basic approach and didn’t even go into the level of depth you probably should. I would, for instance, include the network infrastructure as well and specify exactly where the failure occurs, as this will definitely help during troubleshooting when you need to explain why you are observing particular unexpected behavior. In many cases a site partition is simulated by, for instance, disabling NICs on a host, closing certain firewall ports, or disabling a VLAN. But can you really compare that to a situation where the fiber between two locations is damaged by excavation work? No, you cannot compare those two scenarios. Unfortunately this happens very frequently: people (incorrectly) mimic certain failures and end up in a situation where the outcome is different from what they expected, usually because the failure being introduced is also different from the failure that was described. If that is the case, should you still expect the same outcome? You probably should not.

Yes I know, no one likes to write documentation, and it is much more fun to test things and see what happens. But without recording the above, a successful implementation is almost impossible to guarantee. What I can guarantee, though, is that when something fails in production, you most likely will not see the expected behavior if you have not tested the various failure scenarios. So please take the time to document and test; it is probably the most important step of the whole process.

New whitepaper available: vSphere Metro Storage Cluster Recommended Practices (6.5 update)

Duncan Epping · Oct 24, 2017 ·

I had many requests for an updated version of this paper, so over the past couple of weeks I have been working hard on it. The paper was outdated, as it was last updated around the vSphere 6.0 timeframe, and even that was only a minor update. I looked at every single section and added new statements and guidance, for instance around vSphere HA Restart Priority. So for those running a vSphere Metro Storage Cluster / stretched cluster of some kind, please read the brand new vSphere Metro Storage Cluster Recommended Practices (6.5 update) white paper.

It is available on storagehub.vmware.com as a PDF and for reading within your browser. If you have any questions or comments, please do not hesitate to leave them here.

  • vSphere Metro Storage Cluster Recommended Practices online
  • vSphere Metro Storage Cluster Recommended Practices PDF

 

VMs not getting killed after vMSC partition has lifted

Duncan Epping · Jan 12, 2017 ·

I was talking to a VMware partner over the past couple of weeks about challenges they had in a new vSphere Metro Storage Cluster (vMSC) environment. In their particular case they simulated a site partition. During the site partition three things were expected to happen:

  • VMs that were impacted by APD (or PDL) should be killed by vSphere HA Component Protection
    • If HA Component Protection does not work, vSphere should kill the VMs when the partition is lifted
  • VMs should be restarted by vSphere HA

The problems faced were two-fold. The VMs were restarted by vSphere HA, however:

  • vSphere HA Component Protection did not kill the VMs
  • When the partition was lifted vSphere did not kill the VMs which had lost the lock to the datastore either

It took a while before we figured out what was going on, at least for one of the problems. Let’s start with the second problem: why aren’t the VMs killed when the partition is lifted? vSphere should do this automatically. Well, vSphere does do this automatically, but only when there is a guest operating system installed and an I/O is issued. As soon as an I/O is issued by the VM, vSphere will notice that the lock on the disk has been lost and obtained by another host, and it will kill the VM. If you have an “empty VM” this won’t happen, as there will not be any I/O to the disk. (I’ve filed a feature request to kill such VMs as well, even without disk I/O or without a disk.) So how do you solve this? If you do any type of vSphere HA testing (with or without vMSC), make sure to install a guest OS so the test resembles real life.
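
To make that behaviour explicit, here is a toy model of the decision as I described it above; it is purely illustrative and not a VMware API:

    # Toy model of the behaviour described above: after the partition is lifted,
    # a host only kills a VM whose disk lock has been taken over by another host
    # once that VM actually issues an I/O.
    def host_kills_vm(lock_lost_to_other_host: bool, vm_issues_io: bool) -> bool:
        return lock_lost_to_other_host and vm_issues_io

    # An "empty" VM without a guest OS never issues disk I/O, so it is never killed.
    assert host_kills_vm(lock_lost_to_other_host=True, vm_issues_io=False) is False
    assert host_kills_vm(lock_lost_to_other_host=True, vm_issues_io=True) is True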

Now back to the first problem. The fact that vSphere HA Component Protection does not kick in is still being debated, but I think there is a very specific reason for it. vSphere HA Component Protection is a feature that kills VMs on a host so they can be restarted when an APD or a PDL scenario has occurred. However, it will only do this when:

  • It is certain the VM can be restarted on the other side (Conservative setting)
  • There are healthy hosts in the other partition, or their state is unknown (Aggressive setting)

The first one is clear I guess (more info about this here), but what does the second one mean? Well, basically there are three options:

  • Availability of healthy host: Yes >> Terminate
  • Availability of healthy host: No >> Don’t Terminate
  • Availability of healthy host: Unknown >> Terminate

So in the case where you have VMCP set to “Aggressively” fail over VMs, it will only do so when it knows hosts are available in the other site or when it does not know the state of the hosts in the other site. If for whatever reason the hosts are deemed unhealthy, the answer to the question whether there are healthy hosts available will be “No”, and as such the VMs will not be killed by VMCP. The question remains why these hosts are reported as “unhealthy” in this partition scenario; that is something we are now trying to figure out. Potentially it could be caused by misconfigured heartbeat datastores, but this is still to be confirmed. If I know more, I will update this article.
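
The decision table above can be summarized in a few lines of Python. This is purely a sketch of the logic as described in this post, not VMware code:

    # A minimal sketch of the VMCP termination decision described above.
    # The policy names and the three-way health answer come from this post;
    # the function itself is illustrative only.
    from enum import Enum

    class HostHealth(Enum):
        YES = "healthy hosts available in the other partition"
        NO = "no healthy hosts available"
        UNKNOWN = "health of the other partition is unknown"

    def vmcp_terminates_vm(policy: str, other_partition: HostHealth) -> bool:
        """Return True if VMCP would terminate the affected VM."""
        if policy == "conservative":
            # Conservative: only terminate when a restart is known to be possible.
            return other_partition is HostHealth.YES
        if policy == "aggressive":
            # Aggressive: terminate when hosts are healthy or their state is unknown,
            # but not when they are positively reported as unhealthy.
            return other_partition in (HostHealth.YES, HostHealth.UNKNOWN)
        raise ValueError(f"unknown policy: {policy}")

    # The scenario in this post: aggressive policy, but the hosts on the other
    # site were (incorrectly) reported as unhealthy, so no VMs were killed.
    assert vmcp_terminates_vm("aggressive", HostHealth.NO) is False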

Update: I just received confirmation from development that heartbeat datastores need to be available in both sites for vSphere HA to identify this scenario correctly. If there are no heartbeat datastores available in both sites, it could happen that no hosts are marked as healthy, which means that VMCP will not instantly kill those VMs when the APD has occurred.

vMSC and Disk.AutoremoveOnPDL on vSphere 6.x and higher

Duncan Epping · Mar 21, 2016 ·

I have discussed this topic a couple of times and want to inform people about a recent change in recommendation. In the past, when deploying a stretched cluster (vMSC), most storage vendors and VMware recommended setting Disk.AutoremoveOnPDL to 0. This basically disabled the feature that automatically removes LUNs which are in a PDL (permanent device loss) state; upon return of the device, a rescan would then allow you to use the device again. With vSphere 6.0, however, there has been a change to how vSphere responds to a PDL scenario: vSphere does not expect the device to return. To be clear, the PDL behaviour in vSphere was designed around the removal of devices; they should not stay in the PDL state and return for duty. This did work in previous versions, but only due to a bug.

With vSphere 6.0 and higher VMware recommends setting Disk.AutoremoveOnPDL to 1, which is the default setting. If you are a vMSC / stretched cluster customer, please change your environment and design accordingly. But before you do, please consult your storage vendor and discuss the change. I would also recommend testing the change and the behaviour to validate that the environment returns for duty correctly after a PDL! Sorry about the confusion.

[Screenshot: the Disk.AutoremoveOnPDL advanced setting]

The KB article backing my recommendation was just posted: https://kb.vmware.com/kb/2059622. The documentation (vMSC whitepaper) is also being updated.
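
For those who want to verify the setting across their hosts, below is a small pyVmomi sketch that reads Disk.AutoremoveOnPDL on every connected host. The vCenter address and credentials are placeholders, and the OptionManager / QueryOptions usage is my assumption based on the public vSphere SDK, so treat it as a starting point rather than a supported tool. Changing the value per host can also be done with esxcli, as shown in the comment.

    # Illustrative pyVmomi sketch (assumed SDK usage, verify in your own environment):
    # report the current Disk.AutoremoveOnPDL value on every connected ESXi host.
    # To change it per host, the equivalent esxcli command is:
    #   esxcli system settings advanced set -o /Disk/AutoremoveOnPDL -i 1
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()        # lab only; use valid certificates in production
    si = SmartConnect(host="vcenter.lab.local",   # placeholder vCenter and credentials
                      user="administrator@vsphere.local",
                      pwd="VMware1!",
                      sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            value = host.configManager.advancedOption.QueryOptions(name="Disk.AutoremoveOnPDL")[0].value
            print(f"{host.name}: Disk.AutoremoveOnPDL = {value}")
        view.Destroy()
    finally:
        Disconnect(si)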

SMP-FT and (any type of) stretched storage support

Duncan Epping · Jan 19, 2016 ·

I had a question today around support for SMP-FT in an EMC VPLEX environment. It is well known that SMP-FT isn’t supported in a stretched VSAN environment, but what about other types of stretched storage? Is that a VSAN-specific constraint? After all, (legacy) FT appears to be supported for VPLEX and other types of stretched storage.

SMP-FT is not supported in a vSphere Metro Storage Cluster environment either! This has not been qualified yet. I have asked the FT team to at least put it on the roadmap and to document the maximum latency tolerated for SMP-FT in these types of environments, just in case someone wants to use it in a campus situation for instance, despite the high bandwidth requirements for SMP-FT. Note that “legacy FT” can be used in a vMSC environment, but not with VSAN. In order to use legacy FT (single vCPU) you will need to set an advanced VM setting: vm.uselegacyft. Make sure to set this when using FT in a stretched environment!
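
As a hedged illustration of that last point, the snippet below adds vm.uselegacyft to a VM’s advanced settings (extraConfig) through pyVmomi. Connecting to vCenter and locating the VM object is omitted, and the value “true” is my assumption, so confirm the exact value and procedure in the VMware documentation for your release.

    # Illustrative pyVmomi sketch (assumed SDK usage, verify before relying on it):
    # add the vm.uselegacyft advanced setting to a VM so legacy (single vCPU) FT is used.
    from pyVmomi import vim

    def enable_legacy_ft_flag(vm: vim.VirtualMachine) -> vim.Task:
        """Append vm.uselegacyft=true to the VM's extraConfig via a reconfigure task.
        The value "true" is an assumption; check VMware's documentation for your release."""
        spec = vim.vm.ConfigSpec(
            extraConfig=[vim.option.OptionValue(key="vm.uselegacyft", value="true")]
        )
        return vm.ReconfigVM_Task(spec=spec)  # wait for the task before enabling FT on the VM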
