stretched cluster

Removing stretched VSAN configuration?

Duncan Epping · Dec 15, 2015 ·

I had a question today about how to safely remove a stretched VSAN configuration without putting any of the workloads in danger. This is fairly straightforward, to be honest, though there are one or two things that are important. (For those wondering why you would want to do this: some customers played with this option, started loading workloads on top of VSAN, and then realized it was still running in stretched mode.) Here are the steps required:

1. Click on your VSAN cluster, go to Manage, and disable the stretched configuration.
   - This will remove the witness host, but will leave 2 fault domains intact.
2. Remove the two remaining fault domains.
3. Go to the Monitor section, click on Health, and check the “virtual san object health”. Most likely it will be “red” as the “witness components” have gone missing. VSAN will repair this automatically, by default after 60 minutes. We prefer to take step 4 as soon as possible after removing the fault domains, though!
4. Click “repair object immediately”. The witness components will now be recreated and the VSAN cluster will be healthy again.
5. Click “retest” after a couple of minutes.

By the way, that “repair object immediately” feature can also be used in the case of a regular host failure where “components” have gone absent. Very useful feature, especially if you don’t expect a host to return any time soon (hardware failure for instance) and have the spare capacity.

Virtual SAN Stretched Clustering demo

Duncan Epping · Sep 10, 2015 ·

Yesterday I posted about HA/DRS settings for VSAN Stretched Clustering, and I also posted an intro to 6.1 and all its new functionality, which includes stretched clustering. As part of our VMworld session Rawlinson Rivera recorded a nice demo. We figured we should share it with the world, so I added the voice-over so that at least it is clear what you are looking at and why certain things are configured in a specific way. I hope this demo shows how dead simple it is to configure VSAN stretched clustering, and how it handles a full site failure. Enjoy.

HA/DRS configuration with Virtual SAN Stretched Cluster environment

Duncan Epping · Sep 9, 2015 ·

This question is going to come sooner or later: how do I configure HA/DRS when I am running a Virtual SAN stretched cluster configuration? I described some of the basics of Virtual SAN stretched clustering in a what’s new for 6.1 post already; if you haven’t read it then I urge you to do so first. There are a couple of key things to know. First of all, the latency that can be tolerated between data sites is 5ms, and to the witness location ~100ms.

If you look at the picture below you can imagine that when a VM sits in Fault Domain A and is reading from Fault Domain B, it could incur a latency of 5ms for each read IO. From a performance perspective we would like to avoid this 5ms latency, so for stretched clusters we introduce the concept of read locality. We don’t have this in a non-stretched environment, as there the latency is microseconds and not milliseconds. Now this “read locality” is something we need to take into consideration when we configure HA and DRS.
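To make the read locality argument a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. The function name, the 0.5ms local read latency, and the assumption that reads are split evenly across the two replicas when there is no read locality are all mine and purely illustrative; only the 5ms inter-site figure comes from the text above.

```python
def expected_read_latency_ms(local_ms, inter_site_ms, remote_read_fraction):
    """Average read latency when a fraction of reads crosses the inter-site link.

    local_ms             -- latency of a read served in the VM's own site (assumed value)
    inter_site_ms        -- extra latency for a read served by the other site
    remote_read_fraction -- share of reads served by the remote site (0.0 to 1.0)
    """
    if not 0.0 <= remote_read_fraction <= 1.0:
        raise ValueError("remote_read_fraction must be between 0 and 1")
    return local_ms + remote_read_fraction * inter_site_ms


LOCAL_MS = 0.5        # hypothetical local read latency
INTER_SITE_MS = 5.0   # maximum tolerated latency between data sites

# Assumption: without read locality, half of the reads hit the remote replica.
without_locality = expected_read_latency_ms(LOCAL_MS, INTER_SITE_MS, 0.5)

# With read locality, all reads are served from the site where the VM runs.
with_locality = expected_read_latency_ms(LOCAL_MS, INTER_SITE_MS, 0.0)

print(f"without read locality: {without_locality} ms")
print(f"with read locality:    {with_locality} ms")
```

The exact numbers don't matter much; the point is that any fraction of reads going across sites adds milliseconds, which is why it matters where HA and DRS place a VM.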

VMworld Virtual SAN slidedecks up on slideshare

Duncan Epping · Sep 9, 2015 ·

I just posted the slidedecks that I presented at VMworld on Virtual SAN up on slideshare. The recording and the slides will probably at some point also show up on vmware.com but as I had many requests from people to share the material I figured I would do that straight after the event. If you have any questions don’t hesitate to ask.

Building a Stretched Cluster with Virtual SAN 6.1

Five common customer use cases for Virtual SAN

vMSC for 6.0, any new recommendations?

Duncan Epping · Apr 15, 2015 ·

I am currently updating the vSphere Metro Storage Cluster best practices white paper. Over the last two weeks I received various questions about whether there were any new recommendations for vMSC for 6.0. I have summarized the recommendations below for your convenience. The white paper is being reviewed and I am updating screenshots, so hopefully it will be done soon.

In order to allow vSphere HA to respond to both an APD and a PDL condition, vSphere HA needs to be configured in a specific way. VMware recommends enabling VM Component Protection. VM Component Protection needs to be enabled after the cluster has been created.
The configuration for PDL is basic. In the “Failure conditions and VM response” section you can configure what the response should be after a PDL condition is detected. VMware recommends setting this to “Power off and restart VMs”. When this condition is detected, a VM will be restarted instantly on a healthy host within the vSphere HA cluster.
When an APD condition is detected, a timer is started. After 140 seconds the APD condition is officially declared and the device is marked as APD timed out. When the 140 seconds have passed HA will start counting; the default HA time-out is 3 minutes. When the 3 minutes have passed HA will restart the impacted virtual machines, but you can configure VMCP to respond differently if desired. VMware recommends configuring it to “Power off and restart VMs (conservative)”.
- Conservative refers to the likelihood of HA being able to restart VMs. When set to “conservative” HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of “aggressive” HA will try to restart the VM even if it doesn’t know the state of the other hosts, which could lead to a situation where your VM is not restarted as there is no host that has access to the datastore the VM is located on.
It is also good to know that if the APD is lifted and access to the storage is restored before the time-out has passed, HA will not unnecessarily restart the virtual machine, unless you explicitly configure it to do so. If a response is desired even when the environment has recovered from the APD condition, then “Response for APD recovery after APD timeout” should be configured to “Reset VMs”. VMware recommends leaving this setting disabled.
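The APD timeline described above can be sketched as a small decision function. This is a simplified model in Python, not how HA is implemented: the function, the enum, and the parameter names are hypothetical, and I am assuming that the recovery response only applies when the APD is lifted after the 140 second APD time-out but before the 3 minute HA time-out has passed.

```python
from enum import Enum

APD_TIMEOUT_S = 140   # APD condition is declared after 140 seconds
HA_DELAY_S = 180      # default HA time-out of 3 minutes, counted after the APD time-out


class Action(Enum):
    NONE = "no action"
    RESET = "reset VM"
    RESTART = "power off and restart VM"


def vmcp_apd_outcome(apd_duration_s, reset_on_recovery=False,
                     conservative=True, other_host_known_ok=True):
    """Return the modeled HA response for an APD lasting apd_duration_s seconds.

    reset_on_recovery   -- models "Response for APD recovery after APD timeout" = "Reset VMs"
    conservative        -- True for "(conservative)", False for "(aggressive)"
    other_host_known_ok -- whether HA knows another host can restart the VM
    """
    if apd_duration_s < APD_TIMEOUT_S:
        # Storage came back before the APD time-out was declared.
        return Action.NONE
    if apd_duration_s < APD_TIMEOUT_S + HA_DELAY_S:
        # APD was declared, but access was restored before HA acted.
        return Action.RESET if reset_on_recovery else Action.NONE
    if conservative and not other_host_known_ok:
        # Conservative: only act when another host is known to be able to restart the VM.
        return Action.NONE
    return Action.RESTART


for seconds in (60, 200, 400):
    print(seconds, vmcp_apd_outcome(seconds).value)
```

With the defaults (conservative, recovery response disabled) this models the recommended configuration: a restart only happens once 140 + 180 = 320 seconds have passed without storage access.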