Sooner or later this question will come up: how do I configure HA and DRS when I am running a Virtual SAN stretched cluster? I described some of the basics of Virtual SAN stretched clustering in a what’s new for 6.1 post already; if you haven’t read it, I urge you to do so first. There are a couple of key things to know. First of all, the latency that can be tolerated between the data sites is 5ms, and to the witness location ~100ms.
If you look at the picture below you can imagine that when a VM sits in Fault Domain A and is reading from Fault Domain B, it could incur a latency of 5ms for each read IO. From a performance perspective we would like to avoid this 5ms latency, so for stretched clusters we introduce the concept of read locality. We don’t have this in a non-stretched environment, as there the latency is measured in microseconds rather than milliseconds. This “read locality” is something we need to take into consideration when we configure HA and DRS.
When it comes to vSphere HA the configuration is straightforward. We need to ensure we have a “local” isolation address for both “active data sites”, so that when a site partition occurs a host in either site can still ping an address that is local to it. This can be achieved by adding the following advanced settings to vSphere HA:
- das.isolationaddress0 = 192.168.1.1
- das.isolationaddress1 = 192.168.2.1
When you do this I would also recommend disabling the use of the default gateway as an isolation address. You can do this by setting the following advanced setting:
- das.usedefaultisolationaddress = false
So far, straightforward. I would recommend setting the Isolation Response to “power off and restart VM”. The main reason is that when the HA network is isolated, chances are extremely high that your storage (VSAN) network is isolated as well, and the VM will no longer be able to access its disk.
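To make the HA advanced settings concrete, here is a minimal Python sketch that assembles them as key/value pairs and checks that each data site has a valid isolation address. It does not talk to vCenter. The site names and the function name are made up for illustration, the addresses are the example values used in this post, and `das.usedefaultisolationaddress` is the HA advanced option that controls whether the default gateway is used as an isolation address.

```python
import ipaddress

# Hypothetical site names; the addresses are the example values from this post.
SITE_ISOLATION_ADDRESSES = {
    "site-a": "192.168.1.1",
    "site-b": "192.168.2.1",
}


def build_ha_advanced_options(site_addresses):
    """Return HA advanced options with one isolation address per data site."""
    if len(site_addresses) < 2:
        raise ValueError("expected an isolation address for each of the two data sites")
    options = {}
    for index, (_site, address) in enumerate(sorted(site_addresses.items())):
        ipaddress.ip_address(address)  # raises ValueError on a malformed address
        options[f"das.isolationaddress{index}"] = address
    # Stop HA from using the default gateway as an isolation address.
    options["das.usedefaultisolationaddress"] = "false"
    return options


if __name__ == "__main__":
    for key, value in build_ha_advanced_options(SITE_ISOLATION_ADDRESSES).items():
        print(f"{key} = {value}")
```

The resulting pairs are what you would enter under the vSphere HA advanced options for the cluster.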
I would also recommend configuring HA to respect the VM/Host affinity rules. This can be done in the UI these days; the option is called “Respect VM to Host affinity rules during failover”. Make sure this is enabled. And of course there is Admission Control. Make sure that it is set to 50% for CPU and 50% for memory; anything lower than that means you risk VMs not being restarted during a full site failure. Note that this is not a resource guarantee, just a restart guarantee!
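Where the 50% comes from can be shown with a little arithmetic: to restart everything after losing a whole site, you have to reserve at least that site's share of the total cluster capacity. The sketch below is my own illustration under that assumption; the function name and the capacity numbers are hypothetical, and it is not how vSphere computes admission control internally.

```python
def failover_reserve_percent(site_capacities):
    """Percentage of cluster capacity to reserve so the largest site can fail."""
    total = sum(site_capacities.values())
    if total <= 0:
        raise ValueError("cluster has no capacity")
    largest = max(site_capacities.values())
    return 100.0 * largest / total


if __name__ == "__main__":
    # Two equally sized sites, e.g. 4 hosts each (hypothetical numbers).
    print(failover_reserve_percent({"site-a": 4, "site-b": 4}))
    # An uneven 6/4 split would call for a larger reservation.
    print(failover_reserve_percent({"site-a": 6, "site-b": 4}))
```

With two equally sized sites this works out to 50%, which matches the recommendation for both CPU and memory.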
Next stop, DRS. DRS configuration is straightforward: it can be set to “fully automated”. To avoid incurring up to 5ms of latency on a read, we highly recommend implementing VM/Host should rules for each of the sites. Create a host group for each site containing all the hosts in that particular site. Then create VM groups for the VMs which will be tied to those sites, with 50% of the VMs in each site. Also, make sure that when you define the rule it is a “should” rule. A “must” rule will not be violated, and that can result in an unpleasant situation when you need to do a full site fail-over.
When there is a full site failure or a partition, I would recommend setting DRS to “partially automated” during the outage, so that VMs are not migrated back during the “resynchronisation” stage once the failure has been resolved. I would prefer to move them back after resyncing has completed. Note that when you migrate the VMs back to the original site the cache will need to be rewarmed, and as such I would prefer to migrate the VMs during maintenance hours. That said, the performance impact won’t be huge, and I can see why people would want to move back to the original site as soon as it is up. It is not a requirement, so that is up to you to decide.
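The grouping described above can be modelled in a few lines of Python. This is only a sketch of the idea, not the vSphere API: the `SiteRule` class, the `build_site_rules` function and the host and VM names are all hypothetical, and the round-robin split is just one simple way to get roughly half of the VMs in each site. In practice you would choose the VM placement deliberately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRule:
    name: str
    vm_group: tuple
    host_group: tuple
    mandatory: bool = False  # False models a "should" rule, True a "must" rule


def build_site_rules(hosts_by_site, vms):
    """Split VMs round-robin across sites and pair each VM group with its site's hosts."""
    sites = sorted(hosts_by_site)
    if not sites:
        raise ValueError("at least one site is required")
    rules = []
    for index, site in enumerate(sites):
        # Every len(sites)-th VM starting at this site's index goes to this site.
        vm_group = tuple(vms[index::len(sites)])
        rules.append(SiteRule(
            name=f"{site}-vms-should-run-on-{site}-hosts",
            vm_group=vm_group,
            host_group=tuple(hosts_by_site[site]),
        ))
    return rules


if __name__ == "__main__":
    hosts = {"site-a": ["esx01", "esx02"], "site-b": ["esx03", "esx04"]}
    for rule in build_site_rules(hosts, ["vm1", "vm2", "vm3", "vm4"]):
        kind = "must" if rule.mandatory else "should"
        print(f"{rule.name} ({kind}): {rule.vm_group} -> {rule.host_group}")
```

Keeping `mandatory` at `False` mirrors the advice to use “should” rules, so that a full site fail-over is still allowed to restart the VMs on the other site's hosts.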
All fairly straightforward and very similar to the vMSC configuration for 6.x. The only difference is that with Virtual SAN you don’t configure VM Component Protection, which is something you need to do for vMSC. All in all, VSAN Stretched Clustering is fairly straightforward and can literally be configured in minutes, and that includes HA and DRS!