
Yellow Bricks

by Duncan Epping



Disable VSAN site locality in low latency stretched cluster

Duncan Epping · Jan 15, 2016 ·

This week I was talking to a customer in Germany who had deployed a VSAN stretched cluster within a single building. As everything was within one building (extremely low latency) and they preferred a very simple operational model, they decided not to implement any VM/Host rules. By default, when a stretched cluster is deployed in VSAN (ROBO deployments use this workflow as well), “site locality” is implemented for caching. This means that a VM has its read cache on the host that holds its components in the site where the VM is located.

This is great of course, as it avoids incurring a latency hit for reads. In some cases, however, you may not want this behaviour, for instance in the situation above where there is an extremely low latency connection between the different rooms of the same building. Because no VM/Host rules are implemented there, a VM can roam freely around the cluster, and every time it moves between VSAN fault domains the cache needs to be rewarmed, as reads are only served locally. Fortunately you can easily disable this behaviour through the advanced setting called DOMOwnerForceWarmCache:

[root@esxi-01:~] esxcfg-advcfg -g /VSAN/DOMOwnerForceWarmCache
Value of DOMOwnerForceWarmCache is 0
[root@esxi-01:~] esxcfg-advcfg -s 1 /VSAN/DOMOwnerForceWarmCache
Value of DOMOwnerForceWarmCache is 1

In a stretched environment you will see that this setting is 0; set it to 1 to disable site locality. In a ROBO environment VM migrations are uncommon, but if they do happen on a regular basis you may want to look into changing this setting as well.
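Note that this is a per-host advanced setting, so it needs to be applied to every host in the cluster. If you don’t want to do that one host at a time, a simple loop over SSH does the trick. A minimal sketch, assuming SSH is enabled on the hosts; the hostnames below are placeholders for your own ESXi hosts:

# Disable site locality on all hosts in the cluster;
# the hostnames are placeholders for your environment
for host in esxi-01 esxi-02 esxi-03 esxi-04; do
  ssh root@${host} 'esxcfg-advcfg -s 1 /VSAN/DOMOwnerForceWarmCache'
done

Afterwards, verify the value on each host with esxcfg-advcfg -g as shown above.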

Jumbo Frames and VSAN Stretched Cluster configurations

Duncan Epping · Dec 22, 2015 ·

I received a question last week from a customer who had implemented a stretched VSAN cluster. After the implementation, the Health Check indicated that there was an “issue” with the MTU configuration. The customer explained that he had configured an MTU of 9000 between the two data sites and the default MTU of 1500 between the data sites and the witness.

The question, of course, was why the Health Check flagged an issue. The problem is that in today’s version of Virtual SAN, witness traffic and data traffic use the same VMkernel interface. If the VSAN VMkernel interface on the “data” sites is configured for an MTU of 9000 and the one on the “witness” site for 1500, there is a mismatch, which causes fragmentation and other problems. That is what the health check calls out: VSAN (and the health check as such) expects a consistently configured “end-to-end” MTU, even in a stretched environment.
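You can verify the end-to-end MTU from the ESXi shell with vmkping by sending packets with the don’t-fragment bit set and a payload sized for the configured MTU. A quick sketch; vmk2 and the witness IP below are placeholders for your environment:

# Payload of 8972 bytes = 9000 MTU minus 28 bytes of IP/ICMP headers;
# -d sets the don't-fragment bit, -I selects the VSAN VMkernel interface
[root@esxi-01:~] vmkping -I vmk2 -d -s 8972 192.168.110.20

If the jumbo-sized probe fails while a 1472-byte probe (sized for an MTU of 1500) succeeds, the path does not support an MTU of 9000 end to end.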

VSAN VROps Management Pack version 6.0.3 available

Duncan Epping · Dec 17, 2015 ·

On the 15th the VROps Management Pack for VSAN 6.0.3 was released. If you have VROps Standard or higher you can take advantage of this management pack. As of this version it officially supports the latest release of VSAN, 6.1. It is very useful for finding out whether there are any anomalies and what the trends are. I’ve always loved VROps, and it just became even more useful to me!

For those who want even more info, there is also a Log Insight Content Pack for VSAN available, which can give you great insight into what is going on within your VSAN environment, for instance when there is congestion, as shown in the screenshot below (which I borrowed from Cormac).

Removing stretched VSAN configuration?

Duncan Epping · Dec 15, 2015 ·

I had a question today about how to safely remove a stretched VSAN configuration without putting any of the workloads in danger. This is fairly straightforward to be honest, but there are one or two important things to keep in mind. (For those wondering why you would want to do this: some customers played with this option, started loading workloads on top of VSAN, and then realized it was still running in stretched mode.) Here are the steps required:

  1. Click on your VSAN cluster, go to Manage, and disable the stretched configuration
    • This will remove the witness host, but will leave the two fault domains intact
  2. Remove the two remaining fault domains
  3. Go to the Monitor section, click on Health, and check the “virtual san object health”. Most likely it will be “red”, as the witness components have gone missing. By default VSAN repairs this automatically after 60 minutes, but we prefer to take step 4 as soon as possible after removing the fault domains!
  4. Click “repair object immediately”; the witness components will now be recreated and the VSAN cluster will be healthy again
  5. Click “retest” after a couple of minutes
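For those who prefer the command line, the witness can also be removed through RVC. A hedged sketch, assuming the vsan.stretchedcluster commands that shipped with vCenter 6.0 U1; the cluster path is a placeholder for your environment:

# Show the current witness configuration
> vsan.stretchedcluster.witness_info /localhost/DC/computers/VSAN-Cluster
# Remove the witness host from the stretched cluster
> vsan.stretchedcluster.remove_witness /localhost/DC/computers/VSAN-Cluster
# Check object state afterwards; absent witness components will show up here
> vsan.check_state /localhost/DC/computers/VSAN-Cluster

The fault domain cleanup and the immediate repair are then done from the Web Client as described in the steps above.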

By the way, that “repair object immediately” feature can also be used in the case of a regular host failure where components have gone absent. It is a very useful feature, especially if you don’t expect a host to return any time soon (a hardware failure, for instance) and you have the spare capacity.

VSAN Healthcheck Plugin requires DRS??

Duncan Epping · Dec 11, 2015 ·

I had a question today on my blog from a user who said that the VSAN Healthcheck Plugin was great, but unfortunately required DRS to be able to install/configure it, which means that if you have vSphere Standard you can’t use it. A very valid point, at least for the first version of the healthcheck plugin. However, that problem has been fixed for a while now. I haven’t seen anyone point it out, so I figured I would write a couple of lines about it, as I suspect more people hit this problem with the first release of the healthcheck plugin and haven’t noticed that it has been fixed.

As of VSAN Healthcheck Plugin version 6.0.1 it is no longer required to have DRS enabled (this was a bug). You can find the download link and release notes for version 6.0.1 below:

  • Download 6.0.1 – https://my.vmware.com/web/vmware/details?downloadGroup=VSANHEALTH600&productId=492
  • Release notes 6.0.1 – https://www.vmware.com/support/vsphere6/doc/vmware-virtual-san-healthcheck-601-release-notes.html

For those who aren’t using the Healthcheck yet and are running vSphere 6.0: it is highly recommended! With the newer versions of vSphere 6.0 (U1 and up) it comes included. It has some great health checks that enable you to validate the state of your VSAN cluster in a simple overview. I personally find the proactive tests very valuable, especially the “burn-in/perf” type tests, and of course the multicast test.

[Screenshot: VSAN Healthcheck Plugin]
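If you want to see the multicast traffic that the multicast test depends on with your own eyes, you can capture it from the ESXi shell. A minimal sketch, assuming the default VSAN multicast ports (12345 for the master group, 23451 for the agent group) and vmk2 as the VSAN VMkernel interface:

# Watch VSAN CMMDS multicast traffic on the VSAN VMkernel interface;
# vmk2 is a placeholder for your VSAN vmknic
[root@esxi-01:~] tcpdump-uw -i vmk2 -n udp port 12345 or udp port 23451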

And there is more coming pretty soon. I have been testing the next version in my lab, and I must say it looks great. Having all of the perf stats straight in the Web Client definitely makes life easier. Hopefully it is out soon!

