
Yellow Bricks

by Duncan Epping


BC-DR

vSphere HA deepdive 6.x available for free!

Duncan Epping · Feb 18, 2016 ·

I’ve been discussing this with Frank over the last 12 months, and to be honest we are still not sure it is the right thing to do, but we decided to take this step anyway. Over the past couple of years we released various updates of the vSphere Clustering Deepdive. Updating the book was sometimes a massive pain (version 4 to 5, for instance), but some of the minor updates have been relatively straightforward, although still time consuming due to formatting, diagrams, screenshots and so on.

Ever since, we’ve been looking for new ways to distribute our book, or publication as I will refer to it from now on. I’ve looked at various options and found one which I felt was the best of all worlds: Gitbook. Gitbook is a solution that allows you as an author to develop content in Markdown and distribute it in various formats: static HTML, PDF, ePub or Mobi, basically any format you would want in this day and age. The great thing about the platform is that it integrates with GitHub, so you can share your source there and get things like version control. It works in such a way that I can use the Gitbook client on my Mac, while someone who wants to contribute or submit a change can simply use their client of choice and submit that change through git. Although I don’t expect too many people to do this, it will make it easier for me to have material reviewed, for instance by one of the VMware engineers.
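For those wondering what submitting a change would look like, it is the standard git/GitHub flow. A minimal sketch, where the repository URL, branch name and file are placeholders and not the actual location of the source:

git clone https://github.com/<account>/<ha-deepdive-repo>.git
cd <ha-deepdive-repo>
git checkout -b fix-typo-admission-control
# edit the Markdown file in question, then commit and push:
git commit -am "Fix typo in the admission control chapter"
git push origin fix-typo-admission-control
# finally, open a pull request on GitHub so the change can be reviewed and merged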

So what did I just make available for free? Well, in short: an updated version (vSphere 6.0 Update 1) of the vSphere HA Deepdive. This includes the stretched clustering section of the book. Note that DRS and SDRS have not been included (yet); this may or may not happen in some shape or form in the future. For now, I hope you will enjoy and appreciate the content. You can access it by clicking “HA Deepdive” on the left, or, for a better reading experience (in my opinion), read it on Gitbook directly through this link: ha.yellow-bricks.com.

Note that there are also links to download the content in different formats, for those who want to read it on their iPad / phone / whatever. Also note that Gitbook allows you to comment on a paragraph by clicking the “+” sign that appears on the right side of the paragraph when you hover over it… Please submit feedback when you see mistakes! And for those who are really active: you could even contribute to the content! I will keep updating the content over the upcoming months, probably with more info on VVols and the Advanced Settings for instance, so keep checking back regularly!

SMP-FT and (any type of) stretched storage support

Duncan Epping · Jan 19, 2016 ·

I had a question today around support for SMP-FT in an EMC VPLEX environment. It is well known that SMP-FT isn’t supported in a stretched VSAN environment, but what about other types of stretched storage? Is that a VSAN-specific constraint? Legacy FT, after all, appears to be supported for VPLEX and other types of stretched storage.

SMP-FT is not supported in a vSphere Metro Storage Cluster environment either! It simply has not been qualified yet. I’ve asked the FT team to at least put it on the roadmap and to document the maximum latency tolerated for SMP-FT in these types of environments, just in case someone would want to use it in a campus situation for instance, despite the high bandwidth requirements of SMP-FT. Note that “legacy FT” can be used in a vMSC environment, but not with VSAN. In order to use legacy FT (single vCPU) you will need to set an advanced VM setting: vm.uselegacyft. Make sure to set this when using FT in a stretched environment!
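As a quick sketch of what that looks like: the setting is a per-VM configuration parameter, so it should end up in the VM’s .vmx file once added (through the Web Client, or directly in the .vmx while the VM is powered off). The datastore path and VM name below are placeholders:

[root@esxi-01:~] grep -i uselegacyft /vmfs/volumes/datastore1/ftvm/ftvm.vmx
vm.uselegacyft = "true"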

Disable VSAN site locality in low latency stretched cluster

Duncan Epping · Jan 15, 2016 ·

This week I was talking to a customer in Germany who had deployed a VSAN stretched cluster within a single building. As it was all within one building (extremely low latency) and they preferred a very simple operational model, they decided not to implement any type of VM/Host rules. By default, when a stretched cluster is deployed in VSAN (and ROBO deployments use this workflow as well), “site locality” is implemented for caching. This means that a VM will have its read cache on the host holding the components in the site where the VM is located.

This is great of course, as it avoids incurring a latency hit for reads. In some cases, however, you may not want this behaviour, for instance in the situation above where there is an extremely low latency connection between the different rooms in the same building. Because no VM/Host rules are implemented in this case, a VM can freely roam around the cluster, and whenever a VM moves between VSAN fault domains the cache needs to be rewarmed, as reads are only served locally. Fortunately you can easily disable this behaviour through the advanced setting DOMOwnerForceWarmCache:

# check the current value (0 means site locality is in effect and reads are served locally)
[root@esxi-01:~] esxcfg-advcfg -g /VSAN/DOMOwnerForceWarmCache
Value of DOMOwnerForceWarmCache is 0
# set it to 1 to disable the site locality behaviour for reads
[root@esxi-01:~] esxcfg-advcfg -s 1 /VSAN/DOMOwnerForceWarmCache
Value of DOMOwnerForceWarmCache is 1

In a stretched environment you will see that this setting is set to 0; set it to 1 to disable the behaviour. In a ROBO environment VM migrations are uncommon, but if they do happen on a regular basis you may also want to look into changing this setting.
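Note that esxcfg-advcfg changes the setting on the host you run it on, so you will want to repeat it on every host in the cluster. A minimal sketch, assuming SSH is enabled on the hosts and using placeholder hostnames:

for host in esxi-01 esxi-02 esxi-03 esxi-04; do
  ssh root@${host} "esxcfg-advcfg -s 1 /VSAN/DOMOwnerForceWarmCache"
done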

Where do I run my VASA Vendor Provider for vVols?

Duncan Epping · Jan 6, 2016 ·

I was talking to someone before the start of the holiday season about running the Vendor Provider (VP) for vVols as a VM and what the best practices are around that. I was thinking about the implications of the VP not being available and came to the conclusion that when the VP is unavailable a number of things stop working, of which “bind” is probably the most important.

The “bind” operation is what allows vSphere to access a given Virtual Volume (vVol), and this operation is issued during a power-on of a VM. This is how the vVols FAQ describes it:

When a vVol is created, it is not immediately accessible for IO. To access a vVol, vSphere needs to issue a “Bind” operation to a VASA Provider (VP), which creates an IO access point for the vVol on a Protocol Endpoint (PE) chosen by the VP. A single PE can be the IO access point for multiple vVols. An “Unbind” operation will remove this IO access point for a given vVol.

This means that when the VP is unavailable, you can’t power on VMs at that particular time. For many storage systems this problem is mitigated by having the VP be part of the storage system itself, and of course there is the option to have multiple VPs as part of your solution, either in an active/active or an active/standby configuration. In the case of VSAN, for instance, each host has a VASA provider, of which one is active and the others are standby; if the active one fails, a standby takes over automatically. So to be clear, it is up to the vendor to decide what type of availability to provide for the VP: some have decided to go for a single instance and rely on vSphere HA to restart the appliance, others have created an active/standby configuration, and so on.
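If you are curious which VASA providers a given host knows about and which protocol endpoints it can reach, the esxcli storage vvol namespace can show you. A quick sketch (output omitted here, as it differs per array and configuration):

[root@esxi-01:~] esxcli storage vvol vasaprovider list
[root@esxi-01:~] esxcli storage vvol protocolendpoint list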

But back to vVols, what if you own a storage system that requires an external VASA VP as a VM?

  • Run your VP VMs in a management cluster; if the hosts in the “production” cluster are impacted and VMs are restarted, at least the VP VMs should be up and running in your management cluster
  • Use multiple VP VMs if and when possible; if active/active or active/standby is supported, make sure to run your VPs in that configuration
  • Do not use vVols for the VP itself; you don’t want any (circular) dependency between the availability of the VP and being able to power on the VP itself
  • If there is no availability story for the VP, then depending on the configuration of the appliance, vSphere FT should be considered

One more thing: if you are considering buying new storage, I think one question you definitely need to ask your vendor is what their story is around the VP. Is it a VM or is it part of the storage system itself? Is there an availability story for the VP, and if so, is it “active/active” or “active/standby”? If not, what do they have on their roadmap around this? You are probably also asking yourself what VMware has planned to solve this problem; there are a couple of things cooking and I can’t say too much about it. One important effort, though, is the inclusion of bind/unbind in the T10 SCSI standard, which would allow us to power on new VMs even when the VP is unavailable, as bind would simply be a SCSI command. As you can imagine, those things take time. Until then, when you design a vVols environment, take the above into account when it comes to your Vendor Provider aka VP!

Removing stretched VSAN configuration?

Duncan Epping · Dec 15, 2015 ·

I had a question today around how to safely remove a stretched VSAN configuration without putting any of the workloads in danger. This is fairly straightforward to be honest, although there are one or two things which are important. (For those wondering why you would want to do this: some customers played with this option, started loading workloads on top of VSAN, and then realized it was still running in stretched mode.) Here are the steps required:

  1. Click on your VSAN cluster, go to Manage, and disable the stretched configuration
    • This will remove the witness host, but will leave the 2 fault domains intact
  2. Remove the two remaining fault domains
  3. Go to the Monitor section, click on Health, and check the “virtual san object health”. Most likely it will be “red” as the “witness components” have gone missing. VSAN will repair this automatically by default after 60 minutes, but we prefer to take step 4 as soon as possible after removing the fault domains!
  4. Click “repair object immediately”; the witness components will now be recreated and the VSAN cluster will be healthy again
  5. Click “retest” after a couple of minutes

By the way, that “repair object immediately” feature can also be used in the case of a regular host failure where “components” have gone absent. Very useful feature, especially if you don’t expect a host to return any time soon (hardware failure for instance) and have the spare capacity.
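If you want to double-check from the command line that the stretched configuration and fault domains are really gone, each host can be queried through esxcli. A small sketch (hostname is a placeholder); once the fault domains have been removed, the output should no longer show the stretched fault domain name for the host:

[root@esxi-01:~] esxcli vsan cluster get
[root@esxi-01:~] esxcli vsan faultdomain get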

