
Yellow Bricks

by Duncan Epping


stretched cluster

vSAN File Services and Stretched Clusters!

Duncan Epping · Mar 29, 2021 · Leave a Comment

As most of you probably know, vSAN File Services is not supported on a stretched cluster with vSAN 7.0 or 7.0 U1. However, starting with vSAN 7.0 U2, we now fully support the use of vSAN File Services on a stretched cluster configuration! What changed?

In 7.0 U2, you can specify during the configuration of vSAN File Services to which site certain IP addresses belong. In other words, you can specify the “site affinity” of your File Service services. This is shown in the screenshot below. I do want to note that this is a soft affinity rule: if the hosts or VMs on which these file services containers are running fail, the container could be restarted in the opposite location. Again, a soft rule, not a hard rule!

Of course, that is not the end of the story. You also need to be able to specify, for each share, with which location it has affinity. Again, you can do this during configuration (or edit it afterward if desired), and this sets the affinity for the file share to a location. Said differently, it ensures that when you connect to a file share, one of the file servers in the specified site will be used. This too is a soft rule: if none of the file servers are available in that site, you will still be able to use vSAN File Services, just not with the optimized data path you defined.
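To make the soft-affinity behavior concrete, here is a minimal Python sketch of the placement logic described above. It is purely illustrative: the function, host, and site names are invented for this example and are not part of vSAN File Services or any VMware API.

```python
# Illustrative sketch only: models the soft site-affinity behavior described
# above, not actual vSAN File Services code or any VMware API.

def place_file_server(preferred_site, healthy_hosts):
    """Pick a host for a file service container.

    healthy_hosts maps a site name to its list of healthy hosts.
    A soft affinity rule prefers the configured site but falls back
    to the other site, instead of failing, when the preferred site is down.
    """
    if healthy_hosts.get(preferred_site):
        return healthy_hosts[preferred_site][0]  # honor the site affinity
    # Soft rule: fall back to any remaining site rather than refusing to run.
    for site, hosts in healthy_hosts.items():
        if hosts:
            return hosts[0]
    raise RuntimeError("no healthy hosts in either site")

# Example: Site A has failed, so the container restarts in Site B.
hosts = {"SiteA": [], "SiteB": ["esx-b-01", "esx-b-02"]}
print(place_file_server("SiteA", hosts))  # -> esx-b-01
```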

Hopefully, that gives a quick overview of how you can use vSAN File Services in combination with a vSAN Stretched Cluster. I created a video to demonstrate these new capabilities; you can watch it below.

vSAN 7.0 U2 now integrates with vSphere DRS

Duncan Epping · Mar 24, 2021 · 1 Comment

One of the features our team requested a while back was integration between DRS and vSAN. The key use case we had was for stretched clusters. Especially in scenarios where a failure has occurred, it would be useful if DRS would understand what vSAN is doing. What do I mean by that?

Today when customers create a stretched cluster they have two locations. Using vSAN terminology these locations are referred to as the Preferred Fault Domain and the Secondary Fault Domain. Typically when VMs are then deployed, customers will create VM-to-Host Affinity Rules which state that VMs should reside in a particular location. When these rules are created DRS will do its best to ensure that the defined rule is adhered to. What is the problem?

Well, if you are running a stretched cluster and, let's say, one of the sites goes down, then what happens when the failed location returns for duty is the following:

  • vSAN detects the missing components are available again
  • vSAN will start the resynchronization of the components
  • DRS runs every minute, rebalances the cluster, and moves VMs based on the DRS rules

This means that the VMs for which rules are defined will be moved back to their respective location, even though vSAN is potentially still resynchronizing the data. First of all, the migration will interfere with the replication traffic. Secondly, for as long as the resync has not completed, I/O will travel across the network between the two locations; this will not only interfere with resync traffic, it will also increase latency for those workloads. So, how does vSAN 7.0 U2 solve this?

Starting with vSAN 7.0 U2 and vSphere 7.0 U2, DRS and vSAN now communicate. DRS will verify with vSAN what the state of the environment is, and it will not migrate the VMs back until the VMs are healthy again. When the VMs are healthy and the resync has completed, you will see the rules being applied and the VMs automatically migrate back (when DRS is configured as Fully Automated, that is).
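Conceptually, the change boils down to DRS asking vSAN whether a VM's objects are healthy before enforcing a VM-to-Host rule. Below is a minimal Python sketch of that logic; it is purely illustrative, and every name in it (the VM class, drs_pass, the site names) is made up for this example rather than taken from the actual vSphere/vSAN interface.

```python
# Illustrative sketch: models the DRS/vSAN handshake described above.
# None of these names exist in vSphere; they stand in for internal logic.

from dataclasses import dataclass

@dataclass
class VM:
    name: str
    current_site: str
    rule_site: str      # site the VM-to-Host affinity rule pins the VM to
    resync_done: bool   # has vSAN finished resynchronizing its objects?

def drs_pass(vms, fully_automated=True):
    for vm in vms:
        if vm.current_site == vm.rule_site:
            continue  # rule already satisfied
        if not vm.resync_done:
            # vSAN 7.0 U2 behavior: hold the migration until the VM's
            # objects are healthy again, so the vMotion does not compete
            # with resync traffic on the intersite link.
            continue
        if fully_automated:
            vm.current_site = vm.rule_site  # migrate back per the rule

vms = [VM("app01", current_site="SiteB", rule_site="SiteA", resync_done=False)]
drs_pass(vms)
print(vms[0].current_site)  # SiteB: resync not complete, migration held
vms[0].resync_done = True
drs_pass(vms)
print(vms[0].current_site)  # SiteA: rule applied after resync finished
```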

I can’t really show it with a screenshot, as this is a change in the vSAN/DRS architecture, but to make sure it worked I recorded a quick demo, which I published on YouTube. Make sure to watch the video!

VMs which are not stretched in a stretched cluster, which policy to use?

Duncan Epping · Dec 14, 2020 · 2 Comments

I’ve seen this question popping up regularly. Which policy settings (“site disaster tolerance” and “failures to tolerate”) should I use when I do not want to stretch my VMs? That is actually pretty straightforward; in my opinion, there are really only two options you should ever use:

  • None – Keep data on preferred (stretched cluster)
  • None – Keep data on non-preferred (stretched cluster)

Yes, there are two other options: “None – Stretched Cluster” and “None – Standard Cluster”. Why should you not use these? Let’s start with “None – Stretched Cluster”. In this case, vSAN decides per object where to place it. As you hopefully know, a VM consists of multiple objects. As you can imagine, this is not optimal from a performance point of view, as you could end up with one VMDK placed in Site A and another VMDK placed in Site B. That means the VM would read and write from both locations from a storage point of view, while sitting in a single location from a compute point of view. It is also not optimal from an availability point of view, as it means that when the intersite link is unavailable, some objects of the VM become inaccessible. Not a great situation. What would it look like? Potentially something like the diagram below!

Then there’s “None – Standard Cluster”; what happens in this case? When you use “None – Standard Cluster” with “RAID-1”, the VM is configured with FTT=1 and RAID-1, but in a stretched cluster “FTT” does not exist, and FTT automatically becomes PFTT. This means that the VM is going to be mirrored across locations, and you will have SFTT=0, which means no resiliency locally. It is the same as “Dual Site Mirroring” + “No Data Redundancy”!
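For illustration, here is a small Python sketch that models why these two options are problematic; it is not vSAN code, and the object and site names are invented for this example.

```python
# Illustrative sketch only (not vSAN code): the effective behavior of the
# two remaining "None" options in a stretched cluster.

import random

def none_stretched_cluster(vm_objects):
    """With "None - Stretched Cluster", vSAN places each object
    independently, so one VMDK can land in Site A and another in Site B."""
    return {obj: random.choice(["SiteA", "SiteB"]) for obj in vm_objects}

def none_standard_cluster():
    """With "None - Standard Cluster" and RAID-1, FTT is interpreted as
    PFTT: the VM is mirrored across sites with no local resiliency."""
    return {"PFTT": 1, "SFTT": 0, "layout": "RAID-1 mirrored across sites"}

print(none_stretched_cluster(["namespace", "vmdk-0", "vmdk-1"]))
print(none_standard_cluster())
```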

In summary, if you ask me, “None – Standard Cluster” and “None – Stretched Cluster” should not be used in a stretched cluster.

vSphere HA reporting not enough failover resources fault with stretched cluster failure scenario

Duncan Epping · Nov 20, 2020 · Leave a Comment

Over the last few months, I had a couple of customers asking why vSphere HA was reporting a “not enough failover resources” fault in a stretched cluster failure scenario for virtual machines that were still up and running. Before I explain why, let’s paint a picture to make clear what is happening here. When you run a stretched cluster, you can have a scenario where a particular VM (or multiple VMs) is not mirrored/replicated across locations. Note that with vSAN you can specify per VM whether, and how, the VM should be available across locations. Typically you would see a VM with RAID-1 across locations, and then RAID-1/5/6 within a location. However, you can also have a scenario where a VM is not replicated across locations but is, from a storage point of view, only available within one location; this is depicted in the diagram below.

Now in this scenario, when Site A is somehow partitioned from Site B, you will see alarms/errors indicating that vSphere HA has tried to restart the VM located in Site B in Site A, and that it has failed as a result of not having enough failover resources.

This, of course, is not the result of insufficient failover resources, but of the fact that Site A does not have access to the storage components required to restart the VM. Basically, what HA is reporting is that it doesn’t have resources which are able to restart the impacted VM(s).

Now, if you have paid attention, you will probably wonder why HA tries to restart the VM in the first place, as the VM will still be running in this scenario. Why is it still running? Well, the VM isn’t stretched, and this is a partition and not an isolation, which means the isolation response doesn’t kill the VM. So why restart it? As Site A is partitioned from Site B, Site A does not know what the status of Site B is. Site A only knows that Site B is not responding at all, and the only thing it can do is assume the full site has failed. As a result, it will attempt a failover for all VMs that were running in Site B and were protected by vSphere HA.
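To summarize the decision logic, here is a minimal Python sketch of what the HA master in Site A effectively does during the partition. It is purely illustrative; the function and site names are invented rather than taken from vSphere HA.

```python
# Illustrative sketch of the partition scenario described above: the HA
# master in Site A cannot distinguish "Site B is down" from "Site B is
# partitioned", so it attempts a restart, which then fails because the
# VM's storage objects are only accessible in Site B.

def ha_master_in_site_a(vm_storage_site, site_b_reachable):
    if site_b_reachable:
        return "no action: Site B reports the VM as running"
    # A partition and a full site failure look identical from Site A,
    # so HA must assume the worst and attempt a failover.
    if vm_storage_site != "SiteA":
        # The storage components live only in Site B, so the restart
        # fails; HA surfaces this as "not enough failover resources".
        return "restart attempted -> fails: no access to the VM's objects"
    return "restart attempted -> succeeds"

print(ha_master_in_site_a(vm_storage_site="SiteB", site_b_reachable=False))
```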

Hope that explains why this happens. If you are not sure you understand the full scenario, I recorded a quick five-minute video walking through the scenario and explaining what happens. You can watch it below, or simply go to YouTube.

Creating a vSAN 7.0 Stretched Cluster

Duncan Epping · Mar 31, 2020 · Leave a Comment

A while ago I wrote this article and created this demo showing the creation of a vSAN Stretched Cluster. I bumped into the article/video today when I was looking for something and figured it was time to recreate the vSAN Stretched Cluster demo. I recorded it today, using vSphere/vSAN 7, and just wanted to share it with you here. Hope you enjoy it.
