
Yellow Bricks

by Duncan Epping


Storage

VMs which are not stretched in a stretched cluster, which policy to use?

Duncan Epping · Dec 14, 2020 · 2 Comments

I’ve seen this question pop up regularly: which policy settings (“site disaster tolerance” and “failures to tolerate”) should I use when I do not want to stretch my VMs? Well, that is actually pretty straightforward. In my opinion, you really only have two options you should ever use:

  • None – Keep data on preferred (stretched cluster)
  • None – Keep data on non-preferred (stretched cluster)

Yes, there are other options. One is called “None – Stretched Cluster”, and then there’s also “None – Standard Cluster”. Why should you not use these? Well, let’s start with “None – Stretched Cluster”. With this option, vSAN decides per object where to place it, and as you hopefully know, a VM consists of multiple objects. As you can imagine, this is not optimal from a performance point of view, as you could end up with one VMDK placed in Site A and another VMDK placed in Site B. That means the VM would read from and write to both locations from a storage point of view, while sitting in a single location from a compute point of view. It is also not optimal from an availability standpoint, as it means that when the inter-site link is unavailable, some objects of the VM become inaccessible. Not a great situation. What would it look like? Well, potentially something like the below diagram!

Then there’s “None – Standard Cluster”; what happens in this case? When you use “None – Standard Cluster” with “RAID-1”, the VM is configured with FTT=1 and RAID-1, but in a stretched cluster plain “FTT” does not exist, so FTT automatically becomes PFTT. This means that the VM is mirrored across locations while SFTT=0, which means no resiliency locally. It is the same as “Dual Site Mirroring” + “No Data Redundancy”!
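To make it easier to compare the four options, here is a quick illustrative sketch in Python (plain data only, not an SPBM or vSAN API call; the field names are mine, purely for illustration) summarizing what each “site disaster tolerance” choice effectively does:

# Illustrative summary of the "site disaster tolerance" options discussed above.
# Plain data for comparison only; the field names are made up, this is not an SPBM/vSAN API call.
SITE_DISASTER_TOLERANCE = {
    "None - keep data on preferred (stretched cluster)": {
        "mirrored_across_sites": False,
        "placement": "all objects pinned to the preferred site",
        "use_for_non_stretched_vms": True,
    },
    "None - keep data on non-preferred (stretched cluster)": {
        "mirrored_across_sites": False,
        "placement": "all objects pinned to the non-preferred site",
        "use_for_non_stretched_vms": True,
    },
    "None - stretched cluster": {
        "mirrored_across_sites": False,
        "placement": "vSAN decides per object, so objects may land in either site",
        "use_for_non_stretched_vms": False,  # unpredictable performance and availability
    },
    "None - standard cluster (RAID-1, FTT=1)": {
        "mirrored_across_sites": True,   # FTT becomes PFTT: dual-site mirroring
        "placement": "one copy per site, SFTT=0, so no local resiliency",
        "use_for_non_stretched_vms": False,
    },
}

for option, effect in SITE_DISASTER_TOLERANCE.items():
    print("%s -> %s" % (option, effect["placement"]))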

In summary, if you ask me, “None – Standard Cluster” and “None – Stretched Cluster” should not be used in a stretched cluster.

vSphere HA configuration for HCI Mesh!

Duncan Epping · Oct 29, 2020 · 7 Comments

I wrote a vSAN HCI Mesh Considerations blog post a few weeks ago. Based on that post I received some questions, and one of the questions was around vSphere HA configurations. Interestingly, I also had some internal discussions around how vSAN HCI Mesh and HA are integrated. Based on those discussions, I did some testing just to validate my understanding of the implementation.

Now, when it comes to vSphere HA and vSAN, the majority of you will be following the vSAN Design Guide and understand that having HA enabled is crucial for vSAN. Also crucial for vSAN is configuring the Isolation Response, and of course setting the correct Isolation Address. However, so far there has been one HA feature which you did not have to configure for vSAN and HA to function correctly, and that feature is VM Component Protection, aka the APD/PDL responses.

Now, this changes with HCI Mesh. Specifically for HCI Mesh, the HA and vSAN teams have worked together to detect APD (all paths down) scenarios! When would this happen? Well, if you look at the below diagram you can see that we have “Client Clusters” and a “Server Cluster”. The “Client Cluster” consumes storage from the “Server Cluster”. If, for whatever reason, a host in the “Client Cluster” loses access to the “Server Cluster”, the VMs on that host which consume storage on the “Server Cluster” lose access to their datastore. This is essentially an APD (all paths down) scenario.

Now, to ensure the VMs are protected by HA in this situation, you only need to enable the APD response. This is very straightforward. You simply go to the HA cluster settings and set the “Datastore with APD” setting to either “Power off and restart VMs – Conservative” or “Power off and restart VMs – Aggressive”. The difference between conservative and aggressive is that with conservative, HA will only kill the VMs when it knows for sure the VMs can be restarted, whereas with aggressive it will also kill the VMs on a host impacted by an APD even when it isn’t sure it can restart them. Most customers will use the “Conservative” restart policy, by the way.
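For those who prefer to configure this through the API instead of the UI, below is a minimal pyVmomi sketch that enables VM Component Protection and sets the APD response to the conservative restart policy. Treat it as a sketch: the vCenter hostname, credentials, and the “Client-Cluster” name are placeholders, and in a real script you would add proper certificate handling and wait for the reconfiguration task to complete.

# Minimal pyVmomi sketch (placeholder vCenter, credentials and cluster name):
# enable vSphere HA VM Component Protection and set the APD response to
# "Power off and restart VMs - Conservative" on the cluster.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use proper certificates in production
si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Client-Cluster")

    vmcp = vim.cluster.VmComponentProtectionSettings(
        vmStorageProtectionForAPD="restartConservative",  # the conservative APD response
        enableAPDTimeoutForHosts=True)
    das_config = vim.cluster.DasConfigInfo(
        enabled=True,                      # vSphere HA enabled
        vmComponentProtecting="enabled",   # turn on VM Component Protection
        defaultVmSettings=vim.cluster.DasVmSettings(
            vmComponentProtectionSettings=vmcp))
    spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)

    cluster.ReconfigureComputeResource_Task(spec, modify=True)
    # A real script would wait for this task to finish and check the result.
finally:
    Disconnect(si)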

As I also mentioned in the HCI Mesh Considerations blog, one thing I would like to call out is the timing for this APD scenario: the APD is declared after 60 seconds, after which the APD response (restart) is triggered automatically after another 180 seconds. Mind that this is different from an APD with traditional storage, where it takes 140 seconds before the APD is declared. You can, of course, see in the log file that an APD is detected, declared, and that VMs are killed as a result. Note that the “fdm.log” is quite verbose, so I copied only the relevant lines from my tests.

APD detected for remote vSAN Datastore /vmfs/volumes/vsan:52eba6db0ade8dd9-c04b1d8866d14ce5
Go to terminate state for VM /vmfs/volumes/vsan:52eba6db0ade8dd9-c04b1d8866d14ce5/a57d9a5f-a222-786a-19c8-0c42a162f9d0/YellowBricks.vmx due to APD timeout (CheckCapacity:false)
Failover operation in progress on 1 Vms: 1 VMs being restarted, 0 VMs waiting for a retry, 0 VMs waiting for resources, 0 inaccessible vSAN VMs.
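To put those two timers side by side, here is a quick back-of-the-envelope comparison. This is a sketch only; it assumes the same default 180-second response delay applies after the APD declaration in both cases:

# Back-of-the-envelope timing comparison (assumption: the default 180-second
# response delay applies after the APD declaration in both cases).
APD_DECLARE_HCI_MESH_SEC = 60      # remote vSAN datastore (HCI Mesh)
APD_DECLARE_TRADITIONAL_SEC = 140  # traditional storage
RESPONSE_DELAY_SEC = 180           # delay before the restart response kicks in

print("HCI Mesh:    ~%d seconds until the restart is triggered"
      % (APD_DECLARE_HCI_MESH_SEC + RESPONSE_DELAY_SEC))       # ~240 seconds
print("Traditional: ~%d seconds until the restart is triggered"
      % (APD_DECLARE_TRADITIONAL_SEC + RESPONSE_DELAY_SEC))    # ~320 seconds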

Now, for those wondering whether it actually works: of course, I tested it a few times and recorded a demo, which you can watch on YouTube (easier to follow in full screen) or by clicking play below. (Make sure to subscribe to the channel for the latest videos!)

I hope this helps!

vSAN 7.0 U1 File Services with SMB and NFS support demo

Duncan Epping · Sep 21, 2020 · 11 Comments

I created this quick demo last week, and I figured I would share it here. It shows vSAN 7.0 U1 File Services with SMB and NFS support. I wrote about vSAN File Services and what is new in this post here; make sure to read that as well, as it also details all the other changes introduced in vSAN 7.0 U1.

What’s new for vSAN 7.0 U1!?

Duncan Epping · Sep 15, 2020 · 4 Comments

Every 6-9 months VMware has been pushing out a new feature release of vSAN. After vSphere and vSAN 7.0, which introduced vSphere Lifecycle Manager and vSAN File Services, it is now time to share with you what is new for vSAN 7.0 U1. Again it is a feature-packed release, with many “smaller” enhancements, but also the introduction of some bigger functionality. Let’s just list the key features that have just been announced, and then discuss each of these individually. You’d better sit down, as this is going to be a long post. Oh, and note that this is an announcement, not the actual availability of vSAN 7.0 U1; for that you will have to wait some time.

  • vSAN HCI Mesh
  • vSAN Data Persistence Platform
  • vSAN Direct Configuration
  • vSAN File Services – SMB support
  • vSAN File Services – Performance enhancements
  • vSAN File Services – Scalability enhancements
  • vSAN Shared Witness
  • Compression-only
  • Data-in-transit encryption
  • Secure wipe
  • vSAN IO Insight
  • Effective capacity enhancements
  • Enhanced availability during maintenance mode
  • Faster host restarts
  • Enhanced pre-check for vSAN maintenance mode
  • Ability to override default gateway through the UI
  • vLCM support for Lenovo


Host in vSAN cluster with 0 components while other hosts are almost full?

Duncan Epping · Sep 3, 2020 · Leave a Comment

Internally someone just bumped into an issue where a single host in a cluster wasn’t storing any of the created vSAN Components / Objects. It was to the point where every single host in the cluster was close to the maximum of 9000 components, but that one host had 0 components. After some quick back and forth the following message stood out in the UI:

vSAN node decommission state

What does this mean? Well, basically it means that from a vSAN standpoint this host is in maintenance mode. For whatever reason, the host itself was not in maintenance mode from a hypervisor standpoint, which means that the two were out of sync. This can simply be resolved by SSH’ing into the respective host and running the following command:

localcli vsan maintenancemode cancel

One thing to consider, of course, is to trigger a rebalance of the cluster after taking the host out of the decommissioned state when the environment is running 6.7 U2 or lower, as that results in a more equally balanced environment. Starting with 6.7 U3, this process is initiated automatically when configured. There is a KB that describes how to trigger and/or configure this, which can be found here.
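If you want to quickly cross-check the hypervisor side of this for every host in the cluster, the following minimal pyVmomi sketch simply prints the maintenance mode flag per host (the vCenter hostname, credentials, and the cluster name “vSAN-Cluster” are placeholders). If a host reports inMaintenanceMode=False here while the UI shows the vSAN node decommission state, you are looking at exactly the out-of-sync situation described above:

# Minimal pyVmomi sketch (placeholder vCenter, credentials and cluster name):
# list the hosts in a cluster and show whether the hypervisor itself
# believes it is in maintenance mode, to spot a mismatch with vSAN.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "vSAN-Cluster")
    for host in cluster.host:
        print("%s  inMaintenanceMode=%s" % (host.name, host.runtime.inMaintenanceMode))
finally:
    Disconnect(si)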

