• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar

Yellow Bricks

by Duncan Epping

  • Home
  • ESXTOP
  • Stickers/Shirts
  • Privacy Policy
  • About
  • Show Search
Hide Search

7.0

Using HCI Mesh with a stand-alone vSphere host?

Duncan Epping · Apr 13, 2021 · Leave a Comment

Last week at the French VMUG we received a great question. The question was whether you can use HCI Mesh (datastore sharing) with a stand-alone vSphere Host. The answer is simple, no you cannot. VMware does not support enabling vSAN, and HCI Mesh, on a single stand-alone host. However, if you still want to mount a vSAN Datastore from a single vSphere host, there is a way around this limitation.

First, let’s list the requirements:

  1. The host needs to be managed by the same vCenter Server as the vSAN Cluster
  2. The host needs to be under the same virtual datacenter as the vSAN Cluster
  3. Low latency, high bandwidth connection between the host and the vSAN Cluster

If you meet these requirements, then what you can do to mount the vSAN Datastore to a single host is the following:

  1. Create a cluster without any services enabled
  2. Add the stand-alone host to the cluster
  3. Enable vSAN, select “vSAN HCI Mesh Compute Cluster”
  4. Mount the datastore

Note, when you create a cluster and add a host, vCenter/EAM will try to provision the vCLS VM. Of course this VM is not really needed as HA and DRS are not useful with a single host cluster. So what you can do is enable “retreat mode”. For those who don’t know how to do this, or those who want to know more about vCLS, read this article.

As I had to test the above in my lab, I also created a short video demonstrating the workflow, watch it below.

Can I vMotion a VM while IO Insight is tracing it?

Duncan Epping · Mar 4, 2021 · 2 Comments

Today during the Polish VMUG we had a great question, basically, the question was if you can vMotion a VM while vSAN IO Insight is tracing it. I did not know the answer as I had never tried it, so I had to test and validate it in the lab. While testing it became obvious that IO Insight and vMotion are not a supported combination today. Or better said, when you vMotion a VM which has IO Insight enabled and the VM is being traced, then the tracing will stop and you will not be able to inspect the results. When you click on view results you will see the error suggesting that the “monitored VMs might be deleted” as shown below.

For now, if you are tracing a VM for an extended period of time, make sure to override the DRS automation level for that VM so that DRS does not interfere with the tracing. (You can do this on a per VM basis.) I would also recommend informing other administrators to not manually migrate the VM temporarily to avoid the situation where the trace is stopped.  You may wonder why this is the case, well it is pretty simple, tracing happens on a host level. We start a user world on the host where the VM is running to trace the IO. If you move the VM, the user world doesn’t know what has happened to the VM unfortunately. For now, who knows if this is something that may change over time… Either way, I would always recommend not migrating VMs while tracing, as that also impacts the data.

Hope that helps, and thank Tomasz for the great question!

What if the disk controller driver included in my vendor’s ESXi image is not on the vSAN HCL?

Duncan Epping · Jan 15, 2021 · 7 Comments

Sometimes unfortunately there are situations where a vendor’s ESXi image includes a disk controller driver that is not on the vSAN HCL/VCG (VMware Compatibility Guide). Typically it is a new version of the driver which is supported for vSphere, but not yet for vSAN. In that situation, what should you do? So far there are two approaches I have seen customers take:

  1. Keep running with the included driver and wait for the driver to be supported and listed on the vSAN HCL/VCG
  2. Downgrade the driver to the version which is listed on the vSAN HCL/VCG

Personally, I feel that option 2 is the correct way to go. The reason is simple, vSphere and vSAN have different certification requirements for disk controllers and the vSAN certification criteria are just more stringent than vSphere’s. Hence, sometimes you see vSAN skipping certain versions of drivers, this usually means a version did not pass the tests. Now, of course, you could keep running the driver and wait for it to appear on the vSAN HCL/VC. If however, you hit a problem, VMware Support will always ask you first to bring the environment to a fully supported state. Personally, I would not want the extra stress while troubleshooting. But that is my experience and preference. Just to be clear, from a VMware stance, there’s only one option, and that is option two, downgrade to the supported version!

VMs which are not stretched in a stretched cluster, which policy to use?

Duncan Epping · Dec 14, 2020 · 2 Comments

I’ve seen this question popping up regularly. Which policy setting (“site disaster tolerance” and “failures to tolerate”) should I use when I do not want to stretch my VMs? Well, that is actually pretty straight forward, in my opinion, you really only have two options you should ever use:

  • None – Keep data on preferred (stretched cluster)
  • None – Keep data on non-preferred (stretched cluster)

Yes, there is another option. This option is called “None – Stretched Cluster” and then there’s also “None – Standard Cluster”. Why should you not use these? Well, let’s start with “None – Stretched Cluster”. In the case of “None – Stretched Cluster”, vSAN will per object decide where to place it. As you hopefully know, a VM consists of multiple objects. As you can imagine, this is not optimal from a performance point of view, as you could end up having a VMDK being placed in Site A and a VMDK being placed in Site B. Which means it would read and write from both locations from a storage point of view, while the VM would be sitting in a single location from a compute point of view. It is also not very optimal from an availability stance, as it would mean that when the intersite link is unavailable, some objects of the VM would also become inaccessible. Not a great situation. What would it look like? Well, potentially something like the below diagram!

Then there’s “None – Standard Cluster”, what happens in this case? When you use “None – Standard Cluster” with “RAID-1”, what is going to happen is that the VM is configured with FTT=1 and RAID-1, but in a stretched cluster “FTT” does not exist, and FTT automatically will become PFTT. This means that the VM is going to be mirrored across locations, and you will have SFTT=0, which means no resiliency locally. It is the same as “Dual Site Mirroring”+”No Data Redundancy”!

In summary, if you ask me, “none – standard cluster” and “none – stretched cluster” should not be used in a stretched cluster.

vSphere HA reporting not enough failover resources fault with stretched cluster failure scenario

Duncan Epping · Nov 20, 2020 · Leave a Comment

Last few months I had a couple of customers asking why vSphere HA was reporting “not enough failover resources” fault in a stretched cluster failure scenario for virtual machines that are still up and running. Now before I explain why, let’s paint a picture first to make it clear what is happening here. When you run a stretched cluster you can have a scenario where a particular VM (or multiple VMs) are not mirrored/replicated across locations. Now note, with vSAN you can specify for any given VM on a VM level how and if the VM should be available across locations. Typically you would see a VM with RAID-1 across locations, and then RAID-1/5/6 within a location. However, you can also have a scenario where a VM is not replicated across locations, but from a storage point of view only available within a location, this is depicted in the diagram below.

Now in this scenario, when Site A is somehow partitioned from Site B, you will see alarms/errors which indicate that vSphere HA has tried to restart the VM that is located in Site B in Site A and that is has failed as a result of not having enough failover resources.

This, of course, is not the result of not having sufficient failover resources, but it is the result of the fact that Site A does not have access to the required storage components to restart the VM. Basically what HA is reporting is that it doesn’t have the resources which have the ability to restart the impacted VM(s).

Now, if you have paid attention, you will probably wonder why HA tries to restart the VM in the first place, as the VM will still be running in this scenario. Why is it still running? Well the VM isn’t stretched, and this is a partition and not an isolation, which means the isolation response doesn’t kill the VM. So why restart it? Well, as Site A is partitioned from Site B, Site A does not know what the status is of Site B. Site A only knows that Site B is not responding at all, and the only thing it can do is assume the full site has failed. As a result it will attempt a failover for all VMs that were/are running in Site B and were protected by vSphere HA.

Hope that explains why this happens. If you are not sure you understand the full scenario, I recorded a quick five minute video actually walking through the scenario and explaining what happens. You can watch that below, or simply go to youtube.

  • Go to page 1
  • Go to page 2
  • Go to page 3
  • Interim pages omitted …
  • Go to page 5
  • Go to Next Page »

Primary Sidebar

About the author

Duncan Epping is a Chief Technologist in the Office of CTO of the Cloud Platform BU at VMware. He is a VCDX (# 007) and the author of the "vSAN Deep Dive" and the “vSphere Clustering Technical Deep Dive” series.

Upcoming Events

14-Apr-21 | VMUG Italy – Roadshow
26-May-21 | VMUG Egypt – Roadshow
May-21 | Australian VMUG – Roadshow

Recommended reads

Sponsors

Want to support us? Buy an advert!

Advertisements

Copyright Yellow-Bricks.com © 2021 · Log in